The rapid growth of video content has created a need for advanced systems to process and understand this complex data. Video understanding is a critical field in AI, where the goal is to enable machines to comprehend, analyze, and interact with video content, much like humans.

In this post, we’ll build and analyze two practical pipelines:
- Video Content Moderation with CLIP + Gemini
- Video Summarization with Qwen2.5-VL
Finally, we’ll show how the two pipelines fit together into one end-to-end solution: first ensuring a video is safe, then automatically summarizing it. This mirrors real-world use cases in social media, e-learning, corporate training, and compliance. Here’s what we’ll cover:
- The Model Foundations: CLIP, Gemini, and Qwen
- Part 1: Video Content Moderation with CLIP and Gemini
- Part 2: Video Summarization with Qwen2.5-VL
- Part 3: Combining Moderation + Summarization
- Results
- Conclusion
- References
The Model Foundations: CLIP, Gemini, and Qwen
CLIP (Contrastive Language–Image Pretraining)
CLIP, introduced by OpenAI, is trained on 400M image-text pairs. It aligns images and text in the same embedding space:
- If an image and caption match, their embeddings are close.
- If they don’t match, they are pushed apart.
This enables zero-shot classification: you provide natural language prompts (e.g., “this image contains nudity”), and CLIP compares them with the image. No task-specific fine-tuning is needed.
Strengths for moderation:
- Add/remove categories easily by editing prompts (see the snippet after this list).
- Generalizes to new scenarios without retraining.
- Efficient at inference time.
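To illustrate the first strength, the candidate labels are just plain-English prompts, so adding a category is a one-line change. The list below is a hypothetical illustration; the actual prompts used by the moderation script appear in Part 1:

# Candidate labels are natural-language prompts; extending coverage means adding a string.
# Illustrative only: the last entry is a hypothetical new category, not from the original script.
moderation_prompts = [
    "this image is normal",
    "this image contains nudity",
    "this image contains violence",
    "this image contains graphic medical content",  # hypothetical new category
]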
Gemini
Gemini is Google’s family of large multimodal models. In our moderation pipeline, Gemini is used not to classify but to explain why content was flagged. This explanation layer improves trust, auditability, and regulatory compliance.
Qwen2.5-VL
Qwen2.5-VL, developed by Alibaba, is a vision-language model optimized for multimodal tasks. Unlike CLIP (alignment), Qwen is a generative model: it takes video input and outputs structured summaries.
Key capabilities:
- Handles videos directly (via sampled frames).
- Instruction following behavior ensures controllable outputs.
- Can produce both bullet points and paragraph summaries.
Together, these models cover the detect → explain → summarize pipeline we need.
Part 1: Video Content Moderation with CLIP and Gemini
We begin with the moderation script. This script extracts frames from a video, classifies each using CLIP, and then uses Gemini to generate a concise explanation for unsafe content.
Imports
import torch
import gradio as gr
from PIL import Image
import cv2
import google.generativeai as genai
from transformers import CLIPProcessor, CLIPModel
import json
import io

- torch: tensor operations, GPU acceleration.
- gradio: interactive interface.
- PIL.Image: image representation.
- cv2: frame extraction from videos.
- google.generativeai: Gemini client.
- CLIPProcessor, CLIPModel: Hugging Face model and processor for CLIP.
- json: structured output for flagged frames.
- io: in-memory buffers for encoding flagged frames as JPEG bytes before sending them to Gemini.
genai.configure(api_key="...")
Configures Gemini with an API key. In production, store the key in an environment variable (for example, loaded from a .env file) rather than hard-coding it.
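A minimal sketch of that approach, assuming the python-dotenv package and a GEMINI_API_KEY entry in your .env file (both are illustrative choices, not part of the original script):

import os
import google.generativeai as genai
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads variables from a local .env file into the environment
genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # assumed variable name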
Loading CLIP
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
- Loads the CLIP ViT-B/32 model.
- processor tokenizes text prompts and preprocesses images.
Extracting Frames
def extract_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    frames = []
    idx = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append((idx, Image.fromarray(rgb)))
        idx += 1
    cap.release()
    return frames
- Opens a video file with OpenCV.
- Reads each frame, converts it to RGB, and wraps it as a PIL.Image.
- Stores frames as (frame_index, image) pairs.
- Returns the full list of frames.
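A quick usage sketch (the file path is a placeholder):

frames = extract_frames("sample.mp4")  # hypothetical local video file
print(f"Extracted {len(frames)} frames; first entry: index {frames[0][0]}, size {frames[0][1].size}")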
Frame Classification with CLIP
def classify_frame_with_clip(frame: Image.Image):
    text_inputs = [
        "this image is normal",
        "this image contains nudity",
        "this image contains enticing or sensual content",
        "this image contains violence"
    ]
    inputs = processor(text=text_inputs, images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        output = model(**inputs)
    probs = output.logits_per_image.softmax(dim=-1)
    confidence, predicted_class = probs.max(dim=-1)
    labels = ["normal", "nudity", "enticing or sensual", "violent"]
    return labels[predicted_class.item()], confidence.item()
- Defines candidate labels as natural language.
- Processor encodes text + frame.
- CLIP computes similarity logits.
- softmax converts logits to probabilities.
- Returns the most probable label and its confidence.
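For example, classifying the first extracted frame (reusing the frames list from the previous step):

idx, first_frame = frames[0]
label, confidence = classify_frame_with_clip(first_frame)
print(f"Frame {idx}: {label} ({confidence:.2f})")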
Explanations with Gemini
def get_gemini_explanation(images: list, flagged_frames_: list) -> str:
    llm = genai.GenerativeModel("gemini-2.5-flash")
    image_parts = []
    for img in images:
        buffer = io.BytesIO()
        img.save(buffer, format="JPEG")
        image_parts.append({"mime_type": "image/jpeg", "data": buffer.getvalue()})
    prompt = f"""
    You are given a set of image frames ... {json.dumps(flagged_frames_, indent=2)}
    Analyze the flagged images and explain clearly in no more than 2 lines.
    """
    response = llm.generate_content([prompt] + image_parts)
    return response.parts[0].text.strip()
- Converts flagged frames into JPEG bytes.
- Builds a prompt with metadata + images.
- Gemini generates a concise explanation.
Moderation Orchestration
def moderate_video(video_file):
frames = extract_frames(video_file)
flagged_frames, img_list = [], []
for idx, img in frames:
label, confidence = classify_frame_with_clip(img)
if label != "normal" and confidence > 0.5:
flagged_frames.append({"frame_id": idx, "classification": label, "confidence": confidence})
img_list.append(img)
if not flagged_frames:
return "All frames are appropriate.", "None", {}
explanation = get_gemini_explanation(img_list, flagged_frames)
final_perct = (len(img_list)/len(frames))*100
return explanation, final_perct, flagged_frames
- Extracts frames.
- Classifies each frame.
- Stores flagged frames above confidence threshold.
- If unsafe, asks Gemini for explanation.
- Computes flagged percentage.
Gradio Interface
iface = gr.Interface(
    fn=moderate_video,
    inputs=gr.Video(label="Upload a Video"),
    outputs=[
        gr.Textbox(label="Why is it Flagged?"),
        gr.Textbox(label="NSFW Percentage"),
        gr.JSON(label="Flagged Frames")
    ],
    title="Video Content Moderation with CLIP and Gemini",
    description="Detects unsafe content in a video and provides reasoning."
)
iface.launch(share=True)
- Interactive demo with video upload.
- Shows explanation, percentage, and JSON.
Efficiency Improvements
Currently, CLIP re-encodes text prompts for every frame. Since prompts are fixed, we can precompute embeddings:
with torch.no_grad():
    text_inputs = [
        "this image is normal",
        "this image contains nudity",
        "this image contains enticing or sensual content",
        "this image contains violence"
    ]
    text_batch = processor(text=text_inputs, return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**text_batch)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
For each frame, compute only the image embedding, then compare with cached text embeddings. This reduces redundant compute, especially for long videos.
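A sketch of what the per-frame step could then look like, reusing the cached text_embeds and the label list from earlier (this mirrors, but is not identical to, the original classify_frame_with_clip):

def classify_frame_cached(frame: Image.Image):
    # Encode only the image; the text prompts were embedded once above.
    with torch.no_grad():
        img_batch = processor(images=frame, return_tensors="pt")
        img_embeds = model.get_image_features(**img_batch)
    img_embeds = img_embeds / img_embeds.norm(dim=-1, keepdim=True)
    # CLIP scales cosine similarities by a learned temperature before softmax.
    logits = model.logit_scale.exp() * img_embeds @ text_embeds.T
    probs = logits.softmax(dim=-1)
    confidence, predicted_class = probs.max(dim=-1)
    labels = ["normal", "nudity", "enticing or sensual", "violent"]
    return labels[predicted_class.item()], confidence.item()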
Part 2: Video Summarization with Qwen2.5-VL
Now that we can detect unsafe content, let’s summarize safe videos with Qwen2.5-VL.
Imports
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info
from pathlib import Path
import json
- AutoProcessor: handles multimodal input encoding.
- Qwen2_5_VLForConditionalGeneration: instruction-tuned model for summarization.
- process_vision_info: prepares video data for the model.
- Path, json: file-path handling and parsing of the model's JSON output in the helpers below.
Summarization Function
def summarize_video(video_path: str, max_new_tokens: int = 256) -> str:
    model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
    processor = AutoProcessor.from_pretrained(model_id)
    vpath = Path(video_path).resolve()
    messages = [{
        "role": "user",
        "content": [
            {"type": "video", "video": f"file://{vpath}"},
            {"type": "text", "text": (
                "Summarize this video. Return ONLY valid JSON with two keys:\n"
                '"bullets": a list of 5-7 concise bullet points,\n'
                '"paragraph": a short 120-word paragraph summary.\n'
            )},
        ],
    }]
- Loads Qwen model + processor.
- Builds a chat-style message with video and summarization instructions.
Preprocessing and Generation
    # (continuing inside summarize_video)
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)]
    out_text = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
    return out_text
- Converts messages into model-ready inputs.
- Runs generation, trimming prompt tokens.
- Decodes into raw text (expected to be JSON).
Parsing JSON
def try_parse_json(text: str):
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        import re
        # Fall back to extracting the first {...} block from the raw text.
        m = re.search(r"\{[\s\S]*\}", text)
        return json.loads(m.group(0)) if m else None
- First attempts strict json.loads.
- Falls back to regex extraction of the first JSON object if strict parsing fails.
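For example, if the model wraps its JSON in a Markdown fence, the fallback still recovers it:

raw = 'Here is the summary:\n```json\n{"bullets": ["..."], "paragraph": "..."}\n```'
parsed = try_parse_json(raw)  # strict parse fails; the regex pulls out the {...} block
print(parsed["bullets"])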
Writing Outputs
def write_outputs(obj_or_text, out_path: Path):
    if isinstance(obj_or_text, dict) and "bullets" in obj_or_text and "paragraph" in obj_or_text:
        bullets = obj_or_text["bullets"]
        paragraph = obj_or_text["paragraph"]
        md = ["# Video Summary", "", "## Key Points"]
        md.extend([f"- {b}" for b in bullets])
        md.append("")
        md.append("## Short Summary")
        md.append(paragraph)
        out_path.write_text("\n".join(md), encoding="utf-8")
        # Also persist the structured summary as JSON (the script's second output file).
        out_path.with_suffix(".json").write_text(json.dumps(obj_or_text, indent=2), encoding="utf-8")
    else:
        # Fallback: the model returned unstructured text, so save it as-is.
        out_path.write_text(str(obj_or_text), encoding="utf-8")
Saves summary both as Markdown (readable) and JSON (machine-readable).
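Putting the helpers together (the input path and output name are placeholders):

raw_summary = summarize_video("safe_video.mp4")  # hypothetical input video
parsed = try_parse_json(raw_summary)
write_outputs(parsed if parsed is not None else raw_summary, Path("video_summary.md"))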
Part 3: Combining Moderation + Summarization
The moderation and summarization scripts can also be chained into a single workflow, as shown in the flow diagram below and sketched in code right after it: a video is first checked for unsafe content, and only if it passes is it sent on for summarization.
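A minimal sketch of this gating logic, reusing the functions defined in Parts 1 and 2 (the 10% threshold is an illustrative choice, not a value from the original scripts):

def process_video(video_path: str, nsfw_threshold_pct: float = 10.0):
    # Stage 1: moderation. moderate_video returns (explanation, flagged %, flagged frames).
    explanation, flagged_pct, flagged_frames = moderate_video(video_path)
    if flagged_frames and float(flagged_pct) > nsfw_threshold_pct:
        # Reject: too much of the video was flagged as unsafe.
        return {"status": "rejected", "reason": explanation, "flagged_frames": flagged_frames}
    # Stage 2: summarization. Only videos that pass moderation reach Qwen2.5-VL.
    raw_summary = summarize_video(video_path)
    summary = try_parse_json(raw_summary)
    return {"status": "approved", "summary": summary}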
Why this Matters:
- Platforms: Stop unsafe content before distribution.
- Users: Quickly understand safe videos through summaries.
- Businesses: Compliance + accessibility in one workflow.
Performance Considerations
- Precompute text embeddings (moderation).
- Sample frames at intervals (e.g., 1 fps) instead of decoding every frame (see the sketch after this list).
- Batch inference for CLIP.
- Set strict JSON prompts for Qwen to guarantee parseable output.
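As an example of the frame-sampling point, here is a variant of extract_frames that keeps roughly one frame per second (the sampling interval is an assumption you can tune):

def extract_frames_sampled(video_path, every_n_seconds: float = 1.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = max(1, int(round(fps * every_n_seconds)))
    frames, idx = [], 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        if idx % step == 0:  # keep only every `step`-th frame
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append((idx, Image.fromarray(rgb)))
        idx += 1
    cap.release()
    return frames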
Future Directions
- Streaming moderation for live video.
- Unified models (e.g., Qwen Omni) to handle both moderation + summarization in one step.
- Customization: Define categories (political, medical, sensitive).
- Scalability: Deploy with ONNX/TensorRT for efficiency.
Results
The video above demonstrates the Gradio application built for content moderation. The three output panels next to the video input show the reason the video was flagged (if any unsafe content is found), the percentage of flagged frames across the whole video, and the frames predicted to contain NSFW content together with their confidence scores.
The video above was run through the video summarization script. The output consists of two files (.md and .json); the resulting summary is shown below:
{
"bullets": [
"Cyclists in colorful jerseys are riding in a tight formation on a city street.",
"The cyclists are pedaling vigorously, maintaining a synchronized pace.",
"Onlookers line the street, some taking photos and others watching the race intently.",
"The scene is bustling with activity as the cyclists approach a curve in the road.",
"A black Audi SUV with 'Dubai Tour' branding drives past, followed by a white police vehicle.",
"The cyclists continue their ride, navigating the curve with precision and speed.",
"The crowd remains engaged, capturing the moment with their cameras and phones."
],
"paragraph":
"The video captures an exhilarating cycling race through a vibrant city street, where cyclists in colorful jerseys maintain a tight formation as they navigate a curve. The scene is lively, with spectators lining the street, some taking photos and others watching the race intently. A black Audi SUV and a white police vehicle drive past, adding to the dynamic atmosphere. The cyclists continue their ride, showcasing their skill and determination as they maneuver around the curve."
}
Conclusion
We built a practical, end-to-end video AI pipeline:
- Moderation with CLIP + Gemini – detects unsafe frames and explains why.
- Summarization with Qwen2.5-VL – produces concise bullet points and a paragraph.
- Unified pipeline – ensures only safe content gets summarized.
By combining classification, explanation, and summarization, this system delivers safety, interpretability, and accessibility for the overwhelming volume of video content online.
This pipeline demonstrates how multimodal AI can transform content workflows, paving the way for real-world deployments in social platforms, education, and enterprise compliance.
References
- CLIP Documentation: https://huggingface.co/docs/transformers/en/model_doc/clip
- Google API Reference Documentation: https://ai.google.dev/api/generate-content
- Qwen2.5-VL documentation on Hugging Face: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct