VLMs

The Ultimate Guide to Vector DB and RAG Pipeline

Processing long documents with VLMs or LLMs poses a fundamental challenge: input size exceeds context limits. Even GPUs as large as 12 GB can barely process 3-4 pages at

Generative AI, VLMs

AI Agent in Action: Automating Desktop Tasks with VLMs

Learn how to build an AI agent from scratch using Moondream3 and Gemini. It is a generic, task-based agent that does not depend on application APIs.

Agentic AI, GUI, VLMs

The Ultimate Guide To VLM Evaluation Metrics, Datasets, And Benchmarks

Get a comprehensive overview of VLM evaluation metrics, benchmarks, and datasets for tasks like VQA, OCR, and image captioning.

Computer Vision, VLMs

Getting Started with VLM on Jetson Nano

Learn how to set up a pipeline to run VLMs on Jetson Nano using Hugging Face Transformers. Run models like LiquidAI, Moondream2, FastVLM, and SmolVLM.

Edge Devices, Jetson Nano, VLM on Jetson Nano, VLMs

VLM on Edge: Worth the Hype or Just a Novelty?

Testing Vision Language Models (VLMs) on edge devices. Check how small VLMs perform on our custom Raspberry Pi cluster and Jetson Nanos.

Edge Devices, Jetson Nano, Raspberry Pi, VLMs

AnomalyCLIP: Harnessing CLIP for Weakly-Supervised Video Anomaly Recognition

Video Anomaly Detection (VAD) is one of the most challenging problems in computer vision. It involves identifying rare, abnormal events in videos – such as burglary, fighting, or accidents –

Anomaly Detection, Vision Transformer, VLMs

Video-RAG: Training-Free Retrieval for Long-Video LVLMs

Learn how Video-RAG boosts training-free, low-compute long-video understanding by pairing OCR, ASR, and open-vocabulary detection with any long-video LVLM.

RAGs, VLMs

Object Detection and Spatial Understanding with VLMs ft. Qwen2.5-VL

What if object detection wasn't just about drawing boxes, but about having a conversation with an image? Dive deep into the world of Vision Language Models (VLMs) and see how

Computer Vision, LLMs, NLP, Vision Language Models, VLMs

SimLingo: Vision-Language-Action Model for Autonomous Driving

SimLingo is a remarkable model that combines autonomous driving, language understanding, and instruction-aware control, all in one unified, camera-only framework. It not only delivered top rankings on CARLA Leaderboard 2.0 and

Advanced Driver Assistance Systems, Autonomous Vehicle, Computer Vision, Robotics, VLMs

Fine-Tuning Gemma 3n for Medical VQA on ROCOv2

What if a radiologist facing a complex scan in the middle of the night could ask an AI assistant for a second opinion, right from their local workstation? This isn't

Computer Vision, Generative AI, Generative Models, LLMs, Multimodal Models, NLP, Transformer Neural Networks, Vision Language Models, Vision Transformer, VLMs

Building an Agentic Browser with LangGraph: A Visual Automation and Summarization Pipeline

Developing intelligent agents with LLMs such as GPT-4o and Gemini that can perform multi-step tasks, adapt to changing information, and make decisions is a core challenge in AI development.

Agentic AI, Computer Vision, Generative AI, Generative Models, LLMs, VLMs

Fine-Tuning AnomalyCLIP: Class-Agnostic Zero-Shot Anomaly Detection

Zero-shot anomaly detection (ZSAD) is a vital problem in computer vision, particularly in real-world scenarios where labeled anomalies are scarce or unavailable. Traditional vision-language models (VLMs) like CLIP fall short

Anomaly Detection, Vision Transformer, VLMs