With AI converging toward truly intelligent Vision Language Models (VLMs), computational complexity is rising more than ever. The usability of these resource-hungry models diminishes due to high operating costs, raising questions about a sustainable future. In 2018, Pete Warden of TensorFlow Lite said, “The future of machine learning is tiny.” Fast forward to 2025, seven long years later: are we in that future yet? Let’s test some VLMs on the edge using our custom-built Raspberry Pi cluster and Jetson Nano boards.
In this series of blog posts, we will experiment extensively with various models. The objective is to find VLMs that are fast, efficient, and practical for edge deployment.
- The Pi and Nano Cluster Setup
- Why the Edge Device Cluster for Running VLMs?
- Setup for Inferencing VLM on Edge Devices
- Vision Language Model Evaluation
- Code for Running VLM on Edge
- Inference Using Qwen2.5VL(3B)
- Inference Using Moondream2
The Pi and Nano Cluster Setup

Did you know that the width of the Raspberry Pi is exactly 3.14 inches? No, it isn’t! I am just making things up; it has absolutely nothing to do with Pi. Jokes aside, we have used the following boards to build the cluster.
- RPi 2 Model B – 2 GB (64 GB Card Storage), No Cooling
- RPi 4 Model B – 4 GB (64 GB Card Storage), No Cooling
- RPi 4 Model B – 8 GB (64 GB Card Storage), No Cooling
- RPi 5 – 8 GB (64 GB Card Storage), No Cooling
- Jetson Nano Devkit – 2 GB, Heat Sink, No Fan
- Jetson Nano – 4 GB, Heat Sink, No Fan
- Jetson Orin Nano, 8GB (256GB SSD), Heat Sink, With Fan
Other auxiliary components include an Ethernet switch and a power supply module. We are using all the developer boards as-is, without any modification, because we initially wanted to see how they perform out of the box. Hence, no additional heat sinks or cooling fans have been added. So this is not exactly a test to compare the devices. The full hardware setup will be discussed in another article soon.
Do you think the boards will hold up or just fry? Continue reading to find out.
Why the Edge Device Cluster for Running VLMs?
A cluster isn’t just a fun DIY project. It can be customized as much as you can imagine. Following are a few advantages of building a custom one.
- Single switch and ethernet connectivity
- Easy to monitor and manage
- Clean and minimal setup
- Scalable architecture
- Perfect for experiments
Setup for Inferencing VLM on Edge Devices
There are many models that claim to run efficiently on edge devices with as little as 2 GB of RAM. All of them will be tested eventually; for the scope of this blog post, we chose Moondream2 and Qwen2.5VL.
We will access all the boards remotely over SSH from a PC, which allows a side-by-side comparison of the devices. To download and manage the models locally, we will use Ollama. The models will be loaded without further quantization.
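For reference, each board gets its own terminal on the host PC over plain SSH, along the lines of the commands below (the usernames and hostnames are placeholders; use whatever you configured on your boards).

ssh pi@rpi5.local            # one terminal per board for side-by-side runs
ssh nano@jetson-orin.local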
1.1 What is Ollama?
Ollama is a lightweight, cross-platform framework to download, run, and manage VLMs (and LLMs) directly on a local device. It offers a CLI, a GUI (Windows only, as of September 2025), and most importantly, a Python SDK. The Python client library is available on PyPI; it wraps Ollama’s local HTTP API so that models can be driven directly from a Python environment.
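To see what the SDK is doing under the hood, here is a minimal sketch of the equivalent raw HTTP call using the requests library. It assumes the Ollama server is running locally on its default port (11434), and sample.jpg is just a placeholder image path.

# Minimal sketch of the raw HTTP call that the Ollama Python SDK wraps.
import base64
import requests

# The REST API expects images as base64-encoded strings.
with open("sample.jpg", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "moondream",
    "messages": [
        {"role": "user", "content": "Describe this image.", "images": [image_b64]}
    ],
    "stream": False,  # return a single JSON response instead of a token stream
}

resp = requests.post("http://localhost:11434/api/chat", json=payload, timeout=600)
print(resp.json()["message"]["content"])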
Ollama has its own curated library from which models can be downloaded. The models come in the GGUF + Modelfile format: GGUF stands for GPT-Generated Unified Format, and a Modelfile is similar to a requirements.txt file, containing instructions for the model. As long as you maintain these requirements, you can also use your own models in Ollama.
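As a rough illustration of the format (the file name, model path, and parameter values below are made up for the example), a minimal Modelfile can look like this and be registered with ollama create:

# Example Modelfile (illustrative): point Ollama at a local GGUF file
FROM ./my-vlm.gguf
# Optional generation parameter
PARAMETER temperature 0.7
# Optional system prompt
SYSTEM "You are a concise visual assistant."

ollama create my-vlm -f Modelfile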
1.2 Install Ollama on Devices
Go ahead and install Ollama on your system; it is available for Windows, Linux, and Mac from the official site. Note that the Python client has to be installed separately from PyPI in your environment using the command pip install ollama. We have installed Ollama on all the boards in the same manner.
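On the Linux-based boards, the typical route is the official install script followed by the Python client. The commands below are shown for reference; always check the instructions on the official site for your platform.

curl -fsSL https://ollama.com/install.sh | sh
pip install ollama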
Vision Language Model Evaluation
Evaluating a VLM is a bit more complex than evaluating a unimodal model (vision-only or language-only), because it has to be good at both perception and reasoning across modalities. The evaluation methods are task-specific. We have simplified our tests and comparisons across the following tasks, taking a broad perspective.
(i). Image Captioning: It is the task of generating a natural language sentence that describes the overall content of the image.
- Output: A sentence or a paragraph.
- Focus: Global understanding, generalization.
(ii). Visual Question Answering (VQA): Answering a natural-language question about the image. The output can be a single number, word, sentence, or paragraph.
(iii). Visual Grounding: The model’s ability to identify and localize objects in an image. The output can be a sentence describing position, bounding box coordinates, or a mask of an object.
(iv). Image Text Retrieval (OCR): As the name suggests, the ability of a model to read and comprehend the text present in an image.
Note: There are many more tasks, such as Cross-Modal Retrieval, Compositional and Logical Reasoning, Video Understanding for Temporal Reasoning, etc. Each task has various extensive datasets, and there are quite a few research papers and benchmarks for evaluating Vision Language Models. Follow this article for deeper insight into VLM evaluation metrics.
Code for Running VLM on Edge Devices
Download the models by using the following commands. We will fetch qwen2.5vl:3b and moondream. It might take some time depending on the speed of your connection.
ollama pull qwen2.5vl:3b
ollama pull moondream
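Once the downloads finish, you can confirm that both models are available locally with:

ollama list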
The following snippet of code will load the model, define the image, and set the query. We have added an argument parser so that the model, image, or query can be changed as required. The response is obtained from the function ollama.chat(), which accepts the model name and a messages list carrying the query and the image path.
# Import libraries.
import ollama
import time
import argparse

# Define main function
def main():
    # Add argument parser
    parser = argparse.ArgumentParser(description="Run an Ollama vision-language model with image + query")
    parser.add_argument("--model", type=str, default="qwen2.5vl:3b", help="Model name (default: qwen2.5vl:3b)")
    parser.add_argument("--image", type=str, default="./tasks/esp32-devkitC-v4-pinout.png", help="Path to input image")
    parser.add_argument("--query", type=str, default="Describe the contents of this image in 100 words.", help="Query string for the model")
    args = parser.parse_args()

    # Init start time variable to measure generation time
    start_time = time.time()

    # Obtain the model response
    response = ollama.chat(
        model=args.model,
        messages=[
            {
                "role": "user",
                "content": args.query,
                "images": [args.image],
            }
        ]
    )
    end_time = time.time()

    print("Model Output:\n", response["message"]["content"])
    print("\nGeneration Time: {:.2f} seconds".format(end_time - start_time))

if __name__ == "__main__":
    main()
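Assuming the script above is saved as, say, vlm_infer.py (the file name and the image paths below are placeholders), the same script covers every test in this post simply by switching the --model, --image, and --query arguments:

# VQA on the pothole image
python3 vlm_infer.py --model qwen2.5vl:3b --image ./tasks/potholes.jpg --query "How many potholes are there in the image?"

# OCR test on the Gemma 2 paper snippet
python3 vlm_infer.py --model qwen2.5vl:3b --image ./tasks/gemma2_page.png --query "Read the text in the image and explain"

# Image captioning with the default ESP32 pinout image
python3 vlm_infer.py --model moondream --query "Describe the contents of this image in 100 words."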
Inference Using Qwen2.5VL(3B)
Qwen2.5VL (3B) was developed by the Qwen team at Alibaba Cloud as part of the Qwen VL (Vision Language) series. Released in January 2025, Qwen2.5VL (3B) outperforms Qwen2VL (7B). The suffix 3B stands for 3 billion parameters; the model is lightweight but powerful.
- RAM Consumed: ~ 5GB
- Model size: 3.2 GB
4.1 Key Features of Qwen2.5VL (3B)
The primary features or task capabilities of the model are as follows.
- Multimodal perception
- Agentic interactivity: Capable of operating tools such as desktop or mobile interfaces.
- Extended video comprehension: Hour-long video analysis using dynamic frame-rate sampling and temporal encoding.
- Precise visual localization: Generates bounding boxes, points, labels, etc. in JSON format.
- Structured data extraction: Parses documents such as invoices, forms, and tables into structured formats.
Check out this article on Qwen2.5VL for in-depth analysis of the architecture and application for Video Analysis and Content Moderation.
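As a side note, the precise visual localization feature mentioned above can be exercised through the same Ollama client by asking for JSON output directly. The sketch below is only illustrative: the prompt wording, image path, and output schema are our own assumptions, and the model will not always return valid JSON.

import json
import ollama

# Ask Qwen2.5VL (3B) to localize objects and reply with JSON only (illustrative prompt).
prompt = (
    "Detect every pothole in the image and reply with JSON only, in the form "
    '[{"label": "pothole", "bbox_2d": [x1, y1, x2, y2]}].'
)

response = ollama.chat(
    model="qwen2.5vl:3b",
    messages=[{
        "role": "user",
        "content": prompt,
        "images": ["./tasks/potholes.jpg"],  # placeholder path
    }],
)

raw = response["message"]["content"]
try:
    boxes = json.loads(raw)  # parse the detections if the model complied
    print(boxes)
except json.JSONDecodeError:
    print("Model did not return valid JSON:\n", raw)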
4.2 Visual Question Answer (VQA) Test on Qwen2.5VL(3B)

We are using two images here: one with a pothole, and one of a person tripping on a shop floor due to some cables. The VQA tasks for the model are as follows:
- How many potholes are there in the image?
- Why is the person falling?
Click on the image below for an enlarged view. As you can see, the devices below 4 GB of RAM are not able to run the model. We could increase the swap size; however, the idea is to check the model’s performance out of the box.
The model is able to count the number of potholes correctly on all the boards, but the time taken for inference varies drastically. You might be wondering why an RTX 3060 12 GB GPU appears in an article about VLMs on the edge; it is added just for reference (pinned in the top left corner). The time taken by the Jetson Orin Nano 8 GB and the RTX 3060 is comparable, at 4 and 9.4 seconds respectively. The Pis, on the other hand, took close to 2 minutes and 5 minutes respectively.
All the boards (above 4 GB) are able to run the model correctly, with similar patterns in the time taken; here the time reduced to 2.48 s and 2.78 s on the RTX 3060 and the Orin Nano respectively. Given the footprint of the Jetson Orin Nano, the performance is commendable.
4.3 OCR Capability of Qwen2.5VL(3B)
We are passing an image with a simple heading and a paragraph from the Gemma 2 paper for our model to analyse. The text input is: “Read the text in the image and explain”. All devices above 4 GB of RAM are able to run it well.
4.4 Image Captioning using Qwen2.5VL (3B)
We are using an image containing the pinout diagram of an ESP32 developer board from Espressif. The task for the model is to generate an overall description of the image in 100 words.
By the time we reached the captioning test, the Raspberry Pis called it a day. They were throttling as the temperature reached 90 degrees Celsius and could not generate output even after 10 minutes. The Pi 4B 4 GB, which was not running the model, sat at 47 degrees Celsius, whereas the Pi 5 and the Pi 4B trying to run captioning were at 90 degrees Celsius. It is clear that for running VLMs on edge devices, at least passive cooling with a heat sink is a must.
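If you want to keep an eye on thermals during such runs, polling the SoC temperature from sysfs works on both the Pis and the Jetsons. Below is a small sketch; thermal_zone0 is the usual zone on a Raspberry Pi, but zone numbering can differ between devices, so verify it on your board.

import time

def read_temp_c(zone: int = 0) -> float:
    # sysfs reports the temperature in millidegrees Celsius.
    with open(f"/sys/class/thermal/thermal_zone{zone}/temp") as f:
        return int(f.read().strip()) / 1000.0

# Print the temperature every 5 seconds (stop with Ctrl+C).
while True:
    print(f"SoC temperature: {read_temp_c():.1f} °C")
    time.sleep(5)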
4.5 Discussion of Results

Clearly, the Jetson Orin Nano is performing well in terms of speed and accuracy. However, we should not forget that the Orin Nano board ships with a large heat sink and an active fan out of the box, so this is not a fair comparison at all. Then again, this is not exactly a comparison.

Inference Using Moondream2, the Perfect VLM on Edge
Moondream is billed as the world’s tiniest vision language model, with just 1.8 billion parameters. It can run on devices with as little as 2 GB of memory, which makes it ideal for almost all edge devices.
- Memory required: Less than 2 GB
- Model size: 1.7 GB
- Vision encoder: SigLIP based
- Language model: Phi-1.5 based
It is built using distilled and fine-tuned components from SigLIP (a sigmoid-based contrastive loss model) and Microsoft’s Phi-1.5 language model. Read more about DeepMind SigLIP here.
For testing, the Moondream2 model is subjected to the same tests as above and results are recorded.
5.1 Visual Question Answer (VQA) Test on Moondream2
We can see that Moondream2 is much faster. Interestingly, the Jetson Orin Nano displayed garbage output. We observed this error only with Moondream, even after trying other models such as LLaVA 7B, LLaVA-Llama3, Gemma2:4b, and Gemma3n. A few users have reported the issue as well; we will dig deeper and post an update on the causes and fixes later. Some wrong counts were also observed with the RTX 3060 and the Jetson Nano 4 GB.
Moondream2 says the person might have fallen due to running. It has also detected the entangled cable; however, it is not able to connect the fall with the entanglement. Qwen2.5VL, on the other hand, is able to identify the reason clearly, though it takes a little longer.
5.2 OCR Capability of Moondream2
The character recognition is decent, and all devices are able to recognize the content, excluding the Jetson Orin Nano, which again defaulted to generating “GGGG” as in the earlier case.
5.3 Image Captioning using Moondream2
Although the devices took their time, all performed as expected. The model clearly describes that the image is an ESP32 devkit pinout diagram and covers the related information. Minor hallucinations are observed.
5.4 Discussion of Results


CONCLUSION
With this, we wrap up experimenting with VLMs on edge devices out of the box. Moondream2 is designed for compact, edge-friendly inference and appears to operate with a relatively limited token context (approx. 1,000 tokens), suitable for quick multimodal tasks on constrained hardware.
Qwen2.5-VL (3B) is a much more capable multimodal model, supporting very long context windows of up to 125K tokens, enabling the processing of large documents, videos, multi-image sequences, and agentic pipelines.
This was just about setting up the basic test pipeline. Next, we will add heat sinks, cooling fans, and SSDs for better performance, and we will test more models from Hugging Face using the Transformers library.
I hope you liked reading the article VLM on Edge Devices. Please add your feedback in the comments section below. Any suggestions or requests are always welcome.
- VLM on Edge: Worth the Hype or Just a Novelty?
- AnomalyCLIP : Harnessing CLIP for Weakly-Supervised Video Anomaly Recognition
- AI for Video Understanding: From Content Moderation to Summarization
- Video-RAG: Training-Free Retrieval for Long-Video LVLMs
- Object Detection and Spatial Understanding with VLMs ft. Qwen2.5-VL