
Deployment on Edge: LLM Serving on Jetson using vLLM

Learn what it really takes to run LLMs on an 8 GB Jetson Orin Nano, covering setup, failures, memory tuning, and a practical comparison between vLLM and llama.cpp.
LLM serving through vLLM on Jetson Orin Nano

Large Language Models are increasingly moving from the cloud to the edge, driven by requirements for low latency, privacy, and cost-efficient deployment. While vLLM is widely regarded as the throughput leader on server-class GPUs, its behavior on resource-constrained, ARM-based edge devices is far less explored. In this article, we evaluate LLM deployment on the Jetson Orin Nano (8 GB) using vLLM, focusing on real deployment constraints such as memory pressure, concurrency, and stability. Using an in-house ChatML application, we compare vLLM against llama.cpp to understand where each framework fits in practical edge scenarios.

The goal is not synthetic benchmarking, but a deployment-first perspective on running production-style LLM services on an 8 GB edge device.

1. Introduction to SimpleChatUI

SimpleChatUI is a full-stack ChatML application built for in-house local inference and experiments. It handles both vision and language inputs, integrates document processing with RAG, and supports web search using Tavily. Chat history is maintained in Qdrant, used as an in-memory vector database.

ChatML application for testing LLM deployment from the Jetson Orin Nano

The entire application can be summarised into the following parts.

  • A FastAPI backend managing files, requests, and the in-memory database (a minimal sketch follows this list)
  • A responsive HTML/CSS frontend for chat
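The actual backend ships with the downloadable code. Purely to illustrate how it ties into the inference server, here is a minimal, hypothetical sketch of a FastAPI chat route that forwards messages to an OpenAI-compatible endpoint. The route name, request schema, and environment variable are illustrative assumptions, not the exact ones used in SimpleChatUI.

import os

from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()

# Points at the local inference server; defaults to the vLLM port used later in this post.
client = OpenAI(
    base_url=os.getenv("OPENAI_API_BASE", "http://0.0.0.0:8000/v1"),
    api_key="not-needed",  # local servers ignore the key
)

class ChatRequest(BaseModel):
    messages: list[dict]  # [{"role": "user", "content": "..."}, ...]

@app.post("/chat")
def chat(req: ChatRequest):
    # Forward the conversation to the inference server and return the reply text.
    response = client.chat.completions.create(
        model="LiquidAI/LFM2.5-1.2B-Instruct",
        messages=req.messages,
    )
    return {"reply": response.choices[0].message.content}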

Install the dependencies listed in the requirements.txt file provided in the code directory to sort out the SimpleChatUI dependencies, and proceed with the next steps. You will find the downloadable code below or in the left menu bar.

2. LLM on Edge Deployment Setup

We are using the same edge cluster that we used for the earlier VLM on edge experiments. Here, we are only using the Jetson Orin Nano, with 8 GB of memory and a 256 GB SSD, running JetPack 6 and CUDA 12.6. Check out the Jetson setup video if you are doing this for the first time. For the inference server and the ChatML app, the system design is as follows.

3. Setup vLLM On Jetson Orin Nano

Although vLLM actively maintains the PyPI package, it did not work for me. The best way is to work with the Docker containers provided by NVIDIA-AI-IOT. Just pull the container, and you are ready to serve models using vLLM.

docker pull ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin

It will take some time depending on your internet speed. Once the container is pulled, run it using the following command.

3.1 Run vLLM Container

docker run --rm -it --runtime nvidia \
--gpus all --network host --shm-size=8g  \
-v $HOME/.cache/huggingface:/root/.cache/huggingface   \
-v $PWD:/workspace \
ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin

The flags used and their meaning are as follows.

  • --runtime nvidia and --gpus all: Enable CUDA and expose the board’s GPU inside the container. With this, frameworks like vLLM or llama.cpp can use hardware acceleration.
  • --network host: The container shares the host’s network stack, making localhost inside the container identical to the host’s.
  • --shm-size=8g: Increases shared memory to 8 GB, preventing crashes or slowdowns common in ML workloads.
  • $HOME/.cache/huggingface: The volume mount allows downloaded models to persist across container runs.

3.2 Start vLLM Engine for Serving LLM on Edge

We are using the LiquidAI/LFM2.5-1.2B-Instruct model for our experiment here. The model will be loaded as-is in half precision (BF16). The serving command is as follows.

vllm serve LiquidAI/LFM2.5-1.2B-Instruct \
--gpu-memory-utilization 0.5 \
--max-model-len 4096 \
--max-num-batched-tokens 1024

The command downloads the model from Hugging Face (if not already downloaded) and starts the vLLM engine. This setting gives me approximately 17 concurrent users and around 40 tokens/sec throughput. Try increasing the model length, batched tokens, and GPU memory utilization to see how far your board can go.

The server starts an OpenAI-compatible v1 endpoint on port 8000 by default. You can change the port using the --port flag. However, make sure to update the port in the applications below as well.
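To quickly verify the endpoint outside the ChatML app, you can point the standard openai Python client at the local server. A minimal sketch (the prompt is arbitrary, and the client needs pip install openai):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[{"role": "user", "content": "Explain the KV cache in one paragraph."}],
)
print(response.choices[0].message.content)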

Find the basic details of the vLLM deployment engine here.

3.3 Handling Fail Cases while Deploying LLMs on Jetson

Unable to get the server running? Don’t worry, you are not the first. The biggest limitation of the Jetson Orin Nano is that 8 GB of unified memory is not much. All of the system processes run within this constrained space. There are a few memory optimization techniques mentioned on the official NVIDIA Jetson site. You may have to shut down a few unnecessary services.

Second, there is the issue of memory fragmentation. You can’t really prevent it; it is bound to happen over time. The methods that help are as follows:

  • Clear the cache using the jtop utility once in a while
  • Shut down and perform a cold boot

If it is still not working, showing OOM errors, or complaining about not enough contiguous blocks, try enforcing eager mode. This also disables CUDA graphs. The concurrency is reduced to 1, and as you can see, I have also reduced the GPU utilization limit.

vllm serve LiquidAI/LFM2.5-1.2B-Instruct \
--gpu-memory-utilization 0.4 \
--max-model-len 4096 \
--max-num-batched-tokens 1024 \
--max-num-seqs 1 \
--enforce-eager

After going through the above steps, try again. It should work fine now. If it worked on the very first attempt, even better: now you know what to do when it fails.
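As a quick sanity check after restarting the engine, you can list the models the endpoint is serving; if this returns without error, the server is up. A minimal sketch using the requests library:

import requests

# /v1/models is part of the OpenAI-compatible API exposed by the vLLM server.
resp = requests.get("http://0.0.0.0:8000/v1/models", timeout=10)
resp.raise_for_status()

for model in resp.json()["data"]:
    print("Serving:", model["id"])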

4. Start SimpleChatUI ChatML Application

The reason for using only 40% of the GPU memory above is that we have to run a few other things: the ChatML app backend, and something like an ngrok or cloudflared tunnel to expose it.

Navigate to the downloaded code directory and install the dependencies from requirements.txt. In a separate terminal session, run the following.

export OPENAI_API_BASE=http://0.0.0.0:8000/v1

uvicorn app.main:app --host 0.0.0.0 --port 3000

This runs the application at http://0.0.0.0:3000. Now, to make the application accessible from anywhere in the world, we will use a cloudflared tunnel. The installation commands are as follows. A successful run returns a public URL, something like https://jellybean-mango-bite.trycloudflare.com.

curl -L https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-arm64 \
  -o cloudflared

chmod +x cloudflared
sudo mv cloudflared /usr/local/bin/
# Confirm version with 
cloudflared --version

# Run cloudflared tunnel
cloudflared tunnel --url http://0.0.0.0:3000

How long the tunnel stays alive depends on the following. There is no guarantee of uptime with temporary domains. You can configure a personal domain, if you have one, for a persistent link.

  • The Jetson stays running
  • The terminal session stays open
  • Network fluctuations
  • Idle time: if there are no requests for a long time, cloudflared may reclaim the URL

5. Setting Up Web Search API as context for LLM

SimpleChatUI also integrates web search as context for the LLM chat. We are using Tavily for this. You will need a Tavily account, which allows 1,000 free credits for experiments (as of January 2026).

Tavily free web search API

The default API key can be used directly, or you can generate a new one specifically for this test. If the ChatML application server is running, close it first, then run the following command. Make sure to use the same terminal session, and then run the uvicorn command again.

export TAVILY_API_KEY={add-your-api-key-here-without-curly-braces}
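To confirm the key works before starting the app, a minimal sketch with the tavily-python client (pip install tavily-python) can be used. This is an assumption for testing only; SimpleChatUI may call the Tavily API differently internally.

import os

from tavily import TavilyClient

# Reads the key exported above; the call fails if the key is missing or invalid.
client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

results = client.search("latest JetPack release for Jetson Orin Nano", max_results=3)
for item in results["results"]:
    print(item["title"], "->", item["url"])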

With web search enabled, play with the chat for some more time. Observe the output quality and speed, and see whether it meets your expectations. Next, we will set up the llama.cpp server.

6. Build Llama.CPP on Jetson Orin Nano

llama.cpp is built for efficiency, targeting on-device inference. However, it can still take advantage of a GPU backend (if available). To make the comparison fair, we will build with the CUDA flag ON. Execute the commands as follows.

PS: If you are working on this step, turn off the vLLM server now; the build is process intensive as well. There is no need to turn off the cloudflared tunnel or the FastAPI server.

6.1 Install Dependencies for Llama.CPP Build

sudo apt update
sudo apt install -y \
    git cmake build-essential \
    python3 python3-pip \
    libopenblas-dev \
    libcurl4-openssl-dev

6.2 Clone Repositories and Start Build

git clone https://github.com/ggml-org/llama.cpp.git

cd llama.cpp && mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=87 -DGGML_OPENBLAS=ON
cmake --build . -j$(nproc)

If the build crashes or gets stuck, reduce the number of cores, for example -j4 instead of -j$(nproc). Upon a successful build, you will find llama-cli, llama-server, llama-bench, and many more binaries in the llama.cpp/build/bin directory.

6.3 Start Llama.CPP Server on Jetson Orin Nano

With the binaries built, navigate to the llama.cpp/build/bin/ directory and run the following command. We are using the BF16 GGUF model, with the same context length, for a fair comparison. The default port for llama.cpp is 8080, so we set 8000 explicitly. Since llama-server also exposes an OpenAI-compatible endpoint, the ChatML application works without any changes.

llama-server -hf LiquidAI/LFM2.5-1.2B-Instruct-GGUF:BF16   \
--port 8000  \
--ctx-size 4096

Reload the chatML application and continue chatting as usual.

7. Results of Deploying LLM on Edge through vLLM and Llama.CPP

For a resource-constrained device like the Jetson Orin Nano, the results are impressive. It served a peak throughput of around 40 tokens/sec in some cases. Check the tables below for the comparison.

7.1 Throughput Comparison

Prompt | Llama.CPP (tokens/sec) | vLLM (tokens/sec)
Hi (first prompt) | ~1 | ~2
Write an essay on cow | 35 | ~40
Explain YOLOv4 vs YOLOv5 | 26 | 26
Write Python code to perform object detection using the Ultralytics YOLOv8 medium model, with OpenCV and related libraries | 24 | 26

Although I expected significantly higher throughput from vLLM, it didn’t happen. It is slightly faster but not by a huge margin.
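The per-prompt numbers above were read from the app itself. If you want to reproduce a similar measurement directly against the server, a rough streaming sketch such as the following works; note that the chunk count is only an approximation of the token count.

import time

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-needed")

# Stream the response and count content chunks as a rough proxy for generated tokens.
start = None
chunks = 0
stream = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[{"role": "user", "content": "Write an essay on cow"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if start is None:
            start = time.perf_counter()  # clock starts at the first generated chunk
        chunks += 1

elapsed = max(time.perf_counter() - start, 1e-6)
print(f"~{chunks / elapsed:.1f} chunks/sec (roughly tokens/sec)")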

7.2 Memory Utilization and Concurrency

vLLM consumes significantly more memory, 6.4 GB compared to 4.2 GB for the llama.cpp server. In terms of concurrency, however, vLLM is unbeatable; a quick way to sanity-check this on your own board is sketched after the table.

Particulars | Llama.CPP | vLLM
Memory Utilization | 4.2 GB | 6.4 GB
Max Concurrent Users | 4 | 17
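This is an illustrative script, not the exact one used for the table above: it fires a handful of parallel requests with asyncio and reports the aggregate throughput the server sustains.

import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-needed")

async def one_request(i: int) -> int:
    # Each task sends a short prompt and returns its completion token count.
    resp = await client.chat.completions.create(
        model="LiquidAI/LFM2.5-1.2B-Instruct",
        messages=[{"role": "user", "content": f"Write two lines about edge AI ({i})."}],
        max_tokens=64,
    )
    return resp.usage.completion_tokens

async def main(concurrency: int = 8) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request(i) for i in range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"{concurrency} parallel requests, {sum(tokens)} tokens, "
          f"{sum(tokens) / elapsed:.1f} tokens/sec aggregate")

asyncio.run(main())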

8. Conclusion: Deploying LLM on Edge using vLLM and Llama.CPP

This experiment demonstrates that running production-grade LLM servers on edge devices like the Jetson Orin Nano is not only feasible but practical, provided expectations are set correctly. Both vLLM and llama.cpp delivered usable throughput, stable responses, and acceptable latency despite the obvious constraints.

From the results, a clear trade-off emerges. llama.cpp excels in memory efficiency, making it a safer choice for single-user or low-concurrency scenarios where system stability is critical. On the other hand, vLLM’s strength lies in concurrency, enabling significantly more simultaneous users.

Interestingly, the performance gap between the two frameworks narrows considerably on ARM-based edge GPUs. This highlights an important insight: on constrained devices, memory bandwidth, fragmentation, and system overhead often dominate over algorithmic optimizations that shine on larger GPUs. In such environments, deployment strategy matters as much as the inference engine itself.

For edge practitioners, the choice is straightforward:

  • Use llama.cpp when memory headroom is tight and user concurrency is low.
  • Use vLLM when serving multiple clients, integrating RAG, or exposing LLMs as shared services, even on small devices.

As Jetson platforms evolve and techniques like quantization, speculative decoding, and KV-cache optimizations mature, the boundary between “edge” and “server-grade” LLM deployment will continue to blur. This experiment is a strong indication that edge AI is no longer limited to inference demos; it is ready for real applications.

With this, we wrap up the blog post on deploying LLMs on the edge using vLLM and llama.cpp. I hope you enjoyed reading the article. Please feel free to comment below with ideas for the next blog post, bug reports, or any general suggestions. Happy Learning!


