Deploying Open Source Vision Language Models (VLM) on Jetson

Community Article Published February 24, 2026

Vision-Language Models (VLMs) mark a significant leap in AI by blending visual perception with semantic reasoning. Moving beyond traditional models constrained by fixed labels, VLMs utilize a joint embedding space to interpret and discuss complex, open-ended environments using natural language.

Rapid gains in reasoning accuracy and efficiency have made these models practical on edge devices. The NVIDIA Jetson family, ranging from the high-performance AGX Thor and AGX Orin to the compact Orin Nano Super, is purpose-built to drive accelerated applications for physical AI and robotics, providing the optimized runtime necessary for leading open source models.

In this tutorial, we will demonstrate how to deploy the NVIDIA Cosmos Reason 2B model across the Jetson lineup using the vLLM framework. We will also guide you through connecting this model to the Live VLM WebUI, enabling a real-time, webcam-based interface for interactive physical AI.

Prerequisites

Supported Devices:

Jetson AGX Thor Developer Kit
Jetson AGX Orin (64GB / 32GB)
Jetson Orin Nano Super

JetPack Version:

JetPack 6 (L4T r36.x) — for Orin devices
JetPack 7 (L4T r38.x) — for Thor
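
To confirm which JetPack line a device is on, one option is to read the L4T release string. The sketch below assumes the usual `/etc/nv_tegra_release` header format (`# R36 (release), REVISION: 4.0, ...`); the helper names `l4t_major` and `jetpack_for_l4t` are hypothetical.

```shell
# Print the L4T major release (e.g. 36) parsed from a release header line.
# Assumes the header looks like: "# R36 (release), REVISION: 4.0, ..."
l4t_major() {
  printf '%s\n' "$1" | sed -n 's/^# R\([0-9][0-9]*\) (release).*/\1/p'
}

# Map the L4T major release to the JetPack line listed above.
jetpack_for_l4t() {
  case "$1" in
    36) echo "JetPack 6" ;;
    38) echo "JetPack 7" ;;
    *)  echo "unknown" ;;
  esac
}

# On a Jetson, read the first line of the release file if it exists.
if [ -r /etc/nv_tegra_release ]; then
  major=$(l4t_major "$(head -n 1 /etc/nv_tegra_release)")
  echo "L4T r${major}.x -> $(jetpack_for_l4t "$major")"
fi
```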

Storage: NVMe SSD required

~5 GB for the FP8 model weights
~8 GB for the vLLM container image
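
A quick way to check free space before downloading is sketched below. The 15 GB threshold is an assumption (the two sizes above plus some slack), `has_free_gb` is a hypothetical helper, and Docker usually stores images under its own data directory rather than `$HOME`, so that filesystem may need checking too.

```shell
# Succeed if the filesystem holding DIR has at least NEED_GB gigabytes free.
has_free_gb() {
  dir=$1
  need_gb=$2
  # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# ~5 GB of weights plus ~8 GB of container image, rounded up to 15 GB for slack.
if has_free_gb "$HOME" 15; then
  echo "Enough free space under $HOME"
else
  echo "Less than 15 GB free under $HOME"
fi
```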

Accounts:

Create a free NVIDIA NGC account to download both the model and the vLLM container

Overview

	Jetson AGX Thor	Jetson AGX Orin	Orin Nano Super
vLLM Container	`nvcr.io/nvidia/vllm:26.01-py3`	`ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04`	`ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04`
Model	FP8 via NGC (volume mount)	FP8 via NGC (volume mount)	FP8 via NGC (volume mount)
Max Model Length	8192 tokens	8192 tokens	256 tokens (memory-constrained)
GPU Memory Util	0.8	0.8	0.65
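
For scripting, the table can be captured in a small lookup. This is only a sketch: the function name and the device labels (`thor`, `agx-orin`, `orin-nano`) are made up here, while the images and limits are copied from the table.

```shell
# Print the container image and vLLM limits from the overview table for one device.
# The accepted names (thor, agx-orin, orin-nano) are labels invented for this sketch.
jetson_vllm_settings() {
  case "$1" in
    thor)
      echo "image=nvcr.io/nvidia/vllm:26.01-py3 max_model_len=8192 gpu_mem_util=0.8" ;;
    agx-orin)
      echo "image=ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 max_model_len=8192 gpu_mem_util=0.8" ;;
    orin-nano)
      echo "image=ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 max_model_len=256 gpu_mem_util=0.65" ;;
    *)
      echo "unknown device: $1" >&2
      return 1 ;;
  esac
}

jetson_vllm_settings orin-nano
```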

The workflow is the same for all three devices:

Download the FP8 model checkpoint via NGC CLI
Pull the vLLM Docker image for your device
Launch the container with the model mounted as a volume
Connect Live VLM WebUI to the vLLM endpoint

Step 1: Install the NGC CLI

The NGC CLI lets you download model checkpoints from the NVIDIA NGC Catalog.

Download and install

mkdir -p ~/Projects/CosmosReason
cd ~/Projects/CosmosReason

# Download the NGC CLI for ARM64
# Get the latest installer URL from: https://org.ngc.nvidia.com/setup/installers/cli
wget -O ngccli_arm64.zip https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.13.0/files/ngccli_arm64.zip
unzip ngccli_arm64.zip
chmod u+x ngc-cli/ngc

# Add to PATH
export PATH="$PATH:$(pwd)/ngc-cli"
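
The `export` above only lasts for the current shell session. One way to make it stick is to append the line to a shell startup file, as sketched below; `persist_path_entry` is a hypothetical helper, and `~/.bashrc` is assumed to be the startup file in use.

```shell
# Append an export line for DIR to RC_FILE unless that exact line is already there.
persist_path_entry() {
  rc_file=$1
  dir=$2
  line="export PATH=\"\$PATH:$dir\""
  grep -qxF -- "$line" "$rc_file" 2>/dev/null || printf '%s\n' "$line" >> "$rc_file"
}

# Example (not run here):
#   persist_path_entry "$HOME/.bashrc" "$HOME/Projects/CosmosReason/ngc-cli"
```

Running the helper twice leaves a single line in the file, so it is safe to re-run.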

Configure the CLI

ngc config set

You will be prompted for:

API Key — generate one at NGC API Key setup
CLI output format — choose json or ascii
org — press Enter to accept the default

Step 2: Download the Model

Download the FP8 quantized checkpoint. This is used on all Jetson devices:

cd ~/Projects/CosmosReason
ngc registry model download-version "nim/nvidia/cosmos-reason2-2b:1208-fp8-static-kv8"

This creates a directory called cosmos-reason2-2b_v1208-fp8-static-kv8/ containing the model weights. Note the full path — you will mount it into the Docker container as a volume.
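
Before moving on, it can help to confirm the download actually produced a usable directory. The check below is a sketch that assumes a Hugging Face style layout (a `config.json` plus at least one `*.safetensors` file); the article does not list the directory contents, so adjust the file names if yours differ. `looks_like_checkpoint` is a hypothetical helper.

```shell
# Succeed if DIR looks like a Hugging Face style checkpoint:
# a config.json plus at least one *.safetensors file (both are assumptions).
looks_like_checkpoint() {
  dir=$1
  [ -f "$dir/config.json" ] || return 1
  for f in "$dir"/*.safetensors; do
    [ -f "$f" ] && return 0
  done
  return 1
}

MODEL_PATH="$HOME/Projects/CosmosReason/cosmos-reason2-2b_v1208-fp8-static-kv8"
if looks_like_checkpoint "$MODEL_PATH"; then
  echo "Checkpoint found at $MODEL_PATH"
else
  echo "No checkpoint files found at $MODEL_PATH"
fi
```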

Step 3: Pull the vLLM Docker Image

For Jetson AGX Thor

docker pull nvcr.io/nvidia/vllm:26.01-py3

For Jetson AGX Orin / Orin Nano Super

docker pull ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04

Step 4: Serve Cosmos Reason 2B with vLLM

Option A: Jetson AGX Thor

Thor has ample GPU memory and can run the model with a generous context length.

Set the path to your downloaded model and free cached memory on the host:

MODEL_PATH="$HOME/Projects/CosmosReason/cosmos-reason2-2b_v1208-fp8-static-kv8"
sudo sysctl -w vm.drop_caches=3

Launch the container with the model mounted:

docker run --rm -it \
  --runtime nvidia \
  --network host \
  --ipc host \
  -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  nvcr.io/nvidia/vllm:26.01-py3 \
  bash

Inside the container, serve the model:

vllm serve /models/cosmos-reason2-2b \
  --max-model-len 8192 \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.8

Note: The --reasoning-parser qwen3 flag enables chain-of-thought reasoning extraction. The --media-io-kwargs flag configures video frame handling.

Wait until you see:

INFO:     Uvicorn running on http://0.0.0.0:8000

Option B: Jetson AGX Orin

AGX Orin has enough memory to run the model with the same generous parameters as Thor.

Set the path to your downloaded model and free cached memory on the host:

MODEL_PATH="$HOME/Projects/CosmosReason/cosmos-reason2-2b_v1208-fp8-static-kv8"
sudo sysctl -w vm.drop_caches=3

1. Launch the container:

docker run --rm -it \
  --runtime nvidia \
  --network host \
  -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 \
  bash

2. Inside the container, activate the environment and serve:

cd /opt/
source venv/bin/activate

vllm serve /models/cosmos-reason2-2b \
  --max-model-len 8192 \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.8

Wait until you see:

INFO:     Uvicorn running on http://0.0.0.0:8000

Option C: Jetson Orin Nano Super (memory-constrained)

The Orin Nano Super has significantly less RAM, so we need aggressive memory optimization flags.

Set the path to your downloaded model and free cached memory on the host:

MODEL_PATH="$HOME/Projects/CosmosReason/cosmos-reason2-2b_v1208-fp8-static-kv8"
sudo sysctl -w vm.drop_caches=3

1. Launch the container:

docker run --rm -it \
  --runtime nvidia \
  --network host \
  -v "$MODEL_PATH:/models/cosmos-reason2-2b:ro" \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04 \
  bash

2. Inside the container, activate the environment and serve:

cd /opt/
source venv/bin/activate

vllm serve /models/cosmos-reason2-2b \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --enforce-eager \
  --max-model-len 256 \
  --max-num-batched-tokens 256 \
  --gpu-memory-utilization 0.65 \
  --max-num-seqs 1 \
  --enable-chunked-prefill \
  --limit-mm-per-prompt '{"image":1,"video":1}' \
  --mm-processor-kwargs '{"num_frames":2,"max_pixels":150528}'

Key flags explained (Orin Nano Super only):

Flag	Purpose
`--enforce-eager`	Disables CUDA graphs to save memory
`--max-model-len 256`	Limits context to fit in available memory
`--max-num-batched-tokens 256`	Matches the model length limit
`--gpu-memory-utilization 0.65`	Reserves headroom for system processes
`--max-num-seqs 1`	Single request at a time to minimize memory
`--enable-chunked-prefill`	Processes prefill in chunks for memory efficiency
`--limit-mm-per-prompt`	Limits to 1 image and 1 video per prompt
`--mm-processor-kwargs`	Reduces video frames and image resolution
`VLLM_SKIP_WARMUP=true`	Environment variable rather than a CLI flag (not set in the command above); skips warmup to save time and memory

Wait until you see the server is ready:

INFO:     Uvicorn running on http://0.0.0.0:8000

Verify the server is running

From another terminal on the Jetson:

curl http://localhost:8000/v1/models

You should see the model listed in the response.
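
Since loading the model can take a while, a script may prefer to poll this endpoint instead of watching the log. The helper below is a minimal sketch; `wait_for_vllm` and its default attempt count and delay are arbitrary choices, not part of vLLM.

```shell
# Poll URL until it answers with an HTTP success status or the attempts run out.
# wait_for_vllm is a hypothetical helper; the defaults are arbitrary.
wait_for_vllm() {
  url=${1:-http://localhost:8000/v1/models}
  attempts=${2:-60}
  delay=${3:-5}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s keeps it quiet.
    if curl -fs --max-time 5 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (not run here):
#   wait_for_vllm && echo "vLLM is ready"
```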

Step 5: Test with a Quick API Call

Before connecting the WebUI, verify the model responds correctly:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/cosmos-reason2-2b",
    "messages": [
      {
        "role": "user",
        "content": "What capabilities do you have?"
      }
    ],
    "max_tokens": 128
  }' | python3 -m json.tool

Tip: The model name used in the API request must match what vLLM reports. Verify with curl http://localhost:8000/v1/models.
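
The call above is text-only. To exercise the vision path as well, a request can carry an image as a base64 data URL in the OpenAI-style `image_url` content part. The sketch below only builds the JSON body; `build_image_request` and `frame.jpg` are hypothetical names, and the image is assumed to be a JPEG. With the 256-token Orin Nano Super configuration, an image request may not fit in the context window (a reader reports this in the comments).

```shell
# Print an OpenAI-style chat request with one text part and one inline JPEG.
# build_image_request is a hypothetical helper; PROMPT must not contain
# double quotes or backslashes because it is pasted into the JSON unescaped.
build_image_request() {
  model=$1
  image=$2
  prompt=$3
  # tr strips the line wrapping that base64 adds on some platforms.
  b64=$(base64 < "$image" | tr -d '\n')
  cat <<EOF
{
  "model": "$model",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "$prompt"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,$b64"}}
      ]
    }
  ],
  "max_tokens": 128
}
EOF
}

# Example (not run here; frame.jpg is a placeholder file name):
#   build_image_request /models/cosmos-reason2-2b frame.jpg "Describe this scene." \
#     | curl -s http://localhost:8000/v1/chat/completions \
#         -H "Content-Type: application/json" -d @- | python3 -m json.tool
```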

Step 6: Connect to Live VLM WebUI

Live VLM WebUI provides a real-time webcam-to-VLM interface. With vLLM serving Cosmos Reason 2B, you can stream your webcam and get live AI analysis with reasoning.

Install Live VLM WebUI

The easiest method is a pip install, run here through uv (open another terminal):

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
cd ~/Projects/CosmosReason
uv venv .live-vlm --python 3.12
source .live-vlm/bin/activate
uv pip install live-vlm-webui
live-vlm-webui

Or use Docker:

git clone https://github.com/nvidia-ai-iot/live-vlm-webui.git
cd live-vlm-webui
./scripts/start_container.sh

Configure the WebUI

Open https://localhost:8090 in your browser
Accept the self-signed certificate (click Advanced → Proceed)
In the VLM API Configuration section on the left sidebar:
- Set API Base URL to http://localhost:8000/v1
- Click the Refresh button to detect the model
- Select the Cosmos Reason 2B model from the dropdown
Select your camera and click Start

The WebUI will now stream your webcam frames to Cosmos Reason 2B and display the model’s analysis in real-time.

Recommended WebUI settings for Orin Nano Super

Since the Orin Nano Super configuration runs with a much shorter context length (256 tokens), adjust these settings in the WebUI:

Max Tokens: Set to 100–150 (shorter responses complete faster)
Frame Processing Interval: Set to 60+ (gives the model time between frames)

Troubleshooting

Out of memory on Orin

Problem: vLLM crashes with CUDA out-of-memory errors.

Solution:

Free system memory before starting:
```
sudo sysctl -w vm.drop_caches=3
```
Lower --gpu-memory-utilization (try 0.55 or 0.50)
Reduce --max-model-len further (try 128)
Make sure no other GPU-intensive processes are running
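
Jetson modules share one pool of memory between CPU and GPU, so the amount the OS reports as available is a useful number to look at before picking a `--gpu-memory-utilization` value. A small sketch that reads it from `/proc/meminfo` (the helper name is hypothetical):

```shell
# Print MemAvailable from a meminfo-style file in MiB (defaults to /proc/meminfo).
mem_available_mib() {
  awk '/^MemAvailable:/ {printf "%d\n", $2 / 1024}' "${1:-/proc/meminfo}"
}

echo "Available memory: $(mem_available_mib) MiB"
```

Comparing the value before and after `sudo sysctl -w vm.drop_caches=3` shows how much the cache flush actually freed.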

Model not found in WebUI

Problem: The model doesn’t appear in the Live VLM WebUI dropdown.

Solution:

Verify vLLM is running: curl http://localhost:8000/v1/models
Make sure the WebUI API Base URL is set to http://localhost:8000/v1 (not https)
If vLLM and WebUI are in separate containers, use http://<jetson-ip>:8000/v1 instead of localhost
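
To see exactly which model names the server advertises, the `/v1/models` response can be reduced to its `id` fields. The sketch below leans on `python3` for JSON parsing (already used earlier for `json.tool`); `list_model_ids` is a hypothetical helper, and the response is assumed to follow the OpenAI list shape with a top-level `data` array.

```shell
# Read a /v1/models response on stdin and print one model id per line.
# Assumes the OpenAI-style shape: {"data": [{"id": "..."}, ...]}
list_model_ids() {
  python3 -c 'import json, sys; print("\n".join(m["id"] for m in json.load(sys.stdin)["data"]))'
}

# Example (not run here):
#   curl -s http://localhost:8000/v1/models | list_model_ids
```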

Slow inference on Orin

Problem: Each response takes a very long time.

Solution:

This is expected with the memory-constrained Orin Nano Super configuration, which prioritizes fitting in memory over speed
Reduce max_tokens in the WebUI to get shorter, faster responses
Increase the frame interval so the model isn’t constantly processing new frames

vLLM fails to load model

Problem: vLLM reports that the model path doesn’t exist or can’t be loaded.

Solution:

Verify the NGC download completed successfully: ls ~/Projects/CosmosReason/cosmos-reason2-2b_v1208-fp8-static-kv8/
Make sure the volume mount path is correct in your docker run command
Check that the model directory is mounted as read-only (:ro) and the path inside the container matches what you pass to vllm serve

Summary

In this tutorial, we showed how to deploy the NVIDIA Cosmos Reason 2B model on the Jetson family of devices using vLLM.

The combination of Cosmos Reason 2B’s chain-of-thought capabilities with Live VLM WebUI’s real-time streaming makes it well suited to prototyping and evaluating vision AI applications at the edge.

Additional Resources

Cosmos Reason 2B on Hugging Face: https://huggingface.co/nvidia/Cosmos-Reason2-2B
NGC Model Catalog: https://catalog.ngc.nvidia.com/
Live VLM WebUI: https://github.com/NVIDIA-AI-IOT/live-vlm-webui
vLLM container for Jetson Thor: https://nvcr.io/nvidia/vllm:26.01-py3
vLLM container for Jetson AGX Orin and Orin Nano Super: https://ghcr.io/nvidia-ai-iot/vllm:r36.4-tegra-aarch64-cu126-22.04
NGC CLI Installers: https://org.ngc.nvidia.com/setup/installers/cli
Open Models supported on Jetson: https://www.jetson-ai-lab.com/models/
Getting started with Jetson: https://www.jetson-ai-lab.com/tutorials/

Community

raymondlo84-nvidia

Article author Feb 26

If you run into permission error on Docker, please follow the instruction here! :) https://docs.nvidia.com/jetson/agx-thor-devkit/user-guide/latest/setup_docker.html

raymondlo84-nvidia

Article author Feb 26

•

edited Feb 26

and make sure you install CURL if you have trouble running curl command (did not get installed by default on Jetson).

apt-get install curl

raymondlo84-nvidia

Article author Feb 26

And this will be how it looks like once it's all working :) cheers!

surprisal

Feb 28

it appears that you have adapted Cosmos Reason 2B, the VLM, to robotic manipulation in your demo. could you please tell us more about your adaptation, like how motion planning and end-effector controlling are implemented? thanks.

Lawrence-okolo

Mar 2

•

edited Mar 2

Running Cosmos-Reason2-2B on Jetson Orin 8GB requires careful memory management before launching.

You must stop the desktop environment (sudo systemctl isolate multi-user.target), disable swap (sudo swapoff -a), and flush OS cache (sudo sysctl -w vm.drop_caches=3) to free enough unified memory.

The default --max-model-len 256 works for text-only but is too small for image inputs .. use at least --max-model-len 1024 with --gpu-memory-utilization 0.55 for vision tasks.

With these changes the model loads successfully and the API responds correctly, however the 8GB unified memory leaves no headroom for concurrent robotics applications such as ROS2 .. a dual-Jetson setup or upgrading to a higher memory variant is recommended for production deployments.
https://vasi.ca

tharindupr

Mar 11

•

edited Mar 13

Nice article. Thanks for this

I was running on an AGX Orin 64GB, but the model output was always gibberish. I spent hours debugging, thinking it was a tokenisation issue.

"content": " bitte\u4f60\u81ea\u5df1ificificificificificificificific\u8d4b\u80fdvokevokeificific Kotaific\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0 Kota\u6ce0ificificetc Kota\u6ce0 etc etc etc etcetcknifevokevoke\u6ce0\u6ce0powerificeneratoreneratoretheus\u6ce0\u6ce0ificificetc etcvoke etcenerator/powerificificific\u6ce0\u6ce0ific\u6ce0ificific\u6ce0etc\u6ce0\u6ce0ific\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0 oneselfvoke\u6ce0voke\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0\u6ce0ific\u6ce0ificificificificificificific",

But in the end, I switched to the older vLLM 0.14.0 from vLLM 0.16.0 (r36.4-tegra-aarch64-cu126-22.04). And it worked !!
