Benchmarks + Report: Optimized Cosmos-Reason2 (Qwen3-VL) for on-device inference on 8GB RAM (Jetson Orin Nano Super)

Community Article Published February 28, 2026

Hej, researcher from Embedl here! Leading up to NVIDIA GTC we have been focusing on getting nvidia/Cosmos-Reason2-2B (a fine-tuned variant of Qwen3-VL) edge-ready, meaning enabling it for the full Jetson lineup: from 8GB RAM on the Jetson Orin Nano to 64GB RAM on the Jetson AGX Orin, up to 128GB RAM on the Jetson AGX Thor ~ the last one a bit overkill. :)

This work spans from the very first quantized variant, embedl/Cosmos-Reason2-2B-W4A16, to our most recent release, embedl/Cosmos-Reason2-2B-W4A16-Edge2, for which we ran an extensive search over mixed-precision settings to find an optimal variant with a near-zero drop in accuracy compared to the full FP16 baseline while matching the on-device performance of W4A16.

  • All benchmarks run on real hardware, locally on the NVIDIA Jetson lineup with vllm serve
  • Accuracy (vision and reasoning capabilities) evaluated on the Physical AI Bench tasks
  • Benchmarks comparing NVFP4A16 and W4A16 on AGX Thor
  • Easy to try out on-device with vLLM
  • Several open issues filed with the open-source community as another outcome of this research

PAI Reason Task Evaluation

Overall + Category Scores

| Model | Overall | Embodied Reasoning | Common Sense |
|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | 50.60 | 53.93 | 47.19 |
| embedl/Cosmos-Reason2-2B-W4A16 | 48.68 | 50.49 | 46.85 |
| embedl/Cosmos-Reason2-2B-W4A16-Edge2 | 50.58 | 53.61 | 47.52 |
| embedl/Cosmos-Reason2-2B-NVFP4A16 | 49.84 | 50.16 | 49.50 |

Subcategory Scores

| Model | AV | Physical World | Time | Space | Agibot | HoloAssist | RoboFail | RoboVQA | BridgeData V2 |
|---|---|---|---|---|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | 44.00 | 46.90 | 45.30 | 55.00 | 34.00 | 60.00 | 49.00 | 90.91 | 42.00 |
| embedl/Cosmos-Reason2-2B-W4A16 | 36.00 | 47.79 | 44.30 | 53.75 | 36.00 | 61.00 | 42.00 | 80.91 | 44.00 |
| embedl/Cosmos-Reason2-2B-W4A16-Edge2 | 45.00 | 44.25 | 48.66 | 52.50 | 32.00 | 59.00 | 54.00 | 85.45 | 43.00 |
| embedl/Cosmos-Reason2-2B-NVFP4A16 | 44.00 | 45.13 | 52.01 | 52.50 | 28.00 | 58.00 | 51.00 | 84.55 | 32.00 |

Background: Cosmos-Reason2 and Qwen3-VL

Cosmos-Reason2 is essentially a fine-tuned Qwen3-VL with the same multi-modal interface (text + image/video → text).

Cosmos is fine-tuned in particular for temporal/physical reasoning tasks and planning, while Qwen3-VL is more general "world knowledge + detailed description." In essence, Cosmos covers similar use cases to Qwen3-VL, with added embodied reasoning for video/physics contexts.

Fun fact: To the question "Who are you?" the Cosmos model always replies something along the lines of "I am Qwen..." :D

Here is what we found:

Some layers are very sensitive to quantization. Our first release, W4A16, was the very first model enabling deployment on Jetson Orin Nano. Objectively, it is a great model, with a drop of ~2 percentage points in accuracy compared to the baseline model. However, we wanted to see how far we could reduce that drop, so we applied our EdgeN quantization search algorithm, resulting in the W4A16-Edge2 version with a mere 0.02-percentage-point drop in accuracy. EdgeN produces the full Pareto front (accuracy-latency trade-off) of optimal models by excluding sensitive layers from quantization (among a few other tricks :) ).
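To make the idea of excluding sensitive layers concrete, here is a minimal sketch of a greedy layer-sensitivity selection. This is NOT Embedl's EdgeN implementation; the function, layer names, and scores are all hypothetical, and a real search would also trade off latency and combine further tricks.

```python
# Hypothetical sketch: keep in FP16 the layers whose exclusion from
# W4 quantization recovers the most accuracy (greedy, toy version).

def select_fp16_layers(per_layer_accuracy, base_accuracy, budget):
    """per_layer_accuracy: eval score when only that layer stays in FP16.
    base_accuracy: score with every layer quantized to W4.
    budget: how many layers we can afford to keep in FP16."""
    # Sensitivity = accuracy recovered by keeping the layer in FP16.
    gains = {name: acc - base_accuracy for name, acc in per_layer_accuracy.items()}
    # Rank layers that actually help, most sensitive first.
    ranked = sorted((n for n, g in gains.items() if g > 0),
                    key=lambda n: gains[n], reverse=True)
    return ranked[:budget]

# Toy scores (made up): attention output projections as the sensitive layers.
scores = {"blocks.0.attn.o_proj": 50.1, "blocks.0.mlp.down_proj": 48.9,
          "blocks.1.attn.o_proj": 49.7, "blocks.1.mlp.down_proj": 48.7}
print(select_fp16_layers(scores, base_accuracy=48.7, budget=2))
# -> ['blocks.0.attn.o_proj', 'blocks.1.attn.o_proj']
```

Sweeping `budget` from 0 to the number of layers traces out an accuracy-vs-footprint curve, which is one simple way to think about the Pareto front mentioned above.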

NVFP4A16 may not be optimal for all tensors. When we first compared FP4 vs. INT4 weights on AGX Thor we were, to be honest, a bit underwhelmed. Our experiments and previous research have shown that using NVFP4 for all tensors is not a good idea. This model would also benefit from a more sophisticated search like the one we ran for the Edge2 variant. And for such a small 2B-parameter model, the AGX Thor with 128GB RAM may be a bit overpowered anyway; we may see more benefit from FP4 at higher batch size / concurrency.

What are your experiences?

Is NVFP4 worth it? For now, at least for the small 2B Cosmos, getting full use out of FP4 weights depends heavily on the inference stack.

So, how do these models perform on device?

We benchmarked across the three modalities (text, image, video), three devices (Orin Nano Super, AGX Orin, AGX Thor), three resolutions (1920x1080 FHD, 1280x720 HD, 854x480), with 6 and 12 frames, and at single concurrency and concurrency 8 (batch size 8).
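For a sense of scale, the benchmark grid above can be enumerated as a cross product per modality. A small sketch (our assumption: the resolution axis applies to image and video, and the frame-count axis to video only, so the text runs vary only by device and concurrency):

```python
# Enumerate the benchmark configurations described above.
from itertools import product

devices = ["Orin Nano Super", "AGX Orin", "AGX Thor"]
resolutions = ["1920x1080", "1280x720", "854x480"]
frames = [6, 12]        # video only
concurrency = [1, 8]

video_runs = list(product(devices, resolutions, frames, concurrency))
image_runs = list(product(devices, resolutions, concurrency))
text_runs = list(product(devices, concurrency))
print(len(video_runs), len(image_runs), len(text_runs))  # -> 36 18 6
```

That is 60 configurations per model, before even varying the quantization variant.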

Is there any setup / benchmark you are missing here?


The baseline nvidia/Cosmos-Reason2-2B runs out of memory (OOM) on Jetson Orin Nano. The Edge Inference Benchmarks space will be released shortly; for now, benchmarks are available on the model cards.

Explore Models here: collections/embedl/cosmos-reason2

Quickstart: vllm serve

docker run --rm -it \
  --network host \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm-serve \
  -e HF_TOKEN=hf_*** \
  -e HF_HOME=/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16-Edge2" \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75 \
    --max-num-seqs 2
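Once the server is up, it exposes an OpenAI-compatible API (on port 8000 by vLLM's default; the endpoint below is an assumption about your local setup). A minimal client sketch using only the standard library; the `video_url` content part is vLLM's multi-modal extension of the OpenAI chat schema:

```python
import json

# Assumed local endpoint for the vllm serve command above.
URL = "http://localhost:8000/v1/chat/completions"

def build_request(video_url: str, prompt: str) -> dict:
    """OpenAI-style chat payload with vLLM's `video_url` content part."""
    return {
        "model": "embedl/Cosmos-Reason2-2B-W4A16-Edge2",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "video_url", "video_url": {"url": video_url}},
                {"type": "text", "text": prompt},
            ]},
        ],
        "max_tokens": 512,
    }

payload = build_request(
    "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/assets/kitchen_fridge_input.mp4",
    "Describe the sequence of actions from start to end.",
)

# To send it (requires the server from the docker command above to be running):
#   import urllib.request
#   req = urllib.request.Request(URL, data=json.dumps(payload).encode(),
#                                headers={"Content-Type": "application/json"})
#   body = json.loads(urllib.request.urlopen(req).read())
#   print(body["choices"][0]["message"]["content"])
print(payload["model"])
```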

Real output of embedl/Cosmos-Reason2-2B-W4A16-Edge2, captured on an NVIDIA Jetson Orin Nano Super:

system: You are a helpful assistant.

user: Describe the sequence of actions from start to end, and explain what changes over time in this video: https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/assets/kitchen_fridge_input.mp4

assistant: Initially, the skeleton character stands facing the refrigerator, positioned on the left side of the frame.

The skeleton character then reaches out with its right hand to open the refrigerator door.

As the skeleton opens the refrigerator, the door swings outward, revealing the interior shelves.
