Benchmarks + Report: Optimized Cosmos-Reason2 (Qwen3-VL) for on-device inference on 8GB RAM (Jetson Orin Nano Super)

Community Article Published February 28, 2026

Hej, researcher from Embedl here! Leading up to NVIDIA GTC we have been focusing on getting nvidia/Cosmos-Reason2-2B (a fine-tuned variant of Qwen3-VL) edge-ready, meaning enabling it for the full Jetson lineup: from 8GB RAM on the Jetson Orin Nano to 64GB RAM on the Jetson AGX Orin, up to 128GB RAM on the Jetson AGX Thor ~ the last one a bit overkill. :)

This work spans from the very first quantized variant, embedl/Cosmos-Reason2-2B-W4A16, to our most recent release, embedl/Cosmos-Reason2-2B-W4A16-Edge2, for which we ran an extensive search over mixed-precision settings to find an optimal variant with a near-zero drop in accuracy compared to the full FP16 baseline while matching the on-device performance of W4A16.

  • All benchmarks run on real hardware, locally on the NVIDIA Jetson lineup with vllm serve
  • Accuracy (vision and reasoning capabilities) evaluated on the Physical AI Bench tasks
  • Benchmarks comparing NVFP4A16 and W4A16 on AGX Thor
  • Easy to try out on-device with vLLM
  • Several open issues filed with the open-source community as another outcome of this research

PAI Reason Task Evaluation

Overall + Category Scores

| Model | Overall | Embodied Reasoning | Common Sense |
|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | 50.60 | 53.93 | 47.19 |
| embedl/Cosmos-Reason2-2B-W4A16 | 48.68 | 50.49 | 46.85 |
| embedl/Cosmos-Reason2-2B-W4A16-Edge2 | 50.58 | 53.61 | 47.52 |
| embedl/Cosmos-Reason2-2B-NVFP4A16 | 49.84 | 50.16 | 49.50 |

Subcategory Scores

| Model | AV | Physical World | Time | Space | Agibot | HoloAssist | RoboFail | RoboVQA | BridgeData V2 |
|---|---|---|---|---|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | 44.00 | 46.90 | 45.30 | 55.00 | 34.00 | 60.00 | 49.00 | 90.91 | 42.00 |
| embedl/Cosmos-Reason2-2B-W4A16 | 36.00 | 47.79 | 44.30 | 53.75 | 36.00 | 61.00 | 42.00 | 80.91 | 44.00 |
| embedl/Cosmos-Reason2-2B-W4A16-Edge2 | 45.00 | 44.25 | 48.66 | 52.50 | 32.00 | 59.00 | 54.00 | 85.45 | 43.00 |
| embedl/Cosmos-Reason2-2B-NVFP4A16 | 44.00 | 45.13 | 52.01 | 52.50 | 28.00 | 58.00 | 51.00 | 84.55 | 32.00 |

Background: Cosmos-Reason2 and Qwen3-VL

Cosmos-Reason2 is essentially a fine-tuned Qwen3-VL with the same multi-modal interface (text + image/video → text).

Cosmos is fine-tuned in particular for temporal/physical reasoning tasks and planning, while Qwen3-VL is more general "world knowledge + detailed description." In essence, Cosmos covers similar use cases to Qwen3-VL, with added embodied reasoning for video/physics contexts.

Fun fact: To the question "Who are you?" the Cosmos model always replies something along the lines of "I am Qwen..." :D

Here is what we found:

Some layers are very sensitive to quantization. Our first release, W4A16, was the very first model enabling deployment on Jetson Orin Nano. Objectively, it is a great model, with a drop of ~2 percentage points in accuracy compared to the baseline model. However, we wanted to see how far we could reduce that drop, so we applied our EdgeN quantization search algorithm, resulting in the W4A16-Edge2 version with a mere 0.02-percentage-point drop in accuracy. EdgeN produces the full Pareto front (accuracy-latency trade-off) of optimal models by excluding sensitive layers from quantization (among a few other tricks :) ).
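To make the idea of excluding sensitive layers concrete, here is a minimal sketch of a greedy layer-sensitivity selection. This is NOT Embedl's EdgeN implementation; the function, layer names, and scores are all hypothetical, and a real search would also trade off latency and combine further tricks.

```python
# Hypothetical sketch: keep in FP16 the layers whose exclusion from
# W4 quantization recovers the most accuracy (greedy, toy version).

def select_fp16_layers(per_layer_accuracy, base_accuracy, budget):
    """per_layer_accuracy: eval score when only that layer stays in FP16.
    base_accuracy: score with every layer quantized to W4.
    budget: how many layers we can afford to keep in FP16."""
    # Sensitivity = accuracy recovered by keeping the layer in FP16.
    gains = {name: acc - base_accuracy for name, acc in per_layer_accuracy.items()}
    # Rank layers that actually help, most sensitive first.
    ranked = sorted((n for n, g in gains.items() if g > 0),
                    key=lambda n: gains[n], reverse=True)
    return ranked[:budget]

# Toy scores (made up): attention output projections as the sensitive layers.
scores = {"blocks.0.attn.o_proj": 50.1, "blocks.0.mlp.down_proj": 48.9,
          "blocks.1.attn.o_proj": 49.7, "blocks.1.mlp.down_proj": 48.7}
print(select_fp16_layers(scores, base_accuracy=48.7, budget=2))
# -> ['blocks.0.attn.o_proj', 'blocks.1.attn.o_proj']
```

Sweeping `budget` from 0 to the number of layers traces out an accuracy-vs-footprint curve, which is one simple way to think about the Pareto front mentioned above.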

NVFP4A16 may not be optimal for all tensors. When we first compared FP4 vs. INT4 weights on AGX Thor we were, to be honest, a bit underwhelmed. Our experiments and previous research have shown that using NVFP4 for all tensors is not a good idea. This model would also benefit from a more sophisticated search like the one we ran for the Edge2 variant. And for such a small 2B-parameter model, the AGX Thor with 128GB RAM may be a bit overpowered anyway; we may see more benefit from FP4 at higher batch size / concurrency.

What are your experiences?

Is NVFP4 worth it? For now, at least for the small 2B Cosmos, getting full use out of FP4 weights depends heavily on the inference stack.

So, how do these models perform on device?

We benchmarked across the three modalities (text, image, video), three devices (Orin Nano Super, AGX Orin, AGX Thor), three resolutions (1920x1080 FHD, 1280x720 HD, 854x480), with 6 and 12 frames, and at single concurrency and concurrency 8 (batch size 8).
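For a sense of scale, the benchmark grid above can be enumerated as a cross product per modality. A small sketch (our assumption: the resolution axis applies to image and video, and the frame-count axis to video only, so the text runs vary only by device and concurrency):

```python
# Enumerate the benchmark configurations described above.
from itertools import product

devices = ["Orin Nano Super", "AGX Orin", "AGX Thor"]
resolutions = ["1920x1080", "1280x720", "854x480"]
frames = [6, 12]        # video only
concurrency = [1, 8]

video_runs = list(product(devices, resolutions, frames, concurrency))
image_runs = list(product(devices, resolutions, concurrency))
text_runs = list(product(devices, concurrency))
print(len(video_runs), len(image_runs), len(text_runs))  # -> 36 18 6
```

That is 60 configurations per model, before even varying the quantization variant.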

Is there any setup / benchmark you are missing here?


The baseline nvidia/Cosmos-Reason2-2B runs out of memory (OOM) on Jetson Orin Nano. The Edge Inference Benchmarks space will be released shortly; for now, benchmarks are available on the model cards.

Explore Models here: collections/embedl/cosmos-reason2

Quickstart: vllm serve

docker run --rm -it \
  --network host \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm-serve \
  -e HF_TOKEN=hf_*** \
  -e HF_HOME=/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16-Edge2" \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75 \
    --max-num-seqs 2
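Once the server is up, it exposes an OpenAI-compatible API (on port 8000 by vLLM's default; the endpoint below is an assumption about your local setup). A minimal client sketch using only the standard library; the `video_url` content part is vLLM's multi-modal extension of the OpenAI chat schema:

```python
import json

# Assumed local endpoint for the vllm serve command above.
URL = "http://localhost:8000/v1/chat/completions"

def build_request(video_url: str, prompt: str) -> dict:
    """OpenAI-style chat payload with vLLM's `video_url` content part."""
    return {
        "model": "embedl/Cosmos-Reason2-2B-W4A16-Edge2",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "video_url", "video_url": {"url": video_url}},
                {"type": "text", "text": prompt},
            ]},
        ],
        "max_tokens": 512,
    }

payload = build_request(
    "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/assets/kitchen_fridge_input.mp4",
    "Describe the sequence of actions from start to end.",
)

# To send it (requires the server from the docker command above to be running):
#   import urllib.request
#   req = urllib.request.Request(URL, data=json.dumps(payload).encode(),
#                                headers={"Content-Type": "application/json"})
#   body = json.loads(urllib.request.urlopen(req).read())
#   print(body["choices"][0]["message"]["content"])
print(payload["model"])
```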

Real output of embedl/Cosmos-Reason2-2B-W4A16-Edge2, captured on an NVIDIA Jetson Orin Nano Super:

system: You are a helpful assistant.

user: Describe the sequence of actions from start to end, and explain what changes over time in this video: https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/assets/kitchen_fridge_input.mp4

assistant: Initially, the skeleton character stands facing the refrigerator, positioned on the left side of the frame.

The skeleton character then reaches out with its right hand to open the refrigerator door.

As the skeleton opens the refrigerator, the door swings outward, revealing the interior shelves.
