Cosmos-Reason2-2B-W4A16

An optimized version of nvidia/Cosmos-Reason2-2B created with post-training quantization, targeting reduced GPU memory usage and improved inference efficiency while maintaining the multimodal reasoning quality of the base model.

This model was created by quantizing the base language model to INT4 weights while keeping activations in FP16 precision. The model preserves the reasoning capabilities of the original Cosmos-Reason2-2B model while significantly reducing the memory footprint of model weights.
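As a back-of-the-envelope illustration of the weight-memory savings from INT4 quantization (using an assumed round 2B parameter count, for illustration only):

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for model weights in GiB."""
    return num_params * bits_per_weight / 8 / 2**30

params = 2e9  # assumed round parameter count, for illustration only
fp16_gib = weight_memory_gib(params, 16)
int4_gib = weight_memory_gib(params, 4)
print(f"FP16: {fp16_gib:.2f} GiB, INT4: {int4_gib:.2f} GiB "
      f"({fp16_gib / int4_gib:.0f}x smaller)")
# → FP16: 3.73 GiB, INT4: 0.93 GiB (4x smaller)
```

The actual on-disk and in-memory savings are somewhat smaller than 4x, since the `lm_head` and vision modules are kept in higher precision (see Quantization Details below).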


Model Details

| Field | Value |
|---|---|
| Base Model | nvidia/Cosmos-Reason2-2B |
| Input / Output | Text + Image / Video → Text |
| Release Date | 2026-02-13 |
| Version | 1.0 |
| Optimizations | Quantization (W4A16) |
| Developers | Embedl |
| Licenses | Upstream: NVIDIA Open Model License. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, video analytics, planning, and general-purpose NLP on NVIDIA GPUs |

Optimizations

  • Quantization (W4A16): large reduction in memory footprint and latency.

Quantization Details

  • Algorithm: AWQ
  • Scheme: W4A16
  • Weight Quantization: INT4 (W4)
  • Activation Precision: FP16 (A16)
  • Ignored Modules: lm_head and visual
  • Calibration Datasets: lmms-lab/flickr30k (image) and gigant/webvid-mini (video)
  • Number of Calibration Samples: 1024 (512 image + 512 video)
  • Max Sequence Length: 16,384
  • Quantization Library: llm-compressor
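For reference, an AWQ W4A16 recipe of the kind llm-compressor consumes looks roughly like the following. This is an illustrative sketch, not the exact recipe used for this model; modifier fields and defaults vary across llm-compressor versions, so consult its documentation before use.

```yaml
# Illustrative llm-compressor recipe sketch for AWQ W4A16 (hypothetical
# values; schema details vary by library version).
quant_stage:
  quant_modifiers:
    AWQModifier:
      ignore: ["lm_head", "re:visual.*"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 4
            type: int
            symmetric: true
            strategy: group
            group_size: 128
```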

Performance

Text-only inference

NVIDIA Jetson Orin Nano Super

| Model | e2e (s) | TPS | TPOT (ms) | TTFT (ms) |
|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | OOM | OOM | OOM | OOM |
| embedl/Cosmos-Reason2-2B-W4A16 | 4.5489 | 56.28 | 17.42 | 84.61 |

NVIDIA Jetson AGX Orin

| Model | e2e (s) | TPS | TPOT (ms) | TTFT (ms) |
|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | 5.6196 | 45.55 | 21.79 | 36.40 |
| embedl/Cosmos-Reason2-2B-W4A16 | 2.5601 | 100.00 | 9.85 | 35.56 |

Video-only inference

NVIDIA Jetson Orin Nano Super

| Model | Resolution | FPS | Frames | e2e (s) | TPS | TPOT (ms) | TTFT (ms) |
|---|---|---|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
| embedl/Cosmos-Reason2-2B-W4A16 | 854x480 | 2 | 6 | 5.1020 | 50.18 | 18.89 | 154.79 |
| embedl/Cosmos-Reason2-2B-W4A16 | 854x480 | 4 | 12 | 5.1017 | 50.18 | 18.88 | 154.10 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1280x720 | 2 | 6 | 5.8696 | 43.61 | 20.79 | 324.38 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1280x720 | 4 | 12 | 5.8635 | 43.66 | 20.80 | 334.08 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1920x1080 | 2 | 6 | 4.8353 | 31.02 | 25.20 | 650.74 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1920x1080 | 4 | 12 | 4.8483 | 30.94 | 25.23 | 657.56 |

NVIDIA Jetson AGX Orin

| Model | Resolution | FPS | Frames | e2e (s) | TPS | TPOT (ms) | TTFT (ms) |
|---|---|---|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | 854x480 | 2 | 6 | 6.0471 | 42.33 | 22.59 | 146.10 |
| embedl/Cosmos-Reason2-2B-W4A16 | 854x480 | 2 | 6 | 3.0156 | 84.89 | 10.77 | 143.20 |
| nvidia/Cosmos-Reason2-2B | 854x480 | 4 | 12 | 6.0397 | 42.39 | 22.63 | 145.79 |
| embedl/Cosmos-Reason2-2B-W4A16 | 854x480 | 4 | 12 | 3.0035 | 85.23 | 10.76 | 144.12 |
| nvidia/Cosmos-Reason2-2B | 1280x720 | 2 | 6 | 6.4872 | 39.46 | 23.58 | 250.17 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1280x720 | 2 | 6 | 3.4448 | 74.31 | 11.75 | 248.53 |
| nvidia/Cosmos-Reason2-2B | 1280x720 | 4 | 12 | 6.4673 | 39.58 | 23.56 | 249.17 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1280x720 | 4 | 12 | 3.4416 | 74.38 | 11.71 | 243.54 |
| nvidia/Cosmos-Reason2-2B | 1920x1080 | 2 | 6 | 7.4784 | 34.23 | 25.92 | 521.39 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1920x1080 | 2 | 6 | 4.4439 | 57.61 | 14.06 | 507.71 |
| nvidia/Cosmos-Reason2-2B | 1920x1080 | 4 | 12 | 7.5190 | 34.05 | 25.93 | 523.36 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1920x1080 | 4 | 12 | 4.4386 | 57.68 | 14.00 | 508.39 |

Image-only inference

NVIDIA Jetson Orin Nano Super

| Model | Resolution | e2e (s) | TPS | TPOT (ms) | TTFT (ms) |
|---|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | OOM | OOM | OOM | OOM | OOM |
| embedl/Cosmos-Reason2-2B-W4A16 | 854x480 | 5.5183 | 42.22 | 20.88 | 107.25 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1280x720 | 5.5010 | 42.36 | 20.87 | 105.69 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1920x1080 | 5.4421 | 42.81 | 20.88 | 105.89 |

NVIDIA Jetson AGX Orin

| Model | Resolution | e2e (s) | TPS | TPOT (ms) | TTFT (ms) |
|---|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | 854x480 | 6.0213 | 38.03 | 23.66 | 58.22 |
| embedl/Cosmos-Reason2-2B-W4A16 | 854x480 | 3.2648 | 71.06 | 11.76 | 55.98 |
| nvidia/Cosmos-Reason2-2B | 1280x720 | 8.6497 | 26.47 | 23.64 | 68.63 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1280x720 | 3.2759 | 70.82 | 11.80 | 54.90 |
| nvidia/Cosmos-Reason2-2B | 1920x1080 | 5.9558 | 38.45 | 23.63 | 57.30 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1920x1080 | 3.1621 | 73.37 | 11.75 | 53.14 |

Measurement setup: NVIDIA vLLM 0.14.0 for Jetson, batch_size=1, max_new_tokens=256, 5 warm-up runs, averaged over 10 runs.


Performance Metric Definitions

e2e Latency (End-to-End Latency): Total time from request submission to completion of the full generated response. This reflects real user-perceived latency. Lower is better.

TPS (Tokens Per Second): Number of output tokens generated per second during the decoding phase. Higher is better.

TPOT (Time Per Output Token): Average time (in milliseconds) required to generate one output token during decoding. Computed as TPOT = (last_token_ts - first_token_ts) / total_output_tokens. Lower is better.

TTFT (Time To First Token): Time from request submission to generation of the first output token. This includes vision encoding, prompt prefill, and KV cache initialization. Lower is better.
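The definitions above can be made concrete with a small helper that derives all four metrics from a request timestamp and per-token timestamps. The timestamps below are made-up example values, not measurements from the tables above.

```python
def latency_metrics(request_ts: float, token_ts: list) -> dict:
    """Compute e2e, TTFT, TPOT, and TPS from timestamps (in seconds)."""
    first, last = token_ts[0], token_ts[-1]
    n = len(token_ts)
    return {
        "e2e_s": last - request_ts,              # end-to-end latency
        "ttft_ms": (first - request_ts) * 1000,  # time to first token
        "tpot_ms": (last - first) / n * 1000,    # per the TPOT formula above
        "tps": n / (last - first),               # decode-phase throughput
    }

# Hypothetical run: first token after 50 ms, then one token every 10 ms.
m = latency_metrics(0.0, [0.05 + 0.01 * i for i in range(10)])
print(m)  # ttft_ms ≈ 50, tpot_ms ≈ 9, tps ≈ 111
```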


Usage Examples

Note (vLLM context length): max_model_len=131072 may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower max_model_len (or increase gpu_memory_utilization).
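To see why long contexts exhaust VRAM, a rough single-sequence KV-cache estimate helps. The layer and head counts below are hypothetical placeholders, not the actual Cosmos-Reason2-2B architecture; substitute the real values from the model's config.json.

```python
def kv_cache_gib(seq_len: int, num_layers: int = 28, num_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GiB for one sequence with FP16 K/V.
    Default architecture values are HYPOTHETICAL placeholders."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V
    return seq_len * per_token / 2**30

print(kv_cache_gib(131072))  # ~14 GiB with these placeholder values
print(kv_cache_gib(8192))    # <1 GiB
```

With these placeholder values, a 131,072-token context needs an order of magnitude more KV-cache memory than the 8,192-token setting used in the examples below, which is why lowering max_model_len resolves the error on memory-constrained devices.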

vLLM Video Inference

vLLM image: NVIDIA vLLM 0.14.0 for Jetson

Test Hardware: NVIDIA Jetson AGX Orin

--gpu-memory-utilization and --max-num-seqs should be adapted to the system's specifications (e.g., available RAM/VRAM).

docker run --rm -it \
  --network host \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm-serve \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16" \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75 \
    --max-num-seqs 2
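Once the container is up, vLLM exposes an OpenAI-compatible /v1/chat/completions endpoint (on port 8000 by default). The sketch below only constructs the request body; POST it to the endpoint with any HTTP client to run inference. The payload shape mirrors the Python example further down.

```python
import json

# Request body for the OpenAI-compatible chat completions endpoint,
# e.g. POST http://localhost:8000/v1/chat/completions (default vLLM port).
payload = {
    "model": "embedl/Cosmos-Reason2-2B-W4A16",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"
                    },
                },
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ],
    "max_tokens": 256,
    "temperature": 0.0,
}
body = json.dumps(payload)
```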

Test Hardware: NVIDIA Jetson AGX Orin, NVIDIA Jetson Orin Nano Super

gpu_memory_utilization and max_num_seqs should be adapted to the system's specifications (e.g., available RAM/VRAM).

from vllm import LLM, SamplingParams

if __name__ == "__main__":

    model = "embedl/Cosmos-Reason2-2B-W4A16"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant."}
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {"url": video_url, "fps": 4},
                },
                {
                    "type": "text",
                    "text": "Describe this video in detail.",
                },
            ],
        },
    ]

    llm = LLM(
        model=model,
        limit_mm_per_prompt={
            "video": {
                "count": 1,
                "num_frames": 12,
                "width": 1920,
                "height": 1080,
            },
            "image": 0,
            "audio": 0,
        },
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=8192,
        mm_processor_kwargs={"truncation": False},
        # System-specific settings - Adapt depending on available RAM
        disable_log_stats=False,
        gpu_memory_utilization=0.75,
        max_num_seqs=2,
    )

    output = llm.chat(
        messages,
        sampling_params=SamplingParams(temperature=0.0, max_tokens=256),
    )
    print(output[0].outputs[0].text)

Transformers Inference

Test Hardware: NVIDIA L4 GPU

Adapted from nvidia/Cosmos-Reason2-2B.

import torch
import transformers

if __name__ == "__main__":
    model_name = "embedl/Cosmos-Reason2-2B-W4A16"
    model = transformers.Qwen3VLForConditionalGeneration.from_pretrained(
        model_name,
        device_map="auto",
        attn_implementation="sdpa",
    )
    processor: transformers.Qwen3VLProcessor = (
        transformers.AutoProcessor.from_pretrained(model_name)
    )
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    video_messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_url, "fps": 4},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

    # Process inputs
    inputs = processor.apply_chat_template(
        video_messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        truncation=False,
        fps=4,
    )
    inputs = inputs.to(model.device)

    # Run inference
    generated_ids = model.generate(**inputs, max_new_tokens=4096)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :]
        for in_ids, out_ids in zip(
            inputs.input_ids, generated_ids, strict=False
        )
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    print(output_text[0])

License

Built on NVIDIA Cosmos

This model is a derivative of nvidia/Cosmos-Reason2-2B.

The base model is licensed by NVIDIA Corporation under the NVIDIA Open Model License.


Contact

Enterprise & Commercial Inquiries sales@embedl.com

Technical Issues & Early Access https://github.com/embedl/embedl-models

More Information & Model Releases https://embedl.com


Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

  • Engineering support for on-prem/edge deployments
  • Early access & partner co-marketing opportunities

Contact: sales@embedl.com

