Cosmos-Reason2-2B-NVFP4A16

An optimized version of nvidia/Cosmos-Reason2-2B created with quantization, reducing GPU memory usage and improving inference efficiency while maintaining high-quality multimodal reasoning performance.

This model was created by quantizing the base model to FP4 weights while keeping activations in FP16 precision. The model preserves the reasoning capabilities of the original Cosmos-Reason2-2B model while significantly reducing the memory footprint of model weights.
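As a rough illustration of the saving from FP4 weights, the weight storage can be estimated directly from the bit widths. This is a back-of-the-envelope sketch only (it assumes ~2B weight parameters and ignores embeddings, quantization scales, and any layers kept in higher precision):

```python
# Back-of-the-envelope weight-memory estimate for a ~2B-parameter model.
# Illustrative only: exact sizes depend on embeddings, quantization scales,
# and non-quantized layers.

def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 2e9
print(f"BF16 weights:  ~{weight_memory_gb(params, 16):.1f} GB")  # ~4.0 GB
print(f"NVFP4 weights: ~{weight_memory_gb(params, 4):.1f} GB")   # ~1.0 GB
```

In other words, quantizing the weights from 16 bits to 4 bits cuts their footprint by roughly 4x, while FP16 activations keep runtime numerics close to the original model.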

NVFP4 quantization is a Blackwell-generation feature (e.g., supported in Jetson AGX Thor). Earlier architectures (Ampere, Ada, Hopper) do not provide native hardware support for NVFP4 tensor core acceleration.

Embedl's proprietary optimizations and architectural enhancements require patching vLLM for more efficient inference. These updates will be released at a later date; for now, the model can be used with vLLM through the NVIDIA Jetson container.

For systems that do not support FP4, we recommend using embedl/Cosmos-Reason2-2B-W4A16.
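The hardware note above can be turned into a quick runtime check. The sketch below is a heuristic, not an official compatibility test: the ">= 10" threshold is an assumption based on Blackwell-generation compute capabilities (e.g., sm_100/sm_110/sm_120); verify against NVIDIA's documentation for your exact part.

```python
# Heuristic check: is the local GPU Blackwell-generation (and thus likely to
# have native NVFP4 tensor-core support)? The threshold is an assumption;
# consult NVIDIA's compute-capability table for your specific device.

def likely_supports_nvfp4(cc_major: int) -> bool:
    """Blackwell-generation GPUs report compute capability major >= 10."""
    return cc_major >= 10

if __name__ == "__main__":
    try:
        import torch  # used only for the device query
        if torch.cuda.is_available():
            major, minor = torch.cuda.get_device_capability()
            verdict = "likely has" if likely_supports_nvfp4(major) else "lacks"
            print(f"Compute capability {major}.{minor}: {verdict} native NVFP4 support")
        else:
            print("No CUDA device visible.")
    except ImportError:
        print("torch not installed; cannot query the device.")
```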

Utilization of NVFP4 quantization is backend-dependent. Not all versions of vLLM provide full or optimized support for NVFP4.


Model Details

Field Value
Base Model nvidia/Cosmos-Reason2-2B
Input / Output Text + Image / Video → Text
Release Date 2026-02-24
Version 1.0
Optimizations Quantization (NVFP4A16)
Developers Embedl
Licenses Upstream: NVIDIA Open Model License, Additional Information: Apache License 2.0, Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)
Intended Use Text generation, reasoning, assistant-style interaction, video analytics, planning, and general-purpose NLP on NVIDIA GPUs

Optimizations

  • Quantization (NVFP4A16): a large reduction in memory footprint and latency.

Accuracy

For comparative evaluation, we report benchmark scores on the Physical AI Bench Reason Task.

We have not been able to reproduce the baseline benchmarks reported for nvidia/Cosmos-Reason2-2B on the Physical AI Bench Leaderboard; see the related issue: https://github.com/nvidia-cosmos/cosmos-reason2/issues/52

Overall + Category Scores

Model Overall Embodied Reasoning Common Sense
nvidia/Cosmos-Reason2-2B 50.66 n/a n/a
embedl/Cosmos-Reason2-2B-NVFP4A16 49.84 50.16 49.50
embedl/Cosmos-Reason2-2B-W4A16 48.68 50.49 46.85
embedl/Cosmos-Reason2-2B-W4A16-Edge2 (coming soon) 50.58 53.61 47.52

Subcategory Scores

Model AV Physical World Time Space Agibot HoloAssist RoboFail RoboVQA BridgeData V2
nvidia/Cosmos-Reason2-2B n/a n/a n/a n/a n/a n/a n/a n/a n/a
embedl/Cosmos-Reason2-2B-NVFP4A16 44.00 45.13 52.01 52.50 28.00 58.00 51.00 84.55 32.00
embedl/Cosmos-Reason2-2B-W4A16 36.00 47.79 44.30 53.75 36.00 61.00 42.00 80.91 44.00
embedl/Cosmos-Reason2-2B-W4A16-Edge2 (coming soon) 45.00 44.25 48.66 52.50 32.00 59.00 54.00 85.45 43.00

Performance

Test Hardware: NVIDIA Jetson AGX Thor

Text-only Inference

batch_size=1 (concurrency=1)

Model e2e(s) TPS TPOT(ms) TTFT(ms)
nvidia/Cosmos-Reason2-2B 3.9263 65.20 15.27 16.72
embedl/Cosmos-Reason2-2B-NVFP4A16 2.2147 115.59 8.61 10.17
embedl/Cosmos-Reason2-2B-W4A16 2.2340 114.59 8.67 11.13

batch_size=8 (concurrency=8)

Model e2e(s) TPS TPOT(ms) TTFT(ms)
nvidia/Cosmos-Reason2-2B 5.9494 344.24 22.92 65.90
embedl/Cosmos-Reason2-2B-NVFP4A16 2.5627 799.15 9.88 26.92
embedl/Cosmos-Reason2-2B-W4A16 2.2974 891.44 8.86 24.25

Video-only Inference

batch_size=1 (concurrency=1)

Model Resolution FPS Frames e2e(s) TPS TPOT(ms) TTFT(ms)
nvidia/Cosmos-Reason2-2B 1280x720 4 12 4.5537 56.22 16.73 192.67
embedl/Cosmos-Reason2-2B-NVFP4A16 1280x720 4 12 2.8978 88.34 10.27 186.62
embedl/Cosmos-Reason2-2B-W4A16 1280x720 4 12 2.5977 98.55 9.10 185.63
Additional Benchmarks
Model Resolution FPS Frames e2e(s) TPS TPOT(ms) TTFT(ms)
nvidia/Cosmos-Reason2-2B 854x480 2 6 5.0284 50.91 18.87 124.78
embedl/Cosmos-Reason2-2B-NVFP4A16 854x480 2 6 2.6527 96.50 9.63 115.92
embedl/Cosmos-Reason2-2B-W4A16 854x480 2 6 2.5015 102.34 9.06 113.30
nvidia/Cosmos-Reason2-2B 854x480 4 12 4.5504 56.26 17.01 123.61
embedl/Cosmos-Reason2-2B-NVFP4A16 854x480 4 12 2.6462 96.74 9.60 115.65
embedl/Cosmos-Reason2-2B-W4A16 854x480 4 12 2.5516 100.33 9.23 116.53
nvidia/Cosmos-Reason2-2B 1920x1080 2 6 5.2182 49.06 18.16 408.99
embedl/Cosmos-Reason2-2B-NVFP4A16 1920x1080 2 6 3.7115 68.98 11.99 411.17
embedl/Cosmos-Reason2-2B-W4A16 1920x1080 2 6 3.5969 71.17 11.52 412.00
nvidia/Cosmos-Reason2-2B 1920x1080 4 12 5.2752 48.53 18.38 409.24
embedl/Cosmos-Reason2-2B-NVFP4A16 1920x1080 4 12 3.7154 68.90 12.00 412.31
embedl/Cosmos-Reason2-2B-W4A16 1920x1080 4 12 3.5756 71.60 11.39 415.89

batch_size=8 (concurrency=8)

Model Resolution FPS Frames e2e(s) TPS TPOT(ms) TTFT(ms)
nvidia/Cosmos-Reason2-2B 1280x720 2 6 7.0704 289.66 19.80 778.89
embedl/Cosmos-Reason2-2B-NVFP4A16 1280x720 2 6 5.1466 397.93 12.00 776.65
embedl/Cosmos-Reason2-2B-W4A16 1280x720 2 6 5.0695 403.98 11.64 779.90
nvidia/Cosmos-Reason2-2B 1280x720 4 12 7.0242 291.56 19.61 779.39
embedl/Cosmos-Reason2-2B-NVFP4A16 1280x720 4 12 5.1346 398.87 11.96 779.69
embedl/Cosmos-Reason2-2B-W4A16 1280x720 4 12 5.0521 405.37 11.63 775.99
Additional Benchmarks
Model Resolution FPS Frames e2e(s) TPS TPOT(ms) TTFT(ms)
nvidia/Cosmos-Reason2-2B 854x480 2 6 5.6951 359.61 18.47 351.83
embedl/Cosmos-Reason2-2B-NVFP4A16 854x480 2 6 3.8647 529.93 11.02 349.41
embedl/Cosmos-Reason2-2B-W4A16 854x480 2 6 3.8613 530.39 10.92 350.04
nvidia/Cosmos-Reason2-2B 854x480 4 12 5.6608 361.79 18.33 354.19
embedl/Cosmos-Reason2-2B-NVFP4A16 854x480 4 12 4.0195 509.52 11.55 349.17
embedl/Cosmos-Reason2-2B-W4A16 854x480 4 12 3.9974 512.34 11.43 349.00
nvidia/Cosmos-Reason2-2B 1920x1080 2 6 10.1481 201.81 22.92 1749.06
embedl/Cosmos-Reason2-2B-NVFP4A16 1920x1080 2 6 8.4629 242.00 16.00 1744.43
embedl/Cosmos-Reason2-2B-W4A16 1920x1080 2 6 8.2949 246.90 15.34 1742.22
nvidia/Cosmos-Reason2-2B 1920x1080 4 12 10.4564 195.86 23.66 1746.96
embedl/Cosmos-Reason2-2B-NVFP4A16 1920x1080 4 12 8.3780 244.45 15.64 1741.12
embedl/Cosmos-Reason2-2B-W4A16 1920x1080 4 12 8.2769 247.43 15.30 1741.64

Image-only Inference

batch_size=1 (concurrency=1)

Model Resolution e2e(s) TPS TPOT(ms) TTFT(ms)
nvidia/Cosmos-Reason2-2B 1280x720 4.8338 52.96 16.96 28.90
embedl/Cosmos-Reason2-2B-NVFP4A16 1280x720 3.0668 83.48 9.68 27.34
embedl/Cosmos-Reason2-2B-W4A16 1280x720 3.0483 83.98 9.81 15.44
Additional Benchmarks
Model Resolution e2e(s) TPS TPOT(ms) TTFT(ms)
nvidia/Cosmos-Reason2-2B 854x480 4.6755 54.75 16.79 27.95
embedl/Cosmos-Reason2-2B-NVFP4A16 854x480 3.2376 79.07 10.24 28.37
embedl/Cosmos-Reason2-2B-W4A16 854x480 3.1369 81.61 10.18 15.95
nvidia/Cosmos-Reason2-2B 1920x1080 4.9062 52.18 16.89 28.89
embedl/Cosmos-Reason2-2B-NVFP4A16 1920x1080 2.8774 88.97 9.78 28.01
embedl/Cosmos-Reason2-2B-W4A16 1920x1080 2.8439 90.02 9.71 15.53

batch_size=8 (concurrency=8)

Model Resolution e2e(s) TPS TPOT(ms) TTFT(ms)
nvidia/Cosmos-Reason2-2B 1280x720 10.6920 191.55 25.67 74.92
embedl/Cosmos-Reason2-2B-NVFP4A16 1280x720 8.0958 252.97 13.77 60.15
embedl/Cosmos-Reason2-2B-W4A16 1280x720 9.1849 222.97 12.98 28.81
Additional Benchmarks
Model Resolution e2e(s) TPS TPOT(ms) TTFT(ms)
nvidia/Cosmos-Reason2-2B 854x480 10.4721 195.57 25.16 73.86
embedl/Cosmos-Reason2-2B-NVFP4A16 854x480 7.0134 292.01 13.61 63.15
embedl/Cosmos-Reason2-2B-W4A16 854x480 6.9387 295.15 12.96 28.91
nvidia/Cosmos-Reason2-2B 1920x1080 11.3606 180.27 25.21 76.51
embedl/Cosmos-Reason2-2B-NVFP4A16 1920x1080 7.7989 262.60 13.22 29.21
embedl/Cosmos-Reason2-2B-W4A16 1920x1080 6.2494 327.71 12.81 28.80

Measurement setup: NVIDIA vLLM 26.01, 256 tokens generated, 10 warm-up runs, averaged over 25 runs.
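The reported throughput numbers are internally consistent with the 256 generated tokens: TPS is approximately tokens generated divided by end-to-end latency. A quick cross-check on the batch_size=1 text-only rows (values copied from the table above):

```python
# Cross-check: TPS ≈ generated tokens / end-to-end latency.
# Values copied from the batch_size=1 text-only table above.
TOKENS = 256

rows = {
    "nvidia/Cosmos-Reason2-2B": (3.9263, 65.20),
    "embedl/Cosmos-Reason2-2B-NVFP4A16": (2.2147, 115.59),
    "embedl/Cosmos-Reason2-2B-W4A16": (2.2340, 114.59),
}

for name, (e2e_s, reported_tps) in rows.items():
    derived_tps = TOKENS / e2e_s
    print(f"{name}: derived {derived_tps:.2f} TPS vs reported {reported_tps} TPS")
```

The derived and reported values agree to within rounding, so e2e(s) and TPS can be read as two views of the same measurement.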


Usage Examples

vLLM Video Inference

vLLM image: NVIDIA vLLM 26.01

Test Hardware: NVIDIA Jetson AGX Thor

--gpu-memory-utilization and --max-model-len should be adapted to system specifications (i.e., available RAM).

docker run --rm -it \
  --network host \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm-serve \
  -e HF_TOKEN=hf_*** \
  -e HF_HOME=/root/.cache/huggingface \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "embedl/Cosmos-Reason2-2B-NVFP4A16" \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.9
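Once the server is up, it exposes an OpenAI-compatible /v1/chat/completions endpoint. The stdlib-only client below is a minimal sketch: the host/port match the serve command above, and per-request frame-rate control is not included because support for extra video fields can vary across server versions.

```python
# Minimal stdlib client for the vLLM OpenAI-compatible server started above.
# Assumes the server is reachable at http://localhost:8000; adjust as needed.
import json
import urllib.request

def build_chat_payload(model: str, video_url: str, prompt: str) -> dict:
    """Build an OpenAI-style chat payload with one video and one text part."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": prompt},
                ],
            },
        ],
        "max_tokens": 256,
    }

def chat(payload: dict, host: str = "http://localhost:8000") -> str:
    """POST the payload and return the generated text."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

payload = build_chat_payload(
    "embedl/Cosmos-Reason2-2B-NVFP4A16",
    "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4",
    "Describe this video in detail.",
)
# With the server running:
# print(chat(payload))
```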

gpu_memory_utilization and max_num_seqs should be adapted to system specifications (i.e., available RAM).

from vllm import LLM, SamplingParams

if __name__ == "__main__":

    model = "embedl/Cosmos-Reason2-2B-NVFP4A16"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant."}
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {"url": video_url, "fps": 4},
                },
                {
                    "type": "text",
                    "text": "Describe this video in detail.",
                },
            ],
        },
    ]

    llm = LLM(
        model=model,
        limit_mm_per_prompt={
            "video": {
                "count": 1,
                "num_frames": 12,
                "width": 1920,
                "height": 1080,
            },
            "image": 0,
            "audio": 0,
        },
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=16384,
        mm_processor_kwargs={"truncation": False},
        disable_log_stats=False,
        gpu_memory_utilization=0.9,
    )

    output = llm.chat(
        messages,
        sampling_params=SamplingParams(max_tokens=256),
    )
    print(output[0].outputs[0].text)

Transformers Inference

Test Hardware: NVIDIA H200 GPU

Adapted from nvidia/Cosmos-Reason2-2B.

import torch
import transformers

if __name__ == "__main__":
    model_name = "embedl/Cosmos-Reason2-2B-NVFP4A16"
    model = transformers.Qwen3VLForConditionalGeneration.from_pretrained(
        model_name,
        device_map="auto",
        attn_implementation="sdpa",
        dtype="bfloat16",
    )
    processor: transformers.Qwen3VLProcessor = (
        transformers.AutoProcessor.from_pretrained(model_name)
    )
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    video_messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_url, "fps": 4},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

    # Process inputs
    inputs = processor.apply_chat_template(
        video_messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        truncation=False,
        fps=4,
    )
    inputs = inputs.to(model.device)

    # Run inference
    generated_ids = model.generate(**inputs, max_new_tokens=8192)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :]
        for in_ids, out_ids in zip(
            inputs.input_ids, generated_ids, strict=False
        )
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    print(output_text[0])
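The same Transformers pipeline handles single images; only the user message changes. The sketch below shows the message construction (the image URL is the Qwen-VL demo image also used in the output example below; the processor, model, and generation steps are unchanged from the video example):

```python
# Image inference uses the same pipeline as the video example above; only the
# user message changes. Sketch of the message construction.
image_url = (
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
)

image_messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    },
]
# Pass image_messages to processor.apply_chat_template(...) exactly as in the
# video example (omit the fps argument for images).
```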

Output Example

system: You are a helpful assistant.

user: Describe this video in detail: https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4

The video captures a scene of a highway winding through a wooded area. The road is divided into two lanes, with vehicles traveling in both directions. Cars of various colors and models are seen moving along the highway, some maintaining a steady pace while others navigate curves with apparent ease. Drivers appear attentive, adhering to traffic regulations and ensuring safe navigation through the bends. Amidst this routine movement, a lone individual in a red jacket stands out, seemingly lost in thought or perhaps enjoying the solitude of the surroundings.

On the right side of the highway, there is a pedestrian path that runs parallel to the road. This path is bordered by leafless trees, suggesting it might be late autumn or early spring. Sparse vegetation lines the edge of the path, blending seamlessly with the natural landscape. The trees stand tall and bare, their branches stark against the clear sky.

Prominent in the foreground on the right side of the frame is a distinctive yellow and black striped pole, likely a warning or safety marker for drivers. The pole stands out due to its bright colors, contrasting sharply with the muted tones of the surrounding environment.

The overall atmosphere of the video is one of calmness and orderliness, with the steady flow of traffic and the peaceful setting creating a harmonious blend of human activity and nature. The absence of any visible congestion or adverse weather conditions adds to the serene ambiance of the scene.


system: You are a helpful assistant.

user: Describe this image in detail: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg

The image captures a serene moment on a sandy beach during sunset. A woman is seated on the sand, her body angled towards the right side of the frame. She is dressed in a plaid shirt and black pants, with her hair loosely tied back. Her face is lit up with a warm smile, reflecting a sense of joy and contentment. She is holding a small object in her hands, possibly a piece of jewelry or a token, which she seems to be examining closely.

Beside her, a golden Labrador Retriever sits patiently. The dog is adorned with a blue harness featuring paw prints, adding a charming detail to its appearance. The Labrador's fur is a rich golden color, contrasting beautifully with the soft hues of the sunset. The dog gazes attentively at the woman, creating a bond between them.

The background reveals a tranquil seascape with gentle waves lapping at the shore. The horizon is tinged with the warm glow of the setting sun, casting a golden hue over the entire scene. The sky above is a soft gradient of colors, transitioning from a pale blue to a warm orange near the horizon. The overall atmosphere of the image is peaceful and idyllic, evoking a sense of harmony between the woman, her beloved dog, and ...


License

Built on NVIDIA Cosmos

This model is a derivative of nvidia/Cosmos-Reason2-2B.

Licensed by NVIDIA Corporation under the NVIDIA Open Model License


Contact

Enterprise & Commercial Inquiries contact@embedl.com

Technical Issues & Early Access https://github.com/embedl/embedl-models

More Information & Model Releases https://embedl.com


Partner & Developer Opportunities

If you are evaluating on-device inference, building products on this model, or exploring custom model optimization, reach out for:

  • Engineering support for on-prem/edge deployments
  • Early access & partner co-marketing opportunities

Contact: contact@embedl.com

