# Cosmos-Reason2-2B-W4A16-Edge2-FlashHead
Optimized version of nvidia/Cosmos-Reason2-2B using quantization, targeted mixed-precision exclusions, and FlashHead, Embedl's efficient replacement for the language model head.
This model builds on embedl/Cosmos-Reason2-2B-W4A16-Edge2 and adds FlashHead, a lightweight replacement for the dense LM head that significantly improves throughput while preserving accuracy.
This model was created by quantizing the base language model to INT4 weights while keeping activations and a small selection of sensitive layers in FP16 precision. It preserves the Text + Image / Video -> Text behavior and reasoning capabilities of the base model while reducing the memory footprint, enabling Cosmos-Reason2 to run even on an NVIDIA Jetson Orin Nano Super.
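As a rough back-of-the-envelope illustration of the memory saving (ignoring the FP16-excluded layers, embeddings, KV cache, and runtime overhead), the weight storage of a ~2B-parameter model at different precisions can be estimated as:

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return num_params * bits_per_weight / 8 / 2**30

params = 2e9  # ~2B parameters (approximate)
fp16 = weight_memory_gib(params, 16)  # ~3.7 GiB
int4 = weight_memory_gib(params, 4)   # ~0.9 GiB
print(f"FP16: {fp16:.1f} GiB, INT4: {int4:.1f} GiB, ratio: {fp16 / int4:.0f}x")
```

The roughly 4x reduction in weight memory is what brings the model within reach of memory-constrained devices like the Jetson Orin Nano Super.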
FlashHead requires the Embedl Docker container for inference on edge devices. See the Usage Examples section for available container images per platform.
## Output Example

Real output captured on an NVIDIA Jetson Orin Nano Super.
system: You are a helpful assistant.
user: Describe this video in detail: https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4
assistant: The video showcases a dynamic scene of vehicular movement along a multi-laned highway, set against a picturesque backdrop of leafless trees and barren landscapes indicative of a winter or early spring setting. The highway features two lanes in each direction, separated by a median strip bordered by grassy embankments. Vehicles are traveling in both directions, with traffic flowing smoothly in the opposite directions.
In the foreground, a white sedan is prominently visible, moving away from the camera towards the right side of the frame. Its rear lights are illuminated, suggesting it might be braking or slowing down. Further down the road, a red hatchback follows closely behind, maintaining a safe distance from the white sedan. Further ahead, a silver hatchback is seen driving in the same direction as the white sedan, maintaining a consistent pace.
On the opposite side of the highway, a white pickup truck is traveling in the same direction as the white sedan, maintaining a steady speed. Further ahead, a blue hatchback is also traveling in the same direction as the white sedan, keeping a safe distance from the other vehicles. Additional vehicles can be seen further down the road, including a red hatchback and a black hatchback, all moving in the same direction as the white sedan.
## Model Details
| Field | Value |
|---|---|
| Model | embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead |
| Base Model | nvidia/Cosmos-Reason2-2B |
| Input / Output | Text + Image / Video -> Text |
| Release Date | 2026-03-11 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head + Quantization (W4A16) + Mixed Precision Exclusions (Edge2) |
| Developers | Embedl |
| Licenses | Upstream: NVIDIA Open Model License. Additional Information: Apache License 2.0. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, video analytics, planning, and general-purpose multimodal NLP on NVIDIA GPUs |
## Optimizations
- FlashHead LM Head: lightweight replacement for the dense LM head, significantly improving throughput.
- Quantization (W4A16): INT4 weights + FP16 activations.
- Edge2 mixed exclusions: Keep sensitive layers in FP16 precision.
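The W4A16 scheme above can be sketched with a minimal NumPy example of symmetric per-group INT4 weight quantization with FP16 scales. This is illustrative only; the group size, rounding scheme, and kernels used in the released weights may differ.

```python
import numpy as np

def quantize_w4(w: np.ndarray, group_size: int = 128):
    """Symmetric per-group INT4 quantization: FP weights -> INT4 codes + FP16 scales."""
    groups = w.reshape(-1, group_size)
    # One scale per group, mapping the largest-magnitude weight to the INT4 edge.
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_w4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate FP16 weights from INT4 codes and per-group scales."""
    return q.astype(np.float16) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(256 * 128).astype(np.float32)
q, scale = quantize_w4(w)
w_hat = dequantize_w4(q, scale).reshape(-1)
err = float(np.abs(w - w_hat).mean())
print(f"codes in [{q.min()}, {q.max()}], mean abs error ~ {err:.3f}")
```

The Edge2 exclusions address exactly the layers where this rounding error would hurt most: those stay in FP16 instead of being quantized.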
## Performance
On-device performance benchmarks can be explored on embedl/Edge-Inference-Benchmarks.
## Usage Examples
FlashHead requires the Embedl Docker container with patched vLLM.
| Platform | Image |
|---|---|
| NVIDIA Jetson Orin | embedl/vllm:latest-jetson-orin-flashhead |
| NVIDIA Jetson Thor | embedl/vllm:latest-jetson-thor-flashhead |
### vLLM Video Inference
Docker image: embedl/vllm:latest-jetson-orin-flashhead
Test Hardware (serve command): NVIDIA Jetson AGX Orin
```shell
docker run --rm -it \
  --network host \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm-serve \
  -e HF_TOKEN=hf_*** \
  -e HF_HOME=/root/.cache/huggingface \
  embedl/vllm:latest-jetson-orin-flashhead \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead" \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75 \
    --max-num-seqs 2 \
    --trust-remote-code
```
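Once the server is up, it exposes vLLM's OpenAI-compatible API (by default on port 8000). A minimal client sketch follows; the host, port, and `fps`-free video payload are assumptions, so adjust them to your deployment:

```python
import json
import urllib.request

def build_chat_request(model: str, video_url: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat completion payload with a video URL."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": prompt},
                ],
            },
        ],
        "temperature": 0.0,
        "max_tokens": 256,
    }

payload = build_chat_request(
    "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead",
    "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4",
    "Describe this video in detail.",
)

# Uncomment to send the request against a running server:
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(json.load(urllib.request.urlopen(req))["choices"][0]["message"]["content"])
```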
Test Hardware (Python example): NVIDIA Jetson AGX Orin, NVIDIA Jetson Orin Nano Super
```python
from vllm import LLM, SamplingParams

if __name__ == "__main__":
    model = "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url, "fps": 4}},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

    llm = LLM(
        model=model,
        limit_mm_per_prompt={
            "video": {"count": 1, "num_frames": 12, "width": 1280, "height": 720},
            "image": 0,
            "audio": 0,
        },
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=8192,
        mm_processor_kwargs={"truncation": False},
        gpu_memory_utilization=0.75,
        max_num_seqs=2,
        trust_remote_code=True,
    )

    output = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=256))
    print(output[0].outputs[0].text)
```
## License
Built on NVIDIA Cosmos
This model is a derivative of nvidia/Cosmos-Reason2-2B.
Licensed by NVIDIA Corporation under the NVIDIA Open Model License
- Upstream: NVIDIA Open Model License
- Additional Information: Apache License 2.0
- Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)
## Contact

- Enterprise and Commercial Inquiries: models@embedl.com
- Technical Issues and Early Access: https://github.com/embedl/embedl-models
- More Information and Model Releases: https://embedl.com
## Partner & Developer Opportunities
If you are evaluating on-device inference, building products on this model, or exploring custom model optimization, reach out for:
- Engineering support for on-prem and edge deployments
- Early access and partner co-marketing opportunities
Contact: models@embedl.com