Cosmos-Reason2-8B-W4A16-FlashHead

Optimized version of nvidia/Cosmos-Reason2-8B using quantization and FlashHead, Embedl's efficient replacement for the language model head.

Designed for low-latency inference on NVIDIA GPUs, leveraging:

FlashHead
Quantization (W4A16)
vLLM plugin via flash-head
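As a rough illustration of what W4A16 means, the sketch below stores weights as signed 4-bit integers with one scale per group and converts them back to floating point before use, while activations would stay in 16-bit. It is a generic symmetric, group-wise scheme written for clarity. The group size, rounding rule, and function names are assumptions made for this example and are not taken from Embedl's implementation.

```python
from typing import List, Tuple

INT4_MIN, INT4_MAX = -8, 7  # value range of a signed 4-bit integer


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric 4-bit quantization of one weight group (illustrative only)."""
    max_abs = max(abs(w) for w in weights)
    # One floating-point scale per group; fall back to 1.0 for an all-zero group.
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    codes = [min(INT4_MAX, max(INT4_MIN, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Recover approximate weights; activations are never quantized here."""
    return [c * scale for c in codes]


def quantize(weights: List[float], group_size: int = 4) -> List[Tuple[List[int], float]]:
    """Split the weights into groups and quantize each group on its own."""
    return [
        quantize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


# Round-trip a few hypothetical weights to see the quantization error.
example = [0.7, -0.35, 0.1, 0.0]
for codes, scale in quantize(example):
    print(codes, dequantize_group(codes, scale))
```

Each restored weight differs from the original by at most half a scale step, which is why smaller groups (more scales) usually track the original weights more closely at the cost of extra storage.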

Model Details

Field	Value
Base Model	nvidia/Cosmos-Reason2-8B
Input / Output	Text + Image / Video -> Text
Optimizations	FlashHead LM Head + Quantization (W4A16)
Developers	Embedl
Licenses	Upstream: NVIDIA Open Model License. Optimized components: Embedl Models Community Licence v1.0 (no redistribution)

Benchmarks

Accuracy and on-device latency benchmarks can be explored on embedl/Edge-Inference-Benchmarks.

Installation

pip install flash-head

The flash-head vLLM plugin is required. Once installed, it activates automatically when vLLM starts.
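To check that a plugin is visible to vLLM after installation, one option is to list the Python entry points in the `vllm.general_plugins` group, which is the group vLLM scans for general plugins. This is a minimal sketch: it assumes flash-head registers itself through that group, and the name it registers under is not stated here, so the output should be read as a list of candidates rather than a definitive check.

```python
from importlib.metadata import entry_points
from typing import List

# Entry-point group that vLLM scans for general plugins.
VLLM_PLUGIN_GROUP = "vllm.general_plugins"


def list_vllm_plugins(group: str = VLLM_PLUGIN_GROUP) -> List[str]:
    """Return the names of installed entry points in the given group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selectable collection.
        selected = eps.select(group=group)
    else:
        # Older Pythons return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


# An empty list means no plugin registered in this group was found.
print(list_vllm_plugins())
```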

Usage Examples

vLLM Serve

vllm serve embedl/Cosmos-Reason2-8B-W4A16-FlashHead \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75
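Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library and assumes the server listens on vLLM's default `http://localhost:8000`; the helper names `build_chat_request` and `send_chat_request` are made up for this example.

```python
import json
import urllib.request

# Assumption: the server runs locally on vLLM's default port.
BASE_URL = "http://localhost:8000/v1"
MODEL = "embedl/Cosmos-Reason2-8B-W4A16-FlashHead"


def build_chat_request(prompt: str, video_url: str, max_tokens: int = 256) -> dict:
    """Build a chat completion payload with one video and one text part."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "temperature": 0.0,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, base_url: str = BASE_URL, timeout: float = 300.0) -> str:
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, calling `send_chat_request(build_chat_request("Describe this video in detail.", video_url))` with a reachable `video_url` should return the model's description as a string.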

vLLM Video Inference

from vllm import LLM, SamplingParams

if __name__ == "__main__":
    model = "embedl/Cosmos-Reason2-8B-W4A16-FlashHead"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url, "fps": 4}},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

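    llm = LLM(
        model=model,
        # Allow one video per prompt and no images or audio. The frame count
        # and resolution describe the largest video expected per prompt.
        limit_mm_per_prompt={
            "video": {"count": 1, "num_frames": 12, "width": 1280, "height": 720},
            "image": 0,
            "audio": 0,
        },
        # num_frames=-1 asks the video loader to keep every decoded frame.
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=8192,  # same context length as the serve example
        # Forwarded to the multimodal processor: do not truncate the inputs.
        mm_processor_kwargs={"truncation": False},
        gpu_memory_utilization=0.75,  # fraction of GPU memory vLLM may use
        trust_remote_code=True,
    )

    # temperature=0.0 selects greedy decoding; the reply is capped at 256 tokens.
    output = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=256))
    print(output[0].outputs[0].text)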