Cosmos-Reason2-8B-W4A16-FlashHead

Optimized version of nvidia/Cosmos-Reason2-8B using quantization and FlashHead, Embedl's efficient replacement for the language model head.

Designed for low-latency inference on NVIDIA GPUs, leveraging:

  • FlashHead
  • Quantization (W4A16)
  • vLLM plugin via flash-head
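
As a rough illustration of the "W4" part of W4A16 (weights stored in 4 bits, activations kept in 16 bits), the sketch below quantizes a small group of weights to signed 4-bit integers with one shared scale and then restores them. It assumes symmetric per-group scaling, and the function names are hypothetical. The actual scheme used for this model is not described here and may differ.

```python
# Generic illustration of 4-bit weight quantization (the "W4" in W4A16).
# Assumption: symmetric scaling with one scale per group of weights.
# This is NOT taken from the model's implementation.

def quantize_group(weights, bits=4):
    """Map a group of float weights to signed integers plus one float scale."""
    qmax = 2 ** (bits - 1) - 1      # 7 for signed 4-bit values
    qmin = -(2 ** (bits - 1))       # -8 for signed 4-bit values
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the 4-bit range.
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Recover approximate float weights from integers and the group scale."""
    return [v * scale for v in q]


weights = [0.12, -0.70, 0.33, 0.04]
q, scale = quantize_group(weights)
restored = dequantize_group(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per weight is bounded by half the group scale, which is why smaller groups (each with its own scale) usually preserve accuracy better at the cost of storing more scales.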

Model Details

| Field | Value |
| --- | --- |
| Base Model | nvidia/Cosmos-Reason2-8B |
| Input / Output | Text + Image / Video -> Text |
| Optimizations | FlashHead LM Head + Quantization (W4A16) |
| Developers | Embedl |
| Licenses | Upstream: NVIDIA Open Model License. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |

Benchmarks

Accuracy and on-device latency benchmarks can be explored on embedl/Edge-Inference-Benchmarks.

Installation

pip install flash-head

The flash-head vLLM plugin is required. It activates automatically at startup.
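
Before starting vLLM, it can help to confirm that both packages are installed in the active environment. The following is a minimal sketch using only the standard library. `installed_version` is a hypothetical helper, and it is an assumption that the distribution names match the `pip` package names `flash-head` and `vllm`. It checks installation only, not whether the plugin has actually been activated.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Assumed distribution names: the names used with `pip install`.
for name in ("flash-head", "vllm"):
    version = installed_version(name)
    print(f"{name}: {version or 'not installed'}")
```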


Usage Examples

vLLM Serve

vllm serve embedl/Cosmos-Reason2-8B-W4A16-FlashHead \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75
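
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the standard library and assumes the server listens on vLLM's default address, `http://localhost:8000`. `build_chat_request` and `send_chat_request` are hypothetical helper names, and the sampling values simply mirror the offline example in this card.

```python
import json
import urllib.request

MODEL = "embedl/Cosmos-Reason2-8B-W4A16-FlashHead"


def build_chat_request(prompt, video_url=None, max_tokens=256):
    """Build an OpenAI-style chat completion payload, optionally with one video."""
    content = []
    if video_url is not None:
        content.append({"type": "video_url", "video_url": {"url": video_url}})
    content.append({"type": "text", "text": prompt})
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
#   payload = build_chat_request("Describe this video in detail.", video_url=...)
#   print(send_chat_request(payload))
```

Keeping payload construction separate from the network call makes the request easy to inspect or log before it is sent.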

vLLM Video Inference

from vllm import LLM, SamplingParams

if __name__ == "__main__":
    model = "embedl/Cosmos-Reason2-8B-W4A16-FlashHead"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url, "fps": 4}},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

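    # Load the model with vLLM's offline engine.
    llm = LLM(
        model=model,
        # Allow one video per prompt and disable image and audio inputs.
        limit_mm_per_prompt={
            "video": {"count": 1, "num_frames": 12, "width": 1280, "height": 720},
            "image": 0,
            "audio": 0,
        },
        # num_frames=-1 asks the video loader for all frames rather than a fixed sample.
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=8192,  # maximum context length in tokens
        mm_processor_kwargs={"truncation": False},
        gpu_memory_utilization=0.75,  # fraction of GPU memory vLLM may use
        trust_remote_code=True,
    )

    # Greedy decoding (temperature 0.0), capped at 256 generated tokens.
    output = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=256))
    print(output[0].outputs[0].text)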

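License

  • Upstream: NVIDIA Open Model License
  • Optimized components: Embedl Models Community Licence v1.0 (no redistribution)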

Contact

  • Enterprise and Commercial Inquiries: models@embedl.com
  • Technical Issues and Early Access: https://github.com/embedl/flash-head
  • More Information and Model Releases: https://embedl.com

Community & support

Need help with this model? Chat with the Embedl team and other engineers on Discord. Quantization gotchas, hardware questions, fine-tuning tips: bring them all.