Cosmos-Reason2-8B-W4A16-FlashHead
Optimized version of nvidia/Cosmos-Reason2-8B using quantization and FlashHead, Embedl's efficient replacement for the language model head.
Designed for low-latency inference on NVIDIA GPUs, leveraging:
- FlashHead
- Quantization (W4A16)
- vLLM plugin via flash-head
Model Details
| Field | Value |
|---|---|
| Base Model | nvidia/Cosmos-Reason2-8B |
| Input / Output | Text + Image / Video -> Text |
| Optimizations | FlashHead LM Head + Quantization (W4A16) |
| Developers | Embedl |
| Licenses | Upstream: NVIDIA Open Model License. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
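W4A16 denotes 4-bit weights with 16-bit activations. As a rough illustration of what that means for weight storage, the sketch below compares 16-bit and 4-bit footprints. It assumes a nominal 8 billion parameters (taken from the "8B" in the model name) and that every weight is quantized. The real checkpoint will differ, since quantization scales add overhead and some components may stay at higher precision. The helper name weight_memory_gib is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate storage for num_params weights at the given bit width, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumption: "8B" taken at face value; the exact parameter count is not stated here.
NOMINAL_PARAMS = 8e9

fp16 = weight_memory_gib(NOMINAL_PARAMS, 16)
w4 = weight_memory_gib(NOMINAL_PARAMS, 4)
print(f"16-bit weights: {fp16:.1f} GiB")
print(f" 4-bit weights: {w4:.1f} GiB")
```

This is a back-of-the-envelope estimate for weights only; it says nothing about KV cache or activation memory at inference time.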
Benchmarks
Accuracy and on-device latency benchmarks can be explored on embedl/Edge-Inference-Benchmarks.
Installation
pip install flash-head
The flash-head vLLM plugin is required. It activates automatically at startup.
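Before launching vLLM, it can help to confirm that both packages are present in the active environment. A minimal sketch using only the standard library; the helper name installed_version is hypothetical, and the check only reports whether the distributions are installed, not whether the plugin has actually been activated.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Report the status of the plugin and of vLLM itself.
for name in ("flash-head", "vllm"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```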
Usage Examples
vLLM Serve
vllm serve embedl/Cosmos-Reason2-8B-W4A16-FlashHead \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75
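Once the server is running, it can be queried through vLLM's OpenAI-compatible HTTP API. The following is a minimal sketch, assuming the server listens on the default http://localhost:8000 and was started without an API key. The helper names build_chat_request and send_chat_request are hypothetical, and any image URL passed in is the caller's own.

```python
import json
import urllib.request

MODEL = "embedl/Cosmos-Reason2-8B-W4A16-FlashHead"


def build_chat_request(prompt, image_url=None, max_tokens=256):
    """Build an OpenAI-style chat completion payload (hypothetical helper)."""
    content = []
    if image_url is not None:
        # Optional image part, placed before the text instruction.
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    content.append({"type": "text", "text": prompt})
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `send_chat_request(build_chat_request("Describe this image.", image_url=...))` returns the model's reply as a string. Nothing is sent until send_chat_request is called, so the payload can be inspected first.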
vLLM Video Inference
from vllm import LLM, SamplingParams

if __name__ == "__main__":
    model = "embedl/Cosmos-Reason2-8B-W4A16-FlashHead"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    # Chat messages: a system prompt, then a user turn with one video and a text instruction.
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url, "fps": 4}},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

    # Allow one video per prompt and disable image and audio inputs.
    llm = LLM(
        model=model,
        limit_mm_per_prompt={
            "video": {"count": 1, "num_frames": 12, "width": 1280, "height": 720},
            "image": 0,
            "audio": 0,
        },
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=8192,
        mm_processor_kwargs={"truncation": False},
        gpu_memory_utilization=0.75,
        trust_remote_code=True,
    )

    # Greedy decoding (temperature 0) with at most 256 generated tokens.
    output = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=256))
    print(output[0].outputs[0].text)
License
- Upstream: NVIDIA Open Model License
- Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)
Contact
- Enterprise and Commercial Inquiries: models@embedl.com
- Technical Issues and Early Access: https://github.com/embedl/flash-head
- More Information and Model Releases: https://embedl.com