# Cosmos-Reason2-2B-W4A16-Edge2-FlashHead
Optimized version of nvidia/Cosmos-Reason2-2B using quantization, targeted mixed-precision exclusions, and FlashHead, Embedl's efficient replacement for the language model head.
This model builds on embedl/Cosmos-Reason2-2B-W4A16-Edge2 and adds FlashHead, a lightweight replacement for the dense LM head that significantly improves throughput while preserving accuracy.
This model was created by quantizing the base language model to INT4 weights while keeping activations and a small selection of sensitive layers in FP16 precision. It preserves the Text + Image / Video -> Text behavior and reasoning capabilities of the base model while reducing the memory footprint, enabling Cosmos-Reason2 to run even on an NVIDIA Jetson Orin Nano Super.
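As a rough back-of-the-envelope illustration of the memory saving (ignoring the FP16-excluded layers, embeddings, KV cache, and runtime overhead), the weight storage of a ~2B-parameter model at different precisions can be estimated as:

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return num_params * bits_per_weight / 8 / 2**30

params = 2e9  # ~2B parameters (approximate)
fp16 = weight_memory_gib(params, 16)  # ~3.7 GiB
int4 = weight_memory_gib(params, 4)   # ~0.9 GiB
print(f"FP16: {fp16:.1f} GiB, INT4: {int4:.1f} GiB, ratio: {fp16 / int4:.0f}x")
```

The roughly 4x reduction in weight memory is what brings the model within reach of memory-constrained devices like the Jetson Orin Nano Super.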
FlashHead requires the Embedl Docker container for inference on edge devices. See the Usage Examples section for available container images per platform.
## Output Example

Real output captured on an NVIDIA Jetson Orin Nano Super.
system: You are a helpful assistant.
user: Describe this video in detail: https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4
assistant: The video showcases a dynamic scene of vehicular movement along a multi-laned highway, set against a picturesque backdrop of leafless trees and barren landscapes indicative of a winter or early spring setting. The highway features two lanes in each direction, separated by a median strip bordered by grassy embankments. Vehicles are traveling in both directions, with traffic flowing smoothly in the opposite directions.
In the foreground, a white sedan is prominently visible, moving away from the camera towards the right side of the frame. Its rear lights are illuminated, suggesting it might be braking or slowing down. Further down the road, a red hatchback follows closely behind, maintaining a safe distance from the white sedan. Further ahead, a silver hatchback is seen driving in the same direction as the white sedan, maintaining a consistent pace.
On the opposite side of the highway, a white pickup truck is traveling in the same direction as the white sedan, maintaining a steady speed. Further ahead, a blue hatchback is also traveling in the same direction as the white sedan, keeping a safe distance from the other vehicles. Additional vehicles can be seen further down the road, including a red hatchback and a black hatchback, all moving in the same direction as the white sedan.
## Model Details
| Field | Value |
|---|---|
| Model | embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead |
| Base Model | nvidia/Cosmos-Reason2-2B |
| Input / Output | Text + Image / Video -> Text |
| Release Date | 2026-03-11 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head + Quantization (W4A16) + Mixed Precision Exclusions (Edge2) |
| Developers | Embedl |
| Licenses | Upstream: NVIDIA Open Model License. Additional Information: Apache License 2.0. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, video analytics, planning, and general-purpose multimodal NLP on NVIDIA GPUs |
## Optimizations
- FlashHead LM Head: lightweight replacement for the dense LM head, significantly improving throughput.
- Quantization (W4A16): INT4 weights + FP16 activations.
- Edge2 mixed exclusions: Keep sensitive layers in FP16 precision.
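The W4A16 scheme above can be sketched with a minimal NumPy example of symmetric per-group INT4 weight quantization with FP16 scales. This is illustrative only; the group size, rounding scheme, and kernels used in the released weights may differ.

```python
import numpy as np

def quantize_w4(w: np.ndarray, group_size: int = 128):
    """Symmetric per-group INT4 quantization: FP weights -> INT4 codes + FP16 scales."""
    groups = w.reshape(-1, group_size)
    # One scale per group, mapping the largest-magnitude weight to the INT4 edge.
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_w4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate FP16 weights from INT4 codes and per-group scales."""
    return q.astype(np.float16) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(256 * 128).astype(np.float32)
q, scale = quantize_w4(w)
w_hat = dequantize_w4(q, scale).reshape(-1)
err = float(np.abs(w - w_hat).mean())
print(f"codes in [{q.min()}, {q.max()}], mean abs error ~ {err:.3f}")
```

The Edge2 exclusions address exactly the layers where this rounding error would hurt most: those stay in FP16 instead of being quantized.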
## Performance
On-device performance benchmarks can be explored on embedl/Edge-Inference-Benchmarks.
## Usage Examples
FlashHead requires the Embedl Docker container with patched vLLM.
| Platform | Image |
|---|---|
| NVIDIA Jetson Orin | embedl/vllm:latest-jetson-orin-flashhead |
| NVIDIA Jetson Thor | embedl/vllm:latest-jetson-thor-flashhead |
### vLLM Video Inference
Docker image: embedl/vllm:latest-jetson-orin-flashhead
Test Hardware (serve command): NVIDIA Jetson AGX Orin
```shell
docker run --rm -it \
  --network host \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm-serve \
  -e HF_TOKEN=hf_*** \
  -e HF_HOME=/root/.cache/huggingface \
  embedl/vllm:latest-jetson-orin-flashhead \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead" \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75 \
    --max-num-seqs 2 \
    --trust-remote-code
```
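Once the server is up, it exposes vLLM's OpenAI-compatible API (by default on port 8000). A minimal client sketch follows; the host, port, and `fps`-free video payload are assumptions, so adjust them to your deployment:

```python
import json
import urllib.request

def build_chat_request(model: str, video_url: str, prompt: str) -> dict:
    """Assemble an OpenAI-style chat completion payload with a video URL."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": prompt},
                ],
            },
        ],
        "temperature": 0.0,
        "max_tokens": 256,
    }

payload = build_chat_request(
    "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead",
    "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4",
    "Describe this video in detail.",
)

# Uncomment to send the request against a running server:
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(json.load(urllib.request.urlopen(req))["choices"][0]["message"]["content"])
```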
Test Hardware (Python example): NVIDIA Jetson AGX Orin, NVIDIA Jetson Orin Nano Super
```python
from vllm import LLM, SamplingParams

if __name__ == "__main__":
    model = "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"

    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url, "fps": 4}},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]

    llm = LLM(
        model=model,
        limit_mm_per_prompt={
            "video": {"count": 1, "num_frames": 12, "width": 1280, "height": 720},
            "image": 0,
            "audio": 0,
        },
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=8192,
        mm_processor_kwargs={"truncation": False},
        gpu_memory_utilization=0.75,
        max_num_seqs=2,
        trust_remote_code=True,
    )

    output = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=256))
    print(output[0].outputs[0].text)
```
## License
Built on NVIDIA Cosmos
This model is a derivative of nvidia/Cosmos-Reason2-2B.
Licensed by NVIDIA Corporation under the NVIDIA Open Model License
- Upstream: NVIDIA Open Model License
- Additional Information: Apache License 2.0
- Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)
## Contact

- Enterprise and Commercial Inquiries: models@embedl.com
- Technical Issues and Early Access: https://github.com/embedl/embedl-models
- More Information and Model Releases: https://embedl.com
## Partner & Developer Opportunities
If you are evaluating on-device inference, building products on this model, or exploring custom model optimization, reach out for:
- Engineering support for on-prem and edge deployments
- Early access and partner co-marketing opportunities
Contact: models@embedl.com