# Cosmos-Reason2-2B-W4A16

An optimized version of nvidia/Cosmos-Reason2-2B, quantized for reduced GPU memory usage and improved inference efficiency while maintaining high-quality multimodal reasoning performance.

This model was created by quantizing the base language model's weights to INT4 while keeping activations in FP16 precision. It preserves the reasoning capabilities of the original Cosmos-Reason2-2B while significantly reducing the memory footprint of the model weights.
## Model Details
| Field | Value |
|---|---|
| Base Model | nvidia/Cosmos-Reason2-2B |
| Input / Output | Text + Image / Video → Text |
| Release Date | 2026-02-13 |
| Version | 1.0 |
| Optimizations | Quantization (W4A16) |
| Developers | Embedl |
| Licenses | Upstream: NVIDIA Open Model License. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, video analytics, planning, and general-purpose NLP on NVIDIA GPUs |
## Optimizations

- Quantization (W4A16): large reduction in memory footprint and latency.

### Quantization Details
- Algorithm: AWQ
- Scheme: W4A16
- Weight Quantization: INT4 (W4)
- Activation Precision: FP16 (A16)
- Ignored Modules: lm_head and visual
- Calibration Datasets: lmms-lab/flickr30k (image) and gigant/webvid-mini (video)
- Number of Calibration Samples: 1024 (512 image + 512 video)
- Max Sequence Length: 16,384
- Quantization Library: llm-compressor
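To make the W4A16 scheme concrete, here is a minimal, self-contained sketch of symmetric group-wise INT4 weight quantization with floating-point dequantization. It is illustrative only: it does not reproduce AWQ's activation-aware scale search or the llm-compressor implementation, and the group size and sample weights are arbitrary.

```python
def quantize_w4(weights, group_size=4):
    """Symmetric group-wise INT4 quantization: each group of weights
    shares one floating-point scale; values are rounded to [-8, 7]."""
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / 7.0 or 1.0
        q = [max(-8, min(7, round(w / scale))) for w in group]
        quantized.append(q)
        scales.append(scale)
    return quantized, scales


def dequantize_w4(quantized, scales):
    """Reconstruct approximate FP weights from INT4 codes and scales."""
    return [q * s for qs, s in zip(quantized, scales) for q in qs]


if __name__ == "__main__":
    w = [0.12, -0.30, 0.05, 0.21, -0.07, 0.44, -0.02, 0.09]
    q, s = quantize_w4(w)
    w_hat = dequantize_w4(q, s)
    max_err = max(abs(a - b) for a, b in zip(w, w_hat))
    print(q)        # INT4 codes in [-8, 7]
    print(max_err)  # small per-weight reconstruction error
```

In the actual model only the LLM weights are stored this way (activations stay FP16, and `lm_head` and the vision tower are skipped, as listed above); AWQ additionally chooses scales so that quantization error on salient channels, as measured on the calibration data, is minimized.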
## Performance

### Text-only inference

#### NVIDIA Jetson Orin Nano Super
| Model | e2e(s) | TPS | TPOT(ms) | TTFT(ms) |
|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | OOM | OOM | OOM | OOM |
| embedl/Cosmos-Reason2-2B-W4A16 | 4.5489 | 56.28 | 17.42 | 84.61 |
#### NVIDIA Jetson AGX Orin
| Model | e2e(s) | TPS | TPOT(ms) | TTFT(ms) |
|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | 5.6196 | 45.55 | 21.79 | 36.40 |
| embedl/Cosmos-Reason2-2B-W4A16 | 2.5601 | 100.00 | 9.85 | 35.56 |
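A quick check of what the AGX Orin text-only numbers above imply, using the tabulated e2e and TPS values:

```python
# Text-only figures from the AGX Orin table above
base_e2e, quant_e2e = 5.6196, 2.5601
base_tps, quant_tps = 45.55, 100.00

e2e_speedup = base_e2e / quant_e2e   # end-to-end latency ratio
tps_speedup = quant_tps / base_tps   # decode throughput ratio
print(round(e2e_speedup, 2))  # roughly 2.2x faster end to end
print(round(tps_speedup, 2))  # roughly 2.2x more tokens per second
```

On the Orin Nano Super the comparison is starker: the FP16 base model does not fit in memory at all (OOM), so the quantized model is the only one that runs.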
### Video-only inference

#### NVIDIA Jetson Orin Nano Super
| Model | Resolution | FPS | Frames | e2e(s) | TPS | TPOT(ms) | TTFT(ms) |
|---|---|---|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
| embedl/Cosmos-Reason2-2B-W4A16 | 854x480 | 2 | 6 | 5.1020 | 50.18 | 18.89 | 154.79 |
| embedl/Cosmos-Reason2-2B-W4A16 | 854x480 | 4 | 12 | 5.1017 | 50.18 | 18.88 | 154.10 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1280x720 | 2 | 6 | 5.8696 | 43.61 | 20.79 | 324.38 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1280x720 | 4 | 12 | 5.8635 | 43.66 | 20.80 | 334.08 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1920x1080 | 2 | 6 | 4.8353 | 31.02 | 25.20 | 650.74 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1920x1080 | 4 | 12 | 4.8483 | 30.94 | 25.23 | 657.56 |
#### NVIDIA Jetson AGX Orin
| Model | Resolution | FPS | Frames | e2e(s) | TPS | TPOT(ms) | TTFT(ms) |
|---|---|---|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | 854x480 | 2 | 6 | 6.0471 | 42.33 | 22.59 | 146.10 |
| embedl/Cosmos-Reason2-2B-W4A16 | 854x480 | 2 | 6 | 3.0156 | 84.89 | 10.77 | 143.20 |
| nvidia/Cosmos-Reason2-2B | 854x480 | 4 | 12 | 6.0397 | 42.39 | 22.63 | 145.79 |
| embedl/Cosmos-Reason2-2B-W4A16 | 854x480 | 4 | 12 | 3.0035 | 85.23 | 10.76 | 144.12 |
| nvidia/Cosmos-Reason2-2B | 1280x720 | 2 | 6 | 6.4872 | 39.46 | 23.58 | 250.17 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1280x720 | 2 | 6 | 3.4448 | 74.31 | 11.75 | 248.53 |
| nvidia/Cosmos-Reason2-2B | 1280x720 | 4 | 12 | 6.4673 | 39.58 | 23.56 | 249.17 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1280x720 | 4 | 12 | 3.4416 | 74.38 | 11.71 | 243.54 |
| nvidia/Cosmos-Reason2-2B | 1920x1080 | 2 | 6 | 7.4784 | 34.23 | 25.92 | 521.39 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1920x1080 | 2 | 6 | 4.4439 | 57.61 | 14.06 | 507.71 |
| nvidia/Cosmos-Reason2-2B | 1920x1080 | 4 | 12 | 7.5190 | 34.05 | 25.93 | 523.36 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1920x1080 | 4 | 12 | 4.4386 | 57.68 | 14.00 | 508.39 |
### Image-only inference

#### NVIDIA Jetson Orin Nano Super
| Model | Resolution | e2e(s) | TPS | TPOT(ms) | TTFT(ms) |
|---|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | OOM | OOM | OOM | OOM | OOM |
| embedl/Cosmos-Reason2-2B-W4A16 | 854x480 | 5.5183 | 42.22 | 20.88 | 107.25 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1280x720 | 5.5010 | 42.36 | 20.87 | 105.69 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1920x1080 | 5.4421 | 42.81 | 20.88 | 105.89 |
#### NVIDIA Jetson AGX Orin
| Model | Resolution | e2e(s) | TPS | TPOT(ms) | TTFT(ms) |
|---|---|---|---|---|---|
| nvidia/Cosmos-Reason2-2B | 854x480 | 6.0213 | 38.03 | 23.66 | 58.22 |
| embedl/Cosmos-Reason2-2B-W4A16 | 854x480 | 3.2648 | 71.06 | 11.76 | 55.98 |
| nvidia/Cosmos-Reason2-2B | 1280x720 | 8.6497 | 26.47 | 23.64 | 68.63 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1280x720 | 3.2759 | 70.82 | 11.80 | 54.90 |
| nvidia/Cosmos-Reason2-2B | 1920x1080 | 5.9558 | 38.45 | 23.63 | 57.30 |
| embedl/Cosmos-Reason2-2B-W4A16 | 1920x1080 | 3.1621 | 73.37 | 11.75 | 53.14 |
Measurement setup: NVIDIA vLLM 0.14.0 for Jetson, batch_size=1, max_new_tokens=256, 5 warm-up runs, averaged over 10 runs.
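The warm-up-then-average protocol described above can be sketched as follows; `run_inference` is a hypothetical stand-in for an actual model call, not part of the benchmark harness used here.

```python
import time


def benchmark(fn, warmup=5, runs=10):
    """Time `fn`: discard `warmup` iterations, then average `runs` timed
    iterations, mirroring the measurement setup described above."""
    for _ in range(warmup):        # warm-up runs are discarded
        fn()
    timings = []
    for _ in range(runs):          # timed runs
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


if __name__ == "__main__":
    # Hypothetical stand-in for a model forward pass
    def run_inference():
        time.sleep(0.01)

    avg_s = benchmark(run_inference)
    print(f"average latency: {avg_s * 1000:.1f} ms")
```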
### Performance Metric Definitions

- **e2e Latency (End-to-End Latency):** Total time from request submission to completion of the full generated response; this reflects real user-perceived latency. Lower is better.
- **TPS (Tokens Per Second):** Number of output tokens generated per second during the decoding phase. Higher is better.
- **TPOT (Time Per Output Token):** Average time (in milliseconds) required to generate one output token during decoding, computed as TPOT = (last_token_ts - first_token_ts) / total_output_tokens. Lower is better.
- **TTFT (Time To First Token):** Time from request submission to generation of the first output token; this includes vision encoding, prompt prefill, and KV cache initialization. Lower is better.
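Given the request timestamp and per-token timestamps from a streaming run, these metrics can be derived directly from the definitions above (the timestamp list here is synthetic, for illustration only):

```python
def compute_metrics(request_ts, token_ts):
    """Derive e2e, TTFT, TPOT, and TPS from a request timestamp and the
    timestamps (in seconds) at which each output token was produced."""
    n = len(token_ts)
    e2e = token_ts[-1] - request_ts
    ttft = token_ts[0] - request_ts
    decode_time = token_ts[-1] - token_ts[0]
    tpot = decode_time / n * 1000.0                     # ms per output token
    tps = n / decode_time if decode_time else float("inf")
    return {
        "e2e_s": e2e,
        "ttft_ms": ttft * 1000.0,
        "tpot_ms": tpot,
        "tps": tps,
    }


if __name__ == "__main__":
    # Synthetic trace: first token at 100 ms, then one token every 20 ms
    ts = [0.1 + 0.02 * i for i in range(10)]
    print(compute_metrics(0.0, ts))
```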
## Usage Examples

**Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).
### vLLM Video Inference

vLLM container image: NVIDIA vLLM 0.14.0 for Jetson

Test Hardware: NVIDIA Jetson AGX Orin

`--gpu-memory-utilization` and `--max-num-seqs` should be adapted to system specifications (i.e., available RAM).
```shell
docker run --rm -it \
  --network host \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm-serve \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16" \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75 \
    --max-num-seqs 2
```
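Once the server is up, it exposes an OpenAI-compatible API (assumed here at vLLM's default `http://localhost:8000/v1/chat/completions`). A minimal sketch of the chat request body for video input; the `video_url` content type is a vLLM multimodal extension:

```python
import json

# Sketch of an OpenAI-compatible chat request for the served model.
# Endpoint path and port assume `vllm serve` defaults; adjust as needed.
payload = {
    "model": "embedl/Cosmos-Reason2-2B-W4A16",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    # vLLM extension type for video inputs
                    "type": "video_url",
                    "video_url": {
                        "url": "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"
                    },
                },
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ],
    "max_tokens": 256,
    "temperature": 0.0,
}
print(json.dumps(payload, indent=2))
# POST this body to http://localhost:8000/v1/chat/completions,
# e.g. with requests.post(url, json=payload) or curl -d @payload.json
```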
Test Hardware: NVIDIA Jetson AGX Orin, NVIDIA Jetson Orin Nano Super

`gpu_memory_utilization` and `max_num_seqs` should be adapted to system specifications (i.e., available RAM).
```python
from vllm import LLM, SamplingParams

if __name__ == "__main__":
    model = "embedl/Cosmos-Reason2-2B-W4A16"
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"
    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant."}
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {"url": video_url, "fps": 4},
                },
                {
                    "type": "text",
                    "text": "Describe this video in detail.",
                },
            ],
        },
    ]
    llm = LLM(
        model=model,
        limit_mm_per_prompt={
            "video": {
                "count": 1,
                "num_frames": 12,
                "width": 1920,
                "height": 1080,
            },
            "image": 0,
            "audio": 0,
        },
        media_io_kwargs={"video": {"num_frames": -1}},
        max_model_len=8192,
        mm_processor_kwargs={"truncation": False},
        # System-specific settings - adapt depending on available RAM
        disable_log_stats=False,
        gpu_memory_utilization=0.75,
        max_num_seqs=2,
    )
    output = llm.chat(
        messages,
        sampling_params=SamplingParams(temperature=0.0, max_tokens=256),
    )
    print(output[0].outputs[0].text)
```
### Transformers Inference

Test Hardware: NVIDIA L4 GPU

Adapted from nvidia/Cosmos-Reason2-2B.
```python
import torch
import transformers

if __name__ == "__main__":
    model_name = "embedl/Cosmos-Reason2-2B-W4A16"
    model = transformers.Qwen3VLForConditionalGeneration.from_pretrained(
        model_name,
        device_map="auto",
        attn_implementation="sdpa",
    )
    processor: transformers.Qwen3VLProcessor = (
        transformers.AutoProcessor.from_pretrained(model_name)
    )
    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"
    video_messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_url, "fps": 4},
                {"type": "text", "text": "Describe this video in detail."},
            ],
        },
    ]
    # Process inputs
    inputs = processor.apply_chat_template(
        video_messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        truncation=False,
        fps=4,
    )
    inputs = inputs.to(model.device)
    # Run inference
    generated_ids = model.generate(**inputs, max_new_tokens=4096)
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(
            inputs.input_ids, generated_ids, strict=False
        )
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    print(output_text[0])
```
## License

Built on NVIDIA Cosmos. This model is a derivative of nvidia/Cosmos-Reason2-2B, licensed by NVIDIA Corporation under the NVIDIA Open Model License.

- Upstream: NVIDIA Open Model License
- Additional Information: Apache License 2.0
- Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)
## Contact

- Enterprise & Commercial Inquiries: sales@embedl.com
- Technical Issues & Early Access: https://github.com/embedl/embedl-models
- More Information & Model Releases: https://embedl.com
## Partner & Developer Opportunities
If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:
- Engineering support for on-prem/edge deployments
- Early access & partner co-marketing opportunities
Contact: sales@embedl.com