Cosmos3-Nano — NVFP4-AWQ (4-bit, Blackwell-native)

An NVFP4 (4-bit) weight-only, activation-aware quantization of nvidia/Cosmos3-Nano, produced with NVIDIA TensorRT Model Optimizer. NVFP4 is the Blackwell-native 4-bit format (E2M1 values with FP8 block scales). The transformer's attention and FFN linear layers (~11.8 B parameters, 77.6%) are stored in NVFP4; embeddings, norms, the diffusion time-embedder, and the modality adapters stay in BF16. Activations also stay in BF16, since only the weights are quantized.
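To make the NVFP4 format described above concrete, here is a minimal pure-Python sketch of block-wise "fake quantization" onto the E2M1 grid. It is an illustration, not modelopt's implementation: the block scale is kept as a Python float instead of being rounded to FP8 (E4M3), the per-tensor second-level scale is omitted, and the names `fake_quantize_block` and `fake_quantize` are hypothetical.

```python
# E2M1 has 1 sign bit, 2 exponent bits, 1 mantissa bit: 8 non-negative magnitudes.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 consecutive weights


def fake_quantize_block(block):
    """Round one block to the E2M1 grid and return the dequantized values."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Map the largest magnitude onto 6.0, the largest E2M1 value.
    # Simplification: real NVFP4 stores this scale in FP8; here it stays a float.
    scale = amax / E2M1_GRID[-1]
    out = []
    for v in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((nearest if v >= 0 else -nearest) * scale)
    return out


def fake_quantize(weights):
    """Apply block-wise fake quantization to a flat list of weights."""
    out = []
    for start in range(0, len(weights), BLOCK_SIZE):
        out.extend(fake_quantize_block(weights[start:start + BLOCK_SIZE]))
    return out


# Toy weights: two blocks of 16 values each.
weights = [0.01 * i - 0.15 for i in range(32)]
dequantized = fake_quantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, dequantized))
print(f"max abs round-trip error: {max_error:.4f}")
```

Because the widest gap on the grid is between 4.0 and 6.0, the worst-case rounding error in a block is one sixth of that block's largest magnitude. Small blocks keep that bound local to each group of 16 weights.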

Derivative of nvidia/Cosmos3-Nano. © NVIDIA. Distributed under OpenMDW-1.1 (license + NVIDIA copyright/origin notices retained, per the license). Not affiliated with, nor endorsed by, NVIDIA.

Precision options (pick by hardware)

Build	~Total size	Fits 16 GB GPU?	Quality
NVFP4-AWQ / INT4-AWQ (this tier)	~13 GB	✅ (tight)	near-zero loss; hardest hands/text can wobble
FP8	~18 GB	❌ (needs ~24 GB)	near-indistinguishable from BF16
BF16 (original)	~33 GB	❌	reference
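The sizes in the table can be roughly sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes that the 77.6% figure is the quantized share of all parameters, that NVFP4 costs 4 bits per weight plus one 8-bit scale per 16-weight block, and that everything else stays in BF16. It ignores per-tensor scales and any non-weight files, so it lands near the table's figures without reproducing them exactly.

```python
GB = 1e9  # decimal gigabytes

quantized_params = 11.8e9   # attention + FFN linear weights (from the description)
quantized_share = 0.776     # assumed to be the fraction of all parameters
total_params = quantized_params / quantized_share
bf16_params = total_params - quantized_params

# NVFP4: 4-bit values plus one 8-bit scale per 16-weight block -> 4.5 bits/weight.
NVFP4_BITS = 4 + 8 / 16


def size_gb(bits_per_quantized_weight):
    """Estimated weight size when only the quantized share changes precision."""
    quantized_bytes = quantized_params * bits_per_quantized_weight / 8
    remaining_bytes = bf16_params * 2  # BF16 = 2 bytes per parameter
    return (quantized_bytes + remaining_bytes) / GB


for name, bits in [("NVFP4", NVFP4_BITS), ("FP8", 8), ("BF16", 16)]:
    print(f"{name}: ~{size_gb(bits):.1f} GB")
```

The estimate also shows why the 4-bit tier does not shrink to a quarter of BF16: the unquantized BF16 remainder accounts for about half of the final size.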

Quality vs BF16 (96-prompt anatomy-weighted sweep)

Metric	BF16	NVFP4-AWQ
PickScore (human pref)	21.85	21.82 (Δ −0.03)
FID vs BF16	—	80.6 (best distribution match of all 4-bit recipes)
Functional fidelity (velocity cosine)	1.000	~0.998

FID context: the FID between two BF16 runs that differ only in seed (same prompts, N=96) is 138.6. NVFP4-AWQ's 80.6 is well below that seed-noise floor, so it tracks BF16 more closely than BF16 tracks itself across seeds. Caveat: dense interlocking hands and on-image text can still wobble; these are hard for the base model and appear in BF16 too. See worst_case_contact_sheet.png.
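The "velocity cosine" row can be read as the cosine similarity between the velocity predicted by the BF16 transformer and by the quantized one for the same inputs. The card does not say exactly how it was computed, so the sketch below is one plausible reading: flatten each prediction, take the cosine per denoising step, and average. The helper names and toy numbers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two flattened tensors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_velocity_cosine(reference_steps, quantized_steps):
    """Average the per-step cosine over a denoising trajectory."""
    scores = [cosine_similarity(r, q)
              for r, q in zip(reference_steps, quantized_steps)]
    return sum(scores) / len(scores)


# Toy stand-ins for flattened velocity predictions at two denoising steps.
bf16_steps = [[0.2, -1.0, 0.5, 0.1], [1.0, 0.3, -0.4, 0.8]]
nvfp4_steps = [[0.21, -0.98, 0.52, 0.08], [0.97, 0.31, -0.42, 0.83]]
score = mean_velocity_cosine(bf16_steps, nvfp4_steps)
print(f"velocity cosine: {score:.4f}")
```

A value of 1.0 means the two models push the sample in exactly the same direction. The metric ignores magnitude, so it is best read alongside the image-level scores in the table.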

Usage

import torch
from huggingface_hub import snapshot_download
from diffusers import Cosmos3OmniPipeline, Cosmos3OmniTransformer
import modelopt.torch.opt as mto

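# Download the quantized checkpoint (cached locally after the first call).
repo = snapshot_download("Reza2kn/Cosmos3-Nano-NVFP4-AWQ")
# Build the transformer skeleton from its config; real weights are loaded by mto.restore below.
tf = Cosmos3OmniTransformer.from_config(
    Cosmos3OmniTransformer.load_config(f"{repo}/transformer/config.json")).to(torch.bfloat16)
# Restore the modelopt quantization state and the quantized weights into the skeleton.
mto.restore(tf, f"{repo}/transformer/modelopt_quantized.pt")
# Assemble the rest of the pipeline around the quantized transformer.
pipe = Cosmos3OmniPipeline.from_pretrained(
    repo, transformer=tf, torch_dtype=torch.bfloat16, enable_safety_checker=False).to("cuda")
with torch.autocast("cuda", dtype=torch.bfloat16):       # required (float32 rotary -> bf16 linears)
    img = pipe("A red panda astronaut floating in a nebula", num_frames=1, height=480, width=480).video[0][0]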

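Alternatively, use the load_quantized helper: from load_quantized import load; pipe = load(). Requirements: diffusers (git main or ≥0.39), nvidia-modelopt, and a CUDA 12.8 (cu128) build of PyTorch. Runs best on Blackwell (sm_120), which supports NVFP4 natively; on other GPUs it runs through modelopt's dequantization path.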
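Before loading, it can help to check which path the local GPU is likely to take. The sketch below assumes that a CUDA compute capability with major version 10 or higher (which includes sm_120) indicates a Blackwell-class GPU; that threshold and the helper names are assumptions for illustration, not something modelopt exposes.

```python
def supports_native_nvfp4(capability):
    """True for compute capabilities with major version 10+ (assumed Blackwell-class)."""
    major, _minor = capability
    return major >= 10


def describe_gpu():
    """Report the likely NVFP4 path for the current GPU, if PyTorch and CUDA are available."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    device = torch.cuda.current_device()
    capability = torch.cuda.get_device_capability(device)
    total_gb = torch.cuda.get_device_properties(device).total_memory / 1e9
    if supports_native_nvfp4(capability):
        path = "native NVFP4"
    else:
        path = "modelopt dequant fallback"
    return f"sm_{capability[0]}{capability[1]}, {total_gb:.0f} GB: {path}"


print(describe_gpu())
```

The reported memory is also a quick way to see whether the ~13 GB build has headroom on a 16 GB card.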

Method

Quantized with modelopt's NVFP4_AWQ_LITE_CFG (awq_lite), weight-only, and calibrated on multimodal image+video prompts run through the real denoising loop. Quantized modules: self_attn.*, mlp.*, mlp_moe_gen.*, and lm_head. Kept in BF16: embeddings, norms, time_embedder, proj_in/out, and the audio/action adapters.
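The idea behind activation-aware calibration can be sketched in a few lines. AWQ-style methods scale each input channel of a weight matrix by a factor derived from activation magnitudes, quantize the scaled weights, and divide the activations by the same factor, choosing the factor that minimizes output error on calibration data. The pure-Python version below is a simplified illustration with hypothetical names and toy data; modelopt's awq_lite differs in its details.

```python
import random

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quantize_row(row, block_size=16):
    """Block-wise round-trip of one weight row through the E2M1 grid (float scales)."""
    out = []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
        for v in block:
            g = min(E2M1_GRID, key=lambda c: abs(abs(v) / scale - c))
            out.append((g if v >= 0 else -g) * scale)
    return out


def matvec(weight, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def output_error(weight, scales, calib):
    """Squared error between W x and Q(W * s) (x / s) over the calibration inputs."""
    scaled = [[w * s for w, s in zip(row, scales)] for row in weight]
    quantized = [fake_quantize_row(row) for row in scaled]
    err = 0.0
    for x in calib:
        ref = matvec(weight, x)
        out = matvec(quantized, [v / s for v, s in zip(x, scales)])
        err += sum((a - b) ** 2 for a, b in zip(ref, out))
    return err


def awq_search(weight, calib, steps=10):
    """Grid-search alpha in [0, 1]; per-channel scale = mean|x| ** alpha."""
    n_in = len(weight[0])
    act_mag = [sum(abs(x[j]) for x in calib) / len(calib) for j in range(n_in)]
    best = None
    for i in range(steps + 1):
        alpha = i / steps
        scales = [max(m, 1e-4) ** alpha for m in act_mag]
        err = output_error(weight, scales, calib)
        if best is None or err < best[0]:
            best = (err, alpha, scales)
    return best


rng = random.Random(0)
weight = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(4)]
# Channels 0 and 1 carry much larger activations, mimicking salient channels.
calib = [[rng.gauss(0, 1) * (20.0 if j < 2 else 1.0) for j in range(16)]
         for _ in range(8)]
baseline = output_error(weight, [1.0] * 16, calib)
best_err, best_alpha, _ = awq_search(weight, calib)
print(f"plain error: {baseline:.3f}, AWQ-style error: {best_err:.3f} (alpha={best_alpha})")
```

Since alpha = 0 reproduces plain quantization, the search can never do worse than the unscaled baseline on the calibration set. That is also why calibrating on representative inputs, such as the real denoising loop, matters.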
