Cosmos3-Nano: NVFP4-AWQ (4-bit, Blackwell-native)

An NVFP4 (4-bit), weight-only, activation-aware quantization (AWQ) of nvidia/Cosmos3-Nano, produced with NVIDIA TensorRT Model Optimizer. NVFP4 is the Blackwell-native 4-bit floating-point format (E2M1 values with FP8 block scales). The transformer's attention and FFN linear layers (~11.8 B parameters, 77.6%) are stored in NVFP4; embeddings, norms, the diffusion time-embedder, and modality adapters stay in BF16. Activations also stay in BF16 (weight-only).
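To make "E2M1 with block scales" concrete, here is a minimal fake-quantization sketch in plain Python. It is an illustration only, not the modelopt kernel. It assumes a block size of 16, which is how NVFP4 is generally described. It also keeps each block scale in full precision, whereas the real format stores the scale in FP8. The function names are hypothetical.

```python
import math

# Magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed block size: one shared scale per 16 weights


def quantize_block(block):
    """Fake-quantize one block; returns (scale, dequantized values)."""
    amax = max(abs(w) for w in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the top grid value (6.0).
    # Simplification: the scale stays a Python float instead of FP8.
    scale = amax / E2M1_GRID[-1]
    out = []
    for w in block:
        target = abs(w) / scale
        q = min(E2M1_GRID, key=lambda g: abs(g - target))  # nearest grid point
        out.append(math.copysign(q * scale, w))
    return scale, out


def quantize(weights):
    """Apply quantize_block to consecutive blocks of BLOCK weights."""
    result = []
    for i in range(0, len(weights), BLOCK):
        _, deq = quantize_block(weights[i:i + BLOCK])
        result.extend(deq)
    return result
```

Because the grid is coarser near its top (the step from 4 to 6), the worst-case rounding error in a block is one scale unit. This is part of why an activation-aware method such as AWQ rescales weights before rounding.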

Derivative of nvidia/Cosmos3-Nano. © NVIDIA. Distributed under OpenMDW-1.1 (license + NVIDIA copyright/origin notices retained, per the license). Not affiliated with, nor endorsed by, NVIDIA.

Precision options (pick by hardware)

| Build | ~Total size | Fits 16 GB GPU? | Quality |
| --- | --- | --- | --- |
| NVFP4-AWQ / INT4-AWQ (this tier) | ~13 GB | ✅ (tight) | near-zero loss; hardest hands/text can wobble |
| FP8 | ~18 GB | ❌ (~24 GB) | near-indistinguishable from BF16 |
| BF16 (original) | ~33 GB | ❌ | reference |
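The ~13 GB figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below rests on three assumptions: 77.6% is a share of the transformer's total parameter count, NVFP4 stores one 8-bit scale per 16 weights (4.5 bits per weight), and everything else is BF16. It ignores AWQ pre-scales, the other pipeline components, and runtime activations, so treat the result as a rough lower bound.

```python
QUANTIZED_PARAMS = 11.8e9   # attention + FFN linears, from this card
QUANTIZED_SHARE = 0.776     # assumed: share of total parameter count
BLOCK = 16                  # assumed NVFP4 block size
BITS_PER_NVFP4_WEIGHT = 4 + 8 / BLOCK  # 4-bit value + one 8-bit scale per block
BITS_PER_BF16_WEIGHT = 16


def estimate_gb(quantized_params, quantized_share):
    """Estimate transformer weight storage in decimal gigabytes."""
    total_params = quantized_params / quantized_share
    bf16_params = total_params - quantized_params
    bits = (quantized_params * BITS_PER_NVFP4_WEIGHT
            + bf16_params * BITS_PER_BF16_WEIGHT)
    return bits / 8 / 1e9


print(f"{estimate_gb(QUANTIZED_PARAMS, QUANTIZED_SHARE):.1f} GB")
```

Under these assumptions the estimate lands between 13 and 14 GB, consistent with the table and with the "tight" fit on a 16 GB GPU.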

Quality vs BF16 (96-prompt anatomy-weighted sweep)

| Metric | BF16 | NVFP4-AWQ |
| --- | --- | --- |
| PickScore (human pref) | 21.85 | 21.82 (Δ −0.03) |
| FID vs BF16 | n/a | 80.6 (best distribution match of all 4-bit recipes) |
| Functional fidelity (velocity cosine) | 1.000 | ~0.998 |

FID context: BF16 vs BF16 at a different seed (same prompts, N=96) gives 138.6. NVFP4-AWQ's 80.6 is well below that seed-noise floor: it tracks BF16 more closely than BF16 tracks itself across seeds. Caveat: dense interlocking hands and on-image text can still wobble (these are hard for the base model and show up in BF16 too). See worst_case_contact_sheet.png.
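The card does not spell out how "velocity cosine" is computed. One plausible reading is the cosine similarity between the transformer's predicted velocity from the BF16 and the quantized model on identical inputs, averaged over samples. The sketch below illustrates that reading with toy vectors. The function names and data are hypothetical and this is not the evaluation script used here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_velocity_cosine(reference, quantized):
    """Average cosine over paired (flattened) velocity predictions."""
    scores = [cosine_similarity(r, q) for r, q in zip(reference, quantized)]
    return sum(scores) / len(scores)


# Toy stand-ins for flattened transformer outputs at two denoising steps.
bf16_out = [[1.0, 2.0, -1.0], [0.5, -0.5, 2.0]]
nvfp4_out = [[1.1, 1.9, -1.0], [0.5, -0.4, 2.1]]
score = mean_velocity_cosine(bf16_out, nvfp4_out)
```

A value of 1.0 means the quantized model points in exactly the same direction as the reference at every step; small deviations accumulate over the denoising trajectory, which is why image-level metrics are reported alongside it.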

Usage

import torch
from huggingface_hub import snapshot_download
from diffusers import Cosmos3OmniPipeline, Cosmos3OmniTransformer
import modelopt.torch.opt as mto

# Download the repo snapshot (pipeline components + quantized transformer state).
repo = snapshot_download("Reza2kn/Cosmos3-Nano-NVFP4-AWQ")

# Build the transformer from its config only, then restore the quantized state into it.
tf = Cosmos3OmniTransformer.from_config(
    Cosmos3OmniTransformer.load_config(f"{repo}/transformer/config.json")).to(torch.bfloat16)
mto.restore(tf, f"{repo}/transformer/modelopt_quantized.pt")

# Load the remaining components and plug in the restored transformer.
pipe = Cosmos3OmniPipeline.from_pretrained(
    repo, transformer=tf, torch_dtype=torch.bfloat16, enable_safety_checker=False).to("cuda")

# autocast is required: the rotary embeddings are float32 but feed bf16 linears.
with torch.autocast("cuda", dtype=torch.bfloat16):
    img = pipe("A red panda astronaut floating in a nebula", num_frames=1, height=480, width=480).video[0][0]

Alternatively: from load_quantized import load; pipe = load(). Requires diffusers (git main or ≥0.39), nvidia-modelopt, and a CUDA 12.8 (cu128) build of torch. Works best on Blackwell (sm_120), which runs NVFP4 natively; other GPUs fall back to modelopt dequantization.
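Before loading, it can help to check which of those two paths a machine will take. The following is a small sketch, not part of this repo. The helper names are hypothetical, and it assumes that compute capability major version 10 or higher means a Blackwell-class GPU (sm_120 reports as (12, 0)). torch is imported lazily so the pure classification function can be used without it.

```python
def nvfp4_mode(capability):
    """Classify a (major, minor) CUDA compute capability tuple."""
    major, _minor = capability
    # Assumption: major >= 10 means Blackwell-class hardware with native FP4.
    return "native" if major >= 10 else "dequant-fallback"


def detect():
    """Report the expected NVFP4 path for the current default CUDA device."""
    import torch  # lazy import: only needed when probing real hardware

    if not torch.cuda.is_available():
        return "no-cuda"
    return nvfp4_mode(torch.cuda.get_device_capability())
```

For example, nvfp4_mode((12, 0)) returns "native", while an Ada GPU reporting (8, 9) returns "dequant-fallback".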

Method

Recipe: modelopt NVFP4_AWQ_LITE_CFG (awq_lite), weight-only, calibrated on multimodal image+video prompts run through the real denoising loop. Quantized: self_attn.*, mlp.*, mlp_moe_gen.*, lm_head. Kept in BF16: embeddings, norms, time_embedder, proj_in/out, audio/action adapters.
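The layer split above can be pictured as wildcard matching over module names. The sketch below uses only the Python standard library. The wildcard spellings, the rule that BF16 exclusions win over quantization patterns, and the example module names are all illustrative assumptions. It is not the modelopt config format or the model's actual module tree.

```python
from fnmatch import fnmatchcase

# Wildcards derived from the layer list above (spelling is an assumption).
QUANTIZED_PATTERNS = ("*self_attn.*", "*mlp.*", "*mlp_moe_gen.*", "*lm_head*")
BF16_PATTERNS = ("*embed*", "*norm*", "*time_embedder*",
                 "proj_in*", "proj_out*", "*audio*", "*action*")


def precision_for(module_name):
    """Return 'bf16' or 'nvfp4' for a dotted module name."""
    # Exclusions are checked first so that, e.g., a norm nested inside an
    # attention block stays in BF16 even though it matches '*self_attn.*'.
    if any(fnmatchcase(module_name, p) for p in BF16_PATTERNS):
        return "bf16"
    if any(fnmatchcase(module_name, p) for p in QUANTIZED_PATTERNS):
        return "nvfp4"
    return "bf16"  # anything unmatched is left unquantized


# Hypothetical module names, for illustration only.
for name in ("blocks.0.self_attn.q_proj", "blocks.0.self_attn.q_norm",
             "blocks.0.mlp_moe_gen.experts.0.down_proj", "time_embedder.linear_1"):
    print(f"{name}: {precision_for(name)}")
```

Checking exclusions first keeps precision-sensitive layers in BF16 by default, which matches the conservative split described above.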
