Cosmos3-Nano — FP8 (8-bit, quality tier)

An FP8 (E4M3) weight-only quantization of nvidia/Cosmos3-Nano, produced with NVIDIA TensorRT Model Optimizer. This is the quality tier: FP8 output is nearly indistinguishable from BF16 and holds up on the hard cases (dense hands, text) where 4-bit can wobble. The transformer's attention and FFN linear layers and the lm_head are FP8; embeddings, norms, the time embedder, and the modality adapters stay BF16. Activations also stay BF16 (weight-only).
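
To make "FP8 (E4M3) weight-only" concrete, here is a minimal pure-Python sketch of the round trip a weight goes through. It assumes the common E4M3 variant with a maximum magnitude of 448 and a single per-tensor scale chosen so that the largest weight maps to 448. The function names are illustrative and this is not the code used to build this checkpoint.

```python
import math

E4M3_MAX = 448.0       # largest finite E4M3 magnitude (assumed "fn" variant)
MIN_NORMAL_EXP = -6    # exponent of the smallest normal E4M3 value
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value representable in E4M3, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    _, e = math.frexp(mag)
    exp = max(e - 1, MIN_NORMAL_EXP)      # below this, values are subnormal
    step = 2.0 ** (exp - MANTISSA_BITS)   # spacing between neighbours in this binade
    # Python's round() is round-half-to-even, matching the usual IEEE rounding mode.
    return sign * min(round(mag / step) * step, E4M3_MAX)


def fake_quantize_weights(weights):
    """Quantize to E4M3 with one per-tensor scale, then dequantize back to floats."""
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return list(weights), 1.0
    scale = amax / E4M3_MAX
    return [round_to_e4m3(w / scale) * scale for w in weights], scale


weights = [0.02, -0.5, 0.13, 0.31]
dequantized, scale = fake_quantize_weights(weights)
for w, q in zip(weights, dequantized):
    print(f"{w:+.5f} -> {q:+.5f}")
```

With 3 mantissa bits, any weight that lands in the normal range is off by at most 1/16 of its own magnitude after the round trip, which is why the weights themselves change only slightly.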

Derivative of nvidia/Cosmos3-Nano. © NVIDIA. Distributed under OpenMDW-1.1 (license + NVIDIA copyright/origin notices retained, per the license). Not affiliated with, nor endorsed by, NVIDIA.

Precision options (pick by hardware)

| Build | ~Total size | Fits 16 GB GPU? | Quality |
| --- | --- | --- | --- |
| NVFP4-AWQ / INT4-AWQ | ~13 GB | ✅ (tight, e.g. RTX 5080) | near-zero loss; hardest hands/text can wobble |
| FP8 (this tier) | ~18 GB | ❌ (needs ~24 GB) | near-indistinguishable from BF16 |
| BF16 (original) | ~33 GB | | reference |
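
As a rough illustration of "pick by hardware", the sketch below encodes only the memory thresholds stated in the table. The `pick_build` helper and its structure are hypothetical. BF16 is left out because the table gives no GPU memory requirement for it.

```python
# (build name, approx. total size in GB, GPU memory in GB the table says it needs)
BUILDS = [
    ("FP8", 18, 24),
    ("NVFP4-AWQ / INT4-AWQ", 13, 16),
]


def pick_build(vram_gb: float):
    """Return the highest-quality quantized build whose stated requirement fits."""
    for name, _size_gb, min_vram_gb in BUILDS:   # ordered best quality first
        if vram_gb >= min_vram_gb:
            return name
    return None   # the table lists no build for GPUs under 16 GB


print(pick_build(24))   # FP8
print(pick_build(16))   # NVFP4-AWQ / INT4-AWQ
print(pick_build(12))   # None
```

If torch is installed, `torch.cuda.get_device_properties(0).total_memory / 1e9` gives a value to pass in. Treat the thresholds as approximate, since the table itself marks the 16 GB case as tight.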

Quality

FP8 is a standard near-lossless quantization format. We checked it on the specific hard cases that 4-bit struggled with (the hand cluster in the four-friends selfie, the interlocking handshake, dense limbs), and FP8 keeps them clean (see fp8_vs_bf16_hardcases.png). As with any quantization (or even a different BF16 seed), the result is a different but equivalent sample, not identical pixels.

Usage

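import torch
from huggingface_hub import snapshot_download
from diffusers import Cosmos3OmniPipeline, Cosmos3OmniTransformer
import modelopt.torch.opt as mto

# Download the repository snapshot (quantized transformer plus the other pipeline components).
repo = snapshot_download("Reza2kn/Cosmos3-Nano-FP8")

# Build the transformer from its config only; the weights are restored in the next step.
tf = Cosmos3OmniTransformer.from_config(
    Cosmos3OmniTransformer.load_config(f"{repo}/transformer/config.json")).to(torch.bfloat16)

# Restore the modelopt quantization state and the quantized weights into the module.
mto.restore(tf, f"{repo}/transformer/modelopt_quantized.pt")

# Assemble the pipeline around the restored transformer.
pipe = Cosmos3OmniPipeline.from_pretrained(
    repo, transformer=tf, torch_dtype=torch.bfloat16, enable_safety_checker=False).to("cuda")

# num_frames=1 produces a single image; take the first frame of the first result.
with torch.autocast("cuda", dtype=torch.bfloat16):
    img = pipe("A red panda astronaut floating in a nebula", num_frames=1, height=480, width=480).video[0][0]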

Alternatively, use the load_quantized helper: from load_quantized import load; pipe = load(). Requirements: diffusers (git main or ≥0.39), nvidia-modelopt, and a CUDA 12.8 (cu128) build of torch.
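
Before loading the model, it can help to check which of the listed requirements are installed. The sketch below uses only the standard library and assumes the pip distribution names `diffusers`, `nvidia-modelopt`, and `torch`. The `installed_versions` helper is illustrative and not part of this repository.

```python
from importlib import metadata

REQUIRED = ("diffusers", "nvidia-modelopt", "torch")


def installed_versions(packages=REQUIRED):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

For pip wheels from the PyTorch index, a cu128 build usually shows a `+cu128` suffix in the torch version string. That is a convention worth checking by eye rather than a guarantee.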

Method

Quantized with modelopt's FP8_DEFAULT_CFG in weight-only mode, calibrated on multimodal image and video prompts run through the real denoising loop. FP8: self_attn.*, mlp.*, mlp_moe_gen.*, and lm_head. BF16: embeddings, norms, time_embedder, proj_in/out, and the audio/action adapters.
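
The layer selection above can be pictured as wildcard matching on module names. The sketch below is a stand-alone illustration using `fnmatch`. The patterns and the example module names are guesses based on the list above, not the actual modelopt config or the model's real module paths.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for layers that stay BF16; checked first.
KEEP_BF16 = ("*embed*", "*norm*", "*adapter*", "proj_in*", "proj_out*")
# Hypothetical patterns for layers whose weights become FP8.
QUANTIZE_FP8 = ("*self_attn.*", "*mlp.*", "*mlp_moe_gen.*", "*lm_head*")


def weight_precision(module_name: str) -> str:
    """Return 'fp8' or 'bf16' for a module name under the assumed patterns."""
    if any(fnmatchcase(module_name, p) for p in KEEP_BF16):
        return "bf16"
    if any(fnmatchcase(module_name, p) for p in QUANTIZE_FP8):
        return "fp8"
    return "bf16"   # anything not explicitly selected is left alone


# Example names are made up for illustration.
for name in ("blocks.0.self_attn.q_proj", "blocks.0.mlp.fc1", "lm_head",
             "time_embedder.linear_1", "proj_in", "audio_adapter.mlp.fc"):
    print(f"{name:28s} {weight_precision(name)}")
```

Checking the BF16 patterns first keeps a name like `audio_adapter.mlp.fc` out of FP8, even though it also contains `mlp.`.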
