Cosmos3-Nano — FP8 (safetensors)

Weight-only FP8 (E4M3) quantization of the Cosmos3OmniTransformer for Cosmos3-Nano, delivered as safetensors. Produced by Session 3 (safetensors export + diffusers load path). The transformer drops from ~30 GB (bf16) to ~15 GB; VAE, vision encoder, tokenizers, and scheduler remain bf16. Runs the diffusers Cosmos3OmniPipeline on a single RTX 5090 (32 GB).
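
The size reduction follows from storage width alone: bf16 uses 2 bytes per weight and FP8 uses 1. A rough back-of-envelope sketch, assuming a parameter count of about 15 billion (inferred here from the "~30 GB (bf16)" figure, not stated in this card) and ignoring the modules kept in bf16 and the small scale buffers:

```python
# Back-of-envelope checkpoint size. PARAMS is an assumption inferred from
# the "~30 GB (bf16)" figure; it is not a number published in this card.
BYTES_BF16 = 2
BYTES_FP8 = 1
PARAMS = 15e9  # hypothetical parameter count


def checkpoint_gb(num_params, bytes_per_param):
    """Approximate on-disk size in GB (10^9 bytes) for a given storage width."""
    return num_params * bytes_per_param / 1e9


bf16_gb = checkpoint_gb(PARAMS, BYTES_BF16)
fp8_gb = checkpoint_gb(PARAMS, BYTES_FP8)
print(f"bf16: ~{bf16_gb:.0f} GB, fp8: ~{fp8_gb:.0f} GB")
```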

Load

import torch
from load_quantized import load           # self-contained; needs torch, diffusers, modelopt, safetensors

pipe = load(".")                           # this directory
with torch.autocast("cuda", dtype=torch.bfloat16):
    # num_frames=1 yields a single image: take the first frame of the first result
    img = pipe("a corgi astronaut", num_frames=1, height=480, width=480).video[0][0]
img.save("out.png")

Format (Path B)

The transformer is serialized as safetensors plus a tiny structural sidecar:

File	Contents
`transformer/diffusion_pytorch_model.safetensors`	505 weight-only E4M3 weights + per-tensor `weight_quantizer._amax` / `._scale` buffers + bf16 keep-modules (1819 tensors)
`transformer/modelopt_state.pt`	724 KB tensor-free ModelOpt structural state (quantizer layout) — needed to rebuild the quantizer modules
`transformer/config.json`	transformer config (`action_gen=false`)
`quantization_config.json`	recipe, exclusions, and the `scale_layout` (key suffixes, counts, granularity)
`transformer/modelopt_quantized.pt`	retained fallback — the ModelOpt `.pt`, loadable via `modelopt.torch.opt.restore`
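
The counts in the table can be checked without loading any tensor data, because a safetensors file starts with an 8-byte little-endian length followed by a JSON header that maps each tensor name to its dtype, shape, and byte offsets. The sketch below is a minimal, stdlib-only illustration. The function names are hypothetical, and whether the quantized weights appear under the `F8_E4M3` dtype label is an assumption about how this checkpoint was written.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64 header size
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return header


def summarize(header, suffixes=("weight_quantizer._amax", "weight_quantizer._scale")):
    """Return (tensor count, count per dtype, count per key suffix of interest)."""
    by_dtype = Counter(entry["dtype"] for entry in header.values())
    by_suffix = {s: sum(1 for name in header if name.endswith(s)) for s in suffixes}
    return len(header), by_dtype, by_suffix
```

Run against `transformer/diffusion_pytorch_model.safetensors`, `summarize(read_safetensors_header(path))` would be expected to report 1819 tensors in total if the table above is accurate.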

Load = from_config (action_gen=False) → modelopt.torch.opt.restore_from_modelopt_state → load_state_dict(strict=True). The loader reads only the safetensors + sidecar — never the .pt.

Security: modelopt_state.pt (and the retained modelopt_quantized.pt, if you use the fallback) is loaded with torch.load(weights_only=False), which unpickles the file and can run arbitrary code embedded in it. Load this checkpoint only from a source you trust: a tampered sidecar means arbitrary code execution at load time. The *.safetensors weights are safe to load; only the small structural sidecar (and the fallback .pt) uses pickle.
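
One way to narrow that exposure is to pin a SHA-256 digest of the sidecar, obtained through a channel you trust, and refuse to unpickle on mismatch. This card does not publish such a digest, so the expected value and the helper names below are hypothetical. The check only detects changes relative to the pinned value; it does not make pickle itself safe.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path, expected_hex):
    """Raise ValueError unless the file's digest matches the pinned one."""
    actual = sha256_of(path)
    if not hmac.compare_digest(actual, expected_hex.lower()):
        raise ValueError(f"{path}: SHA-256 mismatch, refusing to unpickle")
    return actual
```

A caller would run `verify_pinned("transformer/modelopt_state.pt", PINNED_DIGEST)` before invoking the loader, where `PINNED_DIGEST` is a value you recorded yourself from a trusted copy.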

Why a sidecar instead of pure export_hf_checkpoint? ModelOpt's unified HF export (diffusers dispatch) does not recognize Cosmos3OmniTransformer and drops the per-tensor FP8 scales, so its safetensors cannot be dequantized. Path B (above) preserves them. See docs/reports/session_3.md.

Recipe & scope (INV-2 / INV-3)

Weight-only FP8 E4M3 (activation quantizers disabled). Quantized: self_attn.*, mlp.*, mlp_moe_gen.*, lm_head (505 Linears). Kept bf16: token embeddings, all norms, time_embedder, proj_in/proj_out, audio/action adapters.
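
To make "weight-only, per-tensor E4M3" concrete, the sketch below fake-quantizes a flat list of weights in pure Python: scale the tensor so its largest magnitude maps to 448 (the E4M3 maximum), round each value to the nearest representable E4M3 number, then scale back. It assumes the convention scale = amax / 448 and saturating rounding; ModelOpt's actual kernels operate on tensors and may differ in detail, and the function names are illustrative.

```python
import math

E4M3_MAX = 448.0      # largest finite magnitude in E4M3 (the "fn" variant)
E4M3_MIN_EXP = -6     # exponent of the smallest normal value
MANTISSA_BITS = 3


def round_to_e4m3(x):
    """Round a float to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # floor(log2(mag)), clamped so tiny values fall on the subnormal grid
    exp = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)  # spacing of representable values at this exponent
    # Python's round() rounds ties to even, matching round-to-nearest-even
    return sign * min(round(mag / step) * step, E4M3_MAX)


def fake_quant_per_tensor(weights):
    """Quantize-dequantize with one scale for the whole tensor; returns (values, scale)."""
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return list(weights), 1.0
    scale = amax / E4M3_MAX  # assumed convention: amax maps to the E4M3 maximum
    return [round_to_e4m3(w / scale) * scale for w in weights], scale
```

With only 3 mantissa bits, each normal-range value lands within one sixteenth of its own magnitude after rounding, which is why the recipe leaves norms, embeddings, and the input/output projections in bf16.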

Equivalence

Reproduces the ModelOpt .pt (and thus the NVIDIA-style reference FP8 recipe) bitwise: weight round-trip max-abs-diff 0.0 (1812 tensors); pipeline latent error M1 = 0.0 and LPIPS = 0.0 on EC-01..04 at 8 steps and EC-01 at 35 steps (seed 123, UniPC flow_shift=10.0, 1f/480²).
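
The weight round-trip figure is a maximum absolute difference taken over every tensor. A minimal sketch of that metric, written against plain `{name: flat list of floats}` mappings rather than real tensors so it stays dependency-free (the function name and input layout are illustrative, not the project's actual test harness):

```python
def max_abs_diff(reference, candidate):
    """Largest elementwise |a - b| across two {name: flat list of floats} mappings."""
    if reference.keys() != candidate.keys():
        differing = sorted(reference.keys() ^ candidate.keys())
        raise KeyError(f"tensor names differ: {differing}")
    worst = 0.0
    for name, ref_values in reference.items():
        cand_values = candidate[name]
        if len(ref_values) != len(cand_values):
            raise ValueError(f"{name}: length mismatch")
        for a, b in zip(ref_values, cand_values):
            worst = max(worst, abs(a - b))
    return worst
```

A result of 0.0 means every compared element is numerically equal. As written, this sketch would not notice NaN entries or a difference between 0.0 and -0.0, so a strict bitwise claim needs a byte-level comparison as well.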

Limitations

action_gen=False build (matches the reference quantized checkpoint, whose .pt is action-adapter-stripped). No action-conditioned generation from this checkpoint.
Verified only at the smoke setting (1 frame, 480×480); full-resolution 720p / 189-frame generation is out of scope.
FP8 here is ModelOpt fake-quant: weights are stored as E4M3 but compute still runs in bf16, so real-FP4/FP8 kernel speedups are out of scope.

Base model: nvidia/Cosmos3-Nano (this repository, wfen/Cosmos3-Nano-FP8, is listed as a finetune of it)