How to use from the Diffusers library
pip install -U diffusers transformers accelerate
import torch
from diffusers import DiffusionPipeline

# switch to "mps" for apple devices
pipe = DiffusionPipeline.from_pretrained("wfen/Cosmos3-Nano-FP8", dtype=torch.bfloat16, device_map="cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]

Cosmos3-Nano - FP8 (safetensors)

Weight-only FP8 (E4M3) quantization of the Cosmos3OmniTransformer for Cosmos3-Nano, delivered as safetensors. Produced by Session 3 (safetensors export + diffusers load path). The transformer drops from ~30 GB (bf16) to ~15 GB; VAE, vision encoder, tokenizers, and scheduler remain bf16. Runs the diffusers Cosmos3OmniPipeline on a single RTX 5090 (32 GB).
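The halving from ~30 GB to ~15 GB is what you would expect from going from 2 bytes to 1 byte per quantized weight. A rough back-of-the-envelope sketch follows; the parameter count is an assumption inferred from the ~30 GB bf16 figure (it is not stated in this card), and the estimate ignores the bf16 keep-modules and the small scale buffers.

```python
# Bytes stored per weight for each format.
BYTES_PER_WEIGHT = {"bf16": 2, "fp8_e4m3": 1}


def weights_size_gb(num_params: int, dtype: str) -> float:
    """Approximate on-disk size of the weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e9


# ASSUMPTION: ~15 billion transformer parameters, inferred from ~30 GB at bf16.
ASSUMED_PARAMS = 15_000_000_000

bf16_gb = weights_size_gb(ASSUMED_PARAMS, "bf16")
fp8_gb = weights_size_gb(ASSUMED_PARAMS, "fp8_e4m3")
print(f"bf16: ~{bf16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB")
```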

Load

import torch
from load_quantized import load  # self-contained; needs torch, diffusers, modelopt, safetensors

pipe = load(".")  # this directory
with torch.autocast("cuda", torch.bfloat16):
    img = pipe("a corgi astronaut", num_frames=1, height=480, width=480).video[0][0]
img.save("out.png")

Format (Path B)

The transformer is serialized as safetensors plus a tiny structural sidecar:

| File | Contents |
|---|---|
| transformer/diffusion_pytorch_model.safetensors | 505 weight-only E4M3 weights + per-tensor weight_quantizer._amax / ._scale buffers + bf16 keep-modules (1819 tensors) |
| transformer/modelopt_state.pt | 724 KB tensor-free ModelOpt structural state (quantizer layout); needed to rebuild the quantizer modules |
| transformer/config.json | transformer config (action_gen=false) |
| quantization_config.json | recipe, exclusions, and the scale_layout (key suffixes, counts, granularity) |
| transformer/modelopt_quantized.pt | retained fallback: the ModelOpt .pt, loadable via modelopt.torch.opt.restore |

Load = from_config (action_gen=False) → modelopt.torch.opt.restore_from_modelopt_state → load_state_dict(strict=True). The loader reads only the safetensors file and the modelopt_state.pt sidecar, never modelopt_quantized.pt.

Security: modelopt_state.pt (and the retained modelopt_quantized.pt) are loaded with torch.load(weights_only=False), which executes pickle. Load this checkpoint only from a source you trust: a tampered sidecar allows arbitrary code execution at load time. The *.safetensors weights are safe; only the small structural sidecar uses pickle.
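One way to reduce the risk is to compare the sidecar's SHA-256 digest against a value obtained through a trusted channel before anything unpickles it. The sketch below uses only the standard library; `verify_sidecar` and the pinned digest are hypothetical names for illustration and are not part of this repository or of load_quantized.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sidecar(path: str, expected_sha256: str) -> None:
    """Raise ValueError if the file does not match the pinned digest.

    Hypothetical helper: call it before any code path that unpickles the file.
    """
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"{path}: SHA-256 mismatch (got {actual}); refusing to load"
        )
```

In use, this would be called as `verify_sidecar("transformer/modelopt_state.pt", PINNED_DIGEST)` before `load(".")`, where `PINNED_DIGEST` is a value you recorded yourself from a copy you trust. A digest shipped next to the file it protects adds little, since an attacker who can replace one can replace the other.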

Why a sidecar instead of pure export_hf_checkpoint? ModelOpt's unified HF export (diffusers dispatch) does not recognize Cosmos3OmniTransformer and drops the per-tensor FP8 scales, so its safetensors cannot be dequantized. Path B (above) preserves them. See docs/reports/session_3.md.

Recipe & scope (INV-2 / INV-3)

Weight-only FP8 E4M3 (activation quantizers disabled). Quantized: self_attn.*, mlp.*, mlp_moe_gen.*, lm_head (505 Linears). Kept bf16: token embeddings, all norms, time_embedder, proj_in/proj_out, audio/action adapters.
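To make "per-tensor weight-only FP8 E4M3" concrete, here is a minimal pure-Python sketch of the quantize-then-dequantize round trip for one tensor. It assumes the common convention scale = amax / 448 (448 is the largest finite E4M3 value); the exact definition of the stored `_scale` buffer in this checkpoint is not documented here, so treat that as an assumption. Real code would operate on torch tensors rather than lists.

```python
import math

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (the "fn" variant)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3 value (ties to even), saturating at +-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    _, exp2 = math.frexp(mag)        # mag = m * 2**exp2 with 0.5 <= m < 1
    exponent = max(exp2 - 1, -6)     # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)     # 3 mantissa bits: 8 steps per binade
    return sign * min(round(mag / step) * step, E4M3_MAX)


def fake_quantize_per_tensor(weights):
    """Quantize to E4M3 with one scale for the whole tensor, then dequantize.

    ASSUMPTION: scale = amax / 448. Returns (dequantized, amax, scale).
    """
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return list(weights), 0.0, 1.0
    scale = amax / E4M3_MAX
    dequantized = [round_to_e4m3(w / scale) * scale for w in weights]
    return dequantized, amax, scale
```

With three mantissa bits, the relative rounding error of a normal value is at most 2^-4, which is why only the weights (and not the norms or embeddings listed above) are stored this way.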

Equivalence

Reproduces the ModelOpt .pt (and thus the NVIDIA-style reference FP8 recipe) bitwise: weight round-trip max-abs-diff 0.0 (1812 tensors); pipeline latent error M1 = 0.0 and LPIPS = 0.0 on EC-01..04 at 8 steps and EC-01 at 35 steps (seed 123, UniPC flow_shift=10.0, 1f/480²).
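The weight round-trip figure is a maximum absolute difference taken over every tensor in both checkpoints. A minimal sketch of that kind of check is below, with plain lists of floats standing in for tensors; the function name and the dict layout are illustrative assumptions, not the script used to produce the numbers above.

```python
def max_abs_diff(reference: dict, candidate: dict) -> float:
    """Largest |a - b| over all entries of two name -> values mappings.

    Raises if the key sets or lengths differ, mirroring a strict load.
    """
    if reference.keys() != candidate.keys():
        mismatched = sorted(reference.keys() ^ candidate.keys())
        raise KeyError(f"tensor names differ: {mismatched}")
    worst = 0.0
    for name, ref_values in reference.items():
        cand_values = candidate[name]
        if len(ref_values) != len(cand_values):
            raise ValueError(f"{name}: length mismatch")
        for a, b in zip(ref_values, cand_values):
            worst = max(worst, abs(a - b))
    return worst
```

A result of exactly 0.0 means every compared value is identical, which is a stronger statement than "within tolerance".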

Limitations

  • action_gen=False build (matches the reference quantized checkpoint, whose .pt is action-adapter-stripped). No action-conditioned generation from this checkpoint.
  • Verification at the smoke setting (1 frame / 480×480); full-res 720p/189-frame is out of scope.
  • FP8 compute is ModelOpt fake-quant (compute in bf16); real-FP4/FP8 kernel speedups are out of scope.