	---
	base_model: nvidia/Cosmos3-Super-Text2Image
	library_name: diffusers
	pipeline_tag: text-to-image
	tags:
	- cosmos3
	- diffusers
	- fp8
	- quanto
	- optimum-quanto
	- text-to-image
	license: other
	license_name: openmdw1.1-license
	license_link: https://openmdw.ai/license/1-1/
	---

	# Cosmos3-Super-Text2Image Quanto FP8 Transformer

	This repository contains a transformer-only FP8/float8 quantization made with Hugging Face Optimum Quanto for [nvidia/Cosmos3-Super-Text2Image](https://huggingface.co/nvidia/Cosmos3-Super-Text2Image).

	This is a Quanto quantization, not an NVIDIA ModelOpt/NVFP quantization. When comparing it with separate NVFP experiments, treat the two explicitly as different quantization backends.

	Read NVIDIA's card, license, safety notes, and prompt-format guidance here:
	[nvidia/Cosmos3-Super-Text2Image](https://huggingface.co/nvidia/Cosmos3-Super-Text2Image).

	Only `transformer/` is provided as a weight artifact. The VAE, scheduler, tokenizers, safety checker, and other components are loaded from the base model.

	## Assemble The Pipeline

	```python
	import json
	import torch
	from diffusers import Cosmos3OmniPipeline, Cosmos3OmniTransformer
	from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

	transformer = Cosmos3OmniTransformer.from_pretrained(
	    "WaveCut/Cosmos3-Super-Text2Image-Quanto-FP8-Transformer",
	    subfolder="transformer",
	    torch_dtype=torch.bfloat16,
	)

	pipe = Cosmos3OmniPipeline.from_pretrained(
	    "nvidia/Cosmos3-Super-Text2Image",
	    transformer=transformer,
	    torch_dtype=torch.bfloat16,
	    device_map="cuda",
	    enable_safety_checker=True,
	)
	# Ensure the injected transformer and Cosmos intermediate tensors share CUDA.
	pipe.to("cuda")
	pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=3.0)

	# Use the JSON-caption format described by the original model card.
	json_caption = {
	    "subjects": [],
	    "background_setting": "A concise scene description.",
	    "comprehensive_t2i_caption": "A detailed natural-language caption.",
	    "resolution": {"H": 1024, "W": 1024},
	    "aspect_ratio": "1,1",
	}

	result = pipe(
	    prompt=json.dumps(json_caption),
	    negative_prompt="",
	    num_frames=1,
	    height=1024,
	    width=1024,
	    num_inference_steps=50,
	    guidance_scale=4.0,
	    generator=torch.Generator(device="cuda").manual_seed(1143),
	)
	result.video[0].save("cosmos3_fp8.png")
	```
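The example above repeats the image size in two places: inside the JSON caption (`resolution`) and in the pipeline call (`height`, `width`). The following is a minimal sketch of a helper that builds both from one set of arguments so they cannot drift apart. The helper name `build_call_kwargs` is hypothetical, and the assumption that the two sizes should match is taken only from the example above, not from upstream documentation.

```python
import json


def build_call_kwargs(background, caption, height, width, aspect_ratio, subjects=None):
    """Build the JSON-caption prompt plus matching height/width kwargs.

    Field names are copied from the example above; this helper itself is
    a hypothetical convenience, not part of diffusers or the base model.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    caption_dict = {
        "subjects": list(subjects or []),
        "background_setting": background,
        "comprehensive_t2i_caption": caption,
        "resolution": {"H": height, "W": width},
        "aspect_ratio": aspect_ratio,
    }
    return {
        "prompt": json.dumps(caption_dict),
        "height": height,
        "width": width,
    }


call_kwargs = build_call_kwargs(
    background="A concise scene description.",
    caption="A detailed natural-language caption.",
    height=1024,
    width=1024,
    aspect_ratio="1,1",  # copied verbatim from the example above
)
print(call_kwargs["prompt"])
```

The resulting dictionary could then be unpacked into the call from the example, for instance `pipe(**call_kwargs, negative_prompt="", num_frames=1, ...)`, with the remaining arguments unchanged.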

	## Benchmarks

	All measurements were taken on a single RunPod NVIDIA B200 instance with local container storage and cached model files, using PyTorch `2.9.1+cu130`, 1024x1024 image generation, 50 inference steps, guidance scale 4.0, `flow_shift=3.0`, and the system prompt enabled.
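The tables in this section report average, minimum, and maximum generation times. This card does not include the benchmark script, so the following is only a sketch of how such figures could be aggregated. The function `time_runs` and its `sync` parameter are hypothetical names; the stand-in workload exists only so the sketch runs without a GPU.

```python
import statistics
import time


def time_runs(fn, runs, sync=None):
    """Call fn() `runs` times and return avg/min/max wall-clock seconds.

    `sync` is an optional callable invoked before and after each run, so
    asynchronously queued GPU work is included in the measured interval.
    """
    durations = []
    for _ in range(runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        durations.append(time.perf_counter() - start)
    return {
        "avg": statistics.mean(durations),
        "min": min(durations),
        "max": max(durations),
        "runs": len(durations),
    }


# Stand-in workload; replace with a real pipeline call when measuring.
stats = time_runs(lambda: sum(range(10_000)), runs=3)
print(f"avg={stats['avg']:.6f}s min={stats['min']:.6f}s max={stats['max']:.6f}s")
```

With a real pipeline, `fn` would wrap the `pipe(...)` call and `sync` could be `torch.cuda.synchronize`. The stress set in this card uses ten different prompts rather than one prompt repeated, so a faithful reproduction would iterate over prompts instead of calling the same function each time.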

	### Transformer Component Load

	This measures loading the transformer component and moving it to CUDA in isolation.

	| Variant | Load to CUDA | VRAM after load | Torch allocated | Torch reserved | Transformer safetensors |
	| --- | ---: | ---: | ---: | ---: | ---: |
	| BF16 base transformer | 23.80s | 122,758 MiB | 122,121 MiB | 122,132 MiB | 119.21 GiB |
	| FP8 transformer | 74.45s | 65,756 MiB | 62,356 MiB | 65,036 MiB | 60.35 GiB |
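The table mixes MiB (memory columns) and GiB (file size), which makes the columns hard to compare at a glance. The short sketch below converts the allocated memory to GiB and derives a few ratios. All numbers are copied from the table; the variable names are illustrative only.

```python
MIB_PER_GIB = 1024

# Values copied from the Transformer Component Load table.
rows = {
    "bf16": {"load_s": 23.80, "vram_mib": 122_758, "allocated_mib": 122_121, "file_gib": 119.21},
    "fp8": {"load_s": 74.45, "vram_mib": 65_756, "allocated_mib": 62_356, "file_gib": 60.35},
}


def mib_to_gib(mib):
    """Convert mebibytes to gibibytes."""
    return mib / MIB_PER_GIB


# FP8 relative to BF16: on-disk size, VRAM after load, and load time.
file_ratio = rows["fp8"]["file_gib"] / rows["bf16"]["file_gib"]
vram_ratio = rows["fp8"]["vram_mib"] / rows["bf16"]["vram_mib"]
load_ratio = rows["fp8"]["load_s"] / rows["bf16"]["load_s"]

for name, row in rows.items():
    allocated_gib = mib_to_gib(row["allocated_mib"])
    extra_gib = allocated_gib - row["file_gib"]
    print(f"{name}: allocated {allocated_gib:.2f} GiB, {extra_gib:.2f} GiB above file size")

print(f"file size ratio: {file_ratio:.3f}")
print(f"VRAM after load ratio: {vram_ratio:.3f}")
print(f"load time ratio: {load_ratio:.2f}x")
```

By these figures the FP8 transformer is roughly half the size of the BF16 one on disk and in VRAM, while taking about three times as long to load in isolation.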

	### Full Pipeline Generation

	This measures end-to-end Diffusers pipeline loading and generation. The stress set consists of ten handwritten JSON-caption prompts that target Cyrillic text, reflections, multi-object composition, anatomy, and small details.

	| Variant | Full pipeline load | VRAM after load | Torch allocated after load | Avg generation time | Min / max generation time | Peak sampled VRAM | Images |
	| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
	| BF16 base pipeline | 31.31s | 125,134 MiB | 124,386 MiB | 16.05s | 15.51s / 17.97s | 141,104 MiB | 10 |
	| FP8 transformer pipeline | 28.06s | 69,276 MiB | 65,865 MiB | 37.53s | 36.43s / 40.00s | 82,198 MiB | 10 |
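The table implies a trade-off between memory and speed. As a worked example, the sketch below derives the per-image slowdown, the peak-VRAM saving, and the estimated total wall time for a batch of images. The inputs are copied from the table; `total_seconds` is a hypothetical helper that assumes generation time per image stays at the reported average.

```python
# Values copied from the Full Pipeline Generation table.
bf16 = {"load_s": 31.31, "avg_gen_s": 16.05, "peak_mib": 141_104}
fp8 = {"load_s": 28.06, "avg_gen_s": 37.53, "peak_mib": 82_198}


def total_seconds(load_s, avg_gen_s, images):
    """Estimate wall time: one pipeline load plus `images` average generations."""
    return load_s + avg_gen_s * images


slowdown = fp8["avg_gen_s"] / bf16["avg_gen_s"]
peak_saving = 1 - fp8["peak_mib"] / bf16["peak_mib"]

bf16_total = total_seconds(bf16["load_s"], bf16["avg_gen_s"], images=10)
fp8_total = total_seconds(fp8["load_s"], fp8["avg_gen_s"], images=10)

print(f"FP8 generation slowdown: {slowdown:.2f}x")
print(f"FP8 peak VRAM saving: {peak_saving:.1%}")
print(f"10 images incl. load: BF16 {bf16_total:.2f}s, FP8 {fp8_total:.2f}s")
```

On this hardware the FP8 pipeline loads a few seconds faster but generates each image more than twice as slowly, so the faster load does not offset the slower generation even for a single image. In these measurements the benefit of the FP8 transformer is the lower memory footprint, not speed.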

	### Original NVIDIA Example Caption

	The original model repository provides [`assets/example_caption.json`](https://huggingface.co/nvidia/Cosmos3-Super-Text2Image/blob/main/assets/example_caption.json). The images below were generated locally from that JSON caption with seed 1143, at 1024x1024, 50 steps, and guidance scale 4.0.

	| Variant | Pipeline load | Generation time | Peak sampled VRAM |
	| --- | ---: | ---: | ---: |
	| BF16 base pipeline | 35.41s | 18.01s | 141,098 MiB |
	| FP8 transformer pipeline | 29.66s | 39.38s | 71,820 MiB |
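The "Peak sampled VRAM" column suggests memory was polled during generation, but this card does not describe the sampling method. The sketch below shows one possible approach, assuming `nvidia-smi` is available: a background thread polls used memory and keeps the maximum. `PeakSampler` and `read_used_mib_nvidia_smi` are hypothetical names, and the demo uses a fake reader so it runs without a GPU.

```python
import subprocess
import threading
import time


def read_used_mib_nvidia_smi(gpu_index=0):
    """Read currently used memory (MiB) for one GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return int(out.strip().splitlines()[0])


class PeakSampler:
    """Context manager that polls a reader and records the highest value seen."""

    def __init__(self, read_used_mib, interval_s=0.1):
        self._read = read_used_mib
        self._interval_s = interval_s
        self.peak_mib = 0

    def __enter__(self):
        self._stop = threading.Event()
        self.peak_mib = self._read()  # initial sample before work starts
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()
        return self

    def _loop(self):
        # Event.wait returns False on timeout, True once stop is requested.
        while not self._stop.wait(self._interval_s):
            self.peak_mib = max(self.peak_mib, self._read())

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
        self.peak_mib = max(self.peak_mib, self._read())  # final sample
        return False


# Demo with a fake reader; swap in read_used_mib_nvidia_smi on a GPU machine.
fake_readings = iter([100, 250, 180])


def fake_reader():
    return next(fake_readings, 180)


with PeakSampler(fake_reader, interval_s=0.01) as sampler:
    time.sleep(0.1)  # stands in for the pipeline call

print(f"peak: {sampler.peak_mib} MiB")
```

A polled peak can miss short-lived spikes between samples, so values obtained this way are a lower bound on the true peak.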

	BF16 reference output:

	![BF16 output for NVIDIA example caption](examples/nvidia_example_caption_bf16.png)

	FP8 transformer output:

	![FP8 output for NVIDIA example caption](examples/nvidia_example_caption_fp8.png)

	## Stress Prompt Outputs

	These are the ten FP8 outputs from the handwritten JSON-caption stress set used in the Full Pipeline Generation benchmark above. The set stresses Cyrillic signage, exact text placement, reflections, small-object consistency, multi-plane composition, UI panels, and human anatomy.

	| # | Stress focus | FP8 output |
	| --- | --- | --- |
	| 01 | Metro archive reading room | ![Metro archive reading room](examples/01_metro_archive_reading_room_fp8.png) |
	| 02 | Arctic greenhouse night shift | ![Arctic greenhouse night shift](examples/02_arctic_greenhouse_night_shift_fp8.png) |
	| 03 | Control room restoration | ![Control room restoration](examples/03_control_room_restoration_fp8.png) |
	| 04 | Rain market cross section | ![Rain market cross section](examples/04_rain_market_cross_section_fp8.png) |
	| 05 | Manuscript restoration table | ![Manuscript restoration table](examples/05_manuscript_restoration_table_fp8.png) |
	| 06 | Robotic assembly line signage | ![Robotic assembly line signage](examples/06_robotic_assembly_line_signage_fp8.png) |
	| 07 | Kitchen storm chess table | ![Kitchen storm chess table](examples/07_kitchen_storm_chess_table_fp8.png) |
	| 08 | Orbital cockpit Cyrillic UI | ![Orbital cockpit Cyrillic UI](examples/08_orbital_cockpit_cyrillic_ui_fp8.png) |
	| 09 | Flood command center | ![Flood command center](examples/09_flood_command_center_fp8.png) |
	| 10 | Cyrillic newspaper press | ![Cyrillic newspaper press](examples/10_cyrillic_newspaper_press_fp8.png) |

	## Notes

	- The upstream card documents BF16 as the tested precision. Treat this FP8 transformer as experimental.
	- The safety checker is not included in this repo; load it from the base model if your use case requires it.
	- Text rendering, especially exact Cyrillic text, remains a difficult case for this model family. Quantization should be evaluated visually for your target prompt distribution.