---
base_model: nvidia/Cosmos3-Super-Text2Image
library_name: diffusers
pipeline_tag: text-to-image
tags:
  - cosmos3
  - diffusers
  - fp8
  - quanto
  - optimum-quanto
  - text-to-image
license: other
license_name: openmdw1.1-license
license_link: https://openmdw.ai/license/1-1/
---

# Cosmos3-Super-Text2Image Quanto FP8 Transformer

This repository contains a transformer-only FP8/float8 quantization made with Hugging Face Optimum Quanto for [nvidia/Cosmos3-Super-Text2Image](https://huggingface.co/nvidia/Cosmos3-Super-Text2Image).

**This is a Quanto quantization, not an NVIDIA ModelOpt/NVFP quantization.** When comparing it against the separate NVFP experiments, treat this repo explicitly as a different quantization backend.

Read NVIDIA's card, license, safety notes, and prompt-format guidance here:
[nvidia/Cosmos3-Super-Text2Image](https://huggingface.co/nvidia/Cosmos3-Super-Text2Image).

Only `transformer/` is provided as a weight artifact. The VAE, scheduler, tokenizers, safety checker, and other components are loaded from the base model.

## Assemble The Pipeline

```python
import json
import torch
from diffusers import Cosmos3OmniPipeline, Cosmos3OmniTransformer
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

transformer = Cosmos3OmniTransformer.from_pretrained(
    "WaveCut/Cosmos3-Super-Text2Image-Quanto-FP8-Transformer",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Super-Text2Image",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    enable_safety_checker=True,
)
# Ensure the injected transformer and the pipeline's intermediate tensors are on the same CUDA device.
pipe.to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=3.0)

# Use the JSON-caption format described by the original model card.
json_caption = {
    "subjects": [],
    "background_setting": "A concise scene description.",
    "comprehensive_t2i_caption": "A detailed natural-language caption.",
    "resolution": {"H": 1024, "W": 1024},
    "aspect_ratio": "1,1",
}

result = pipe(
    prompt=json.dumps(json_caption),
    negative_prompt="",
    num_frames=1,
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=4.0,
    generator=torch.Generator(device="cuda").manual_seed(1143),
)
result.video[0].save("cosmos3_fp8.png")
```
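
In the example above, the caption's `resolution` field and the `height`/`width` arguments carry the same values. A minimal sketch for keeping them in sync, assuming they are meant to match; `pipeline_kwargs_from_caption` is a hypothetical helper, not part of Diffusers or the upstream card:

```python
import json


def pipeline_kwargs_from_caption(caption):
    """Build prompt/height/width kwargs from one JSON-caption dict.

    Assumption: the caption's "resolution" should equal the pipeline's
    height/width arguments, as in the example above.
    """
    resolution = caption["resolution"]
    return {
        "prompt": json.dumps(caption),
        "height": resolution["H"],
        "width": resolution["W"],
    }


caption = {
    "subjects": [],
    "background_setting": "A concise scene description.",
    "comprehensive_t2i_caption": "A detailed natural-language caption.",
    "resolution": {"H": 1024, "W": 1024},
    "aspect_ratio": "1,1",
}
kwargs = pipeline_kwargs_from_caption(caption)
# The remaining arguments would be passed as before, for example:
# result = pipe(**kwargs, negative_prompt="", num_frames=1, num_inference_steps=50, guidance_scale=4.0)
```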

## Benchmarks

All numbers were measured on a single RunPod NVIDIA B200 instance with local container storage and cached model files, using PyTorch `2.9.1+cu130`. Each run generated 1024x1024 images with 50 inference steps, guidance scale 4.0, `flow_shift=3.0`, and the system prompt enabled.

### Transformer Component Load

This table measures loading only the transformer component and moving it to CUDA, without the rest of the pipeline.

| Variant | Load to CUDA | VRAM after load | Torch allocated | Torch reserved | Transformer safetensors |
| --- | ---: | ---: | ---: | ---: | ---: |
| BF16 base transformer | 23.80s | 122,758 MiB | 122,121 MiB | 122,132 MiB | 119.21 GiB |
| FP8 transformer | 74.45s | 65,756 MiB | 62,356 MiB | 65,036 MiB | 60.35 GiB |
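
Columns like the ones above could be collected with a helper along the following lines. This is a sketch under assumptions, not the script used for this table: `measure_load` is a hypothetical name, and "VRAM after load" is assumed here to mean device-wide used memory as reported by `torch.cuda.mem_get_info`. The PyTorch import is optional so that the timing part also runs on a machine without it.

```python
import time

try:
    import torch
except ImportError:  # timing still works without PyTorch installed
    torch = None

MIB = 1024 ** 2


def measure_load(load_fn):
    """Time load_fn() and, if CUDA is available, report memory in MiB."""
    cuda = torch is not None and torch.cuda.is_available()
    if cuda:
        torch.cuda.synchronize()  # finish pending GPU work before timing
    start = time.perf_counter()
    obj = load_fn()
    if cuda:
        torch.cuda.synchronize()  # wait for host-to-device copies to finish
    stats = {"load_seconds": time.perf_counter() - start}
    if cuda:
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        stats["vram_used_mib"] = (total_bytes - free_bytes) / MIB
        stats["torch_allocated_mib"] = torch.cuda.memory_allocated() / MIB
        stats["torch_reserved_mib"] = torch.cuda.memory_reserved() / MIB
    return obj, stats


# Stand-in loader; a real run would load the transformer and move it to CUDA.
loaded, load_stats = measure_load(lambda: list(range(10)))
print(load_stats)
```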

### Full Pipeline Generation

This table measures end-to-end Diffusers pipeline loading and generation. The stress set consists of ten handwritten JSON-caption prompts that target Cyrillic text, reflections, multi-object composition, anatomy, and small details.

| Variant | Full pipeline load | VRAM after load | Torch allocated after load | Avg generation time | Min / max generation time | Peak sampled VRAM | Images |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| BF16 base pipeline | 31.31s | 125,134 MiB | 124,386 MiB | 16.05s | 15.51s / 17.97s | 141,104 MiB | 10 |
| FP8 transformer pipeline | 28.06s | 69,276 MiB | 65,865 MiB | 37.53s | 36.43s / 40.00s | 82,198 MiB | 10 |
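
The memory and speed trade-off in the table above can be summarized as ratios. The snippet below only redoes arithmetic on the table's own values; the dictionary keys are illustrative names, not fields from any benchmark output.

```python
# Values copied from the Full Pipeline Generation table above.
BF16 = {"load_vram_mib": 125_134, "peak_vram_mib": 141_104, "avg_gen_s": 16.05}
FP8 = {"load_vram_mib": 69_276, "peak_vram_mib": 82_198, "avg_gen_s": 37.53}


def ratio(key):
    """FP8 value divided by BF16 value for one table column."""
    return FP8[key] / BF16[key]


for key in BF16:
    print(f"{key}: FP8 is {ratio(key):.2f}x BF16")
```

On this stress set, the FP8 pipeline uses roughly 55% of the BF16 VRAM after load and 58% of the BF16 peak, and takes about 2.3 times as long per image.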

### Original NVIDIA Example Caption

The original model repository provides [`assets/example_caption.json`](https://huggingface.co/nvidia/Cosmos3-Super-Text2Image/blob/main/assets/example_caption.json). The images below were generated locally from that JSON caption with seed 1143, at 1024x1024, with 50 steps and guidance scale 4.0.

| Variant | Pipeline load | Generation time | Peak sampled VRAM |
| --- | ---: | ---: | ---: |
| BF16 base pipeline | 35.41s | 18.01s | 141,098 MiB |
| FP8 transformer pipeline | 29.66s | 39.38s | 71,820 MiB |

BF16 reference output:

![BF16 output for NVIDIA example caption](examples/nvidia_example_caption_bf16.png)

FP8 transformer output:

![FP8 output for NVIDIA example caption](examples/nvidia_example_caption_fp8.png)

## Stress Prompt Outputs

These are the ten FP8 outputs from the handwritten JSON-caption stress prompt set used in the Full Pipeline Generation table above. The set stresses Cyrillic signage, exact text placement, reflections, small-object consistency, multi-plane composition, UI panels, and human anatomy.

| # | Stress focus | FP8 output |
| --- | --- | --- |
| 01 | Metro archive reading room | ![Metro archive reading room](examples/01_metro_archive_reading_room_fp8.png) |
| 02 | Arctic greenhouse night shift | ![Arctic greenhouse night shift](examples/02_arctic_greenhouse_night_shift_fp8.png) |
| 03 | Control room restoration | ![Control room restoration](examples/03_control_room_restoration_fp8.png) |
| 04 | Rain market cross section | ![Rain market cross section](examples/04_rain_market_cross_section_fp8.png) |
| 05 | Manuscript restoration table | ![Manuscript restoration table](examples/05_manuscript_restoration_table_fp8.png) |
| 06 | Robotic assembly line signage | ![Robotic assembly line signage](examples/06_robotic_assembly_line_signage_fp8.png) |
| 07 | Kitchen storm chess table | ![Kitchen storm chess table](examples/07_kitchen_storm_chess_table_fp8.png) |
| 08 | Orbital cockpit Cyrillic UI | ![Orbital cockpit Cyrillic UI](examples/08_orbital_cockpit_cyrillic_ui_fp8.png) |
| 09 | Flood command center | ![Flood command center](examples/09_flood_command_center_fp8.png) |
| 10 | Cyrillic newspaper press | ![Cyrillic newspaper press](examples/10_cyrillic_newspaper_press_fp8.png) |

## Notes

- The upstream card documents BF16 as the tested precision. Treat this FP8 transformer as experimental.
- The safety checker is not included in this repo; load it from the base model if your use case requires it.
- Text rendering, especially exact Cyrillic text, remains a difficult case for this model family. Evaluate the quantized output visually on prompts representative of your use case.
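
Visual inspection can be complemented by a rough numeric comparison between a BF16 image and its FP8 counterpart generated from the same caption and seed. The sketch below computes PSNR over raw 8-bit pixel values using only the standard library; `psnr` is a hypothetical helper, and a low score does not by itself mean the FP8 image is worse, since small numeric differences can lead to different but equally plausible images.

```python
import math


def psnr(pixels_a, pixels_b, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(pixels_a) != len(pixels_b) or len(pixels_a) == 0:
        raise ValueError("inputs must be non-empty and have the same length")
    squared_error = sum((a - b) ** 2 for a, b in zip(pixels_a, pixels_b))
    mse = squared_error / len(pixels_a)
    if mse == 0:
        return math.inf  # identical inputs
    return 10 * math.log10(max_value ** 2 / mse)


# With Pillow installed, raw RGB bytes could be read like this (not run here):
#   from PIL import Image
#   bf16 = Image.open("examples/nvidia_example_caption_bf16.png").convert("RGB").tobytes()
#   fp8 = Image.open("examples/nvidia_example_caption_fp8.png").convert("RGB").tobytes()
#   print(psnr(bf16, fp8))
```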