---
base_model: nvidia/Cosmos3-Super-Text2Image
library_name: diffusers
pipeline_tag: text-to-image
tags:
  - cosmos3
  - diffusers
  - fp8
  - quanto
  - optimum-quanto
  - text-to-image
license: other
license_name: openmdw1.1-license
license_link: https://openmdw.ai/license/1-1/
---

# Cosmos3-Super-Text2Image Quanto FP8 Transformer

This repository contains a transformer-only FP8/float8 quantization made with Hugging Face Optimum Quanto for [nvidia/Cosmos3-Super-Text2Image](https://huggingface.co/nvidia/Cosmos3-Super-Text2Image).

**This is a Quanto quantization, not an NVIDIA ModelOpt/NVFP quantization.** When comparing the separate NVFP experiments against this repo, treat them explicitly as a different quantization backend.

Read NVIDIA's card, license, safety notes, and prompt-format guidance here:
[nvidia/Cosmos3-Super-Text2Image](https://huggingface.co/nvidia/Cosmos3-Super-Text2Image).

Only `transformer/` is provided as a weight artifact. The VAE, scheduler, tokenizers, safety checker, and other components are loaded from the base model.

## Assemble The Pipeline

```python
import json
import torch
from diffusers import Cosmos3OmniPipeline, Cosmos3OmniTransformer
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

transformer = Cosmos3OmniTransformer.from_pretrained(
    "WaveCut/Cosmos3-Super-Text2Image-Quanto-FP8-Transformer",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Super-Text2Image",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    enable_safety_checker=True,
)
# Ensure the injected transformer and Cosmos intermediate tensors share CUDA.
pipe.to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=3.0)

# Use the JSON-caption format described by the original model card.
json_caption = {
    "subjects": [],
    "background_setting": "A concise scene description.",
    "comprehensive_t2i_caption": "A detailed natural-language caption.",
    "resolution": {"H": 1024, "W": 1024},
    "aspect_ratio": "1,1",
}

result = pipe(
    prompt=json.dumps(json_caption),
    negative_prompt="",
    num_frames=1,
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=4.0,
    generator=torch.Generator(device="cuda").manual_seed(1143),
)
result.video[0].save("cosmos3_fp8.png")
```
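
If you generate at several resolutions, a small helper can keep the caption fields in sync with `height` and `width`. The sketch below is illustrative only: `build_json_caption` is a hypothetical name, and the comma-joined width,height form of `aspect_ratio` is an assumption extrapolated from the `"1,1"` value in the example above. Check the upstream prompt-format guidance before relying on it for non-square sizes.

```python
import json
from math import gcd


def build_json_caption(caption: str, background: str, height: int, width: int) -> str:
    """Serialize a caption dict shaped like the example above (hypothetical helper)."""
    divisor = gcd(height, width)
    payload = {
        "subjects": [],
        "background_setting": background,
        "comprehensive_t2i_caption": caption,
        "resolution": {"H": height, "W": width},
        # Assumption: reduced width,height joined by a comma, mirroring "1,1" above.
        "aspect_ratio": f"{width // divisor},{height // divisor}",
    }
    return json.dumps(payload)


prompt = build_json_caption(
    caption="A detailed natural-language caption.",
    background="A concise scene description.",
    height=1024,
    width=1024,
)
print(prompt)
```

The returned string can be passed as `prompt=` in place of `json.dumps(json_caption)`.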

## Benchmarks

All measurements were taken on a single RunPod NVIDIA B200 instance with local container storage and cached model files, using PyTorch `2.9.1+cu130`, 1024x1024 image generation, 50 inference steps, guidance scale 4.0, `flow_shift=3.0`, and the system prompt enabled.

### Transformer Component Load

This measures loading the transformer component and moving it to CUDA in isolation.

| Variant | Load to CUDA | VRAM after load | Torch allocated | Torch reserved | Transformer safetensors |
| --- | ---: | ---: | ---: | ---: | ---: |
| BF16 base transformer | 23.80s | 122,758 MiB | 122,121 MiB | 122,132 MiB | 119.21 GiB |
| FP8 transformer | 74.45s | 65,756 MiB | 62,356 MiB | 65,036 MiB | 60.35 GiB |
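
To read the table as relative changes, the arithmetic can be sketched as follows. The numbers are copied from the table above; `relative_change` is a hypothetical helper name. On these figures, VRAM after load and checkpoint size drop by a little under half, while load time roughly triples.

```python
# Values copied from the "Transformer Component Load" table above.
bf16 = {"load_s": 23.80, "vram_mib": 122_758, "safetensors_gib": 119.21}
fp8 = {"load_s": 74.45, "vram_mib": 65_756, "safetensors_gib": 60.35}


def relative_change(baseline: float, value: float) -> float:
    """Return (value - baseline) / baseline; negative means smaller than baseline."""
    return (value - baseline) / baseline


# One relative change per metric, FP8 measured against the BF16 baseline.
changes = {key: relative_change(bf16[key], fp8[key]) for key in bf16}
for key, change in changes.items():
    print(f"{key}: {change:+.1%} vs BF16")
```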

### Full Pipeline Generation

This measures end-to-end Diffusers pipeline loading and generation. The stress set consists of ten handwritten JSON-caption prompts that target Cyrillic text, reflections, multi-object composition, anatomy, and small details.

| Variant | Full pipeline load | VRAM after load | Torch allocated after load | Avg generation time | Min / max generation time | Peak sampled VRAM | Images |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| BF16 base pipeline | 31.31s | 125,134 MiB | 124,386 MiB | 16.05s | 15.51s / 17.97s | 141,104 MiB | 10 |
| FP8 transformer pipeline | 28.06s | 69,276 MiB | 65,865 MiB | 37.53s | 36.43s / 40.00s | 82,198 MiB | 10 |
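
The average, minimum, and maximum generation times above are per-image wall-clock summaries. A minimal sketch of how such a summary could be collected is shown below; it is not the script used for these numbers. `time_generations` and the `generate` callable are hypothetical, and a stand-in workload replaces the real pipeline call so the snippet runs without a GPU.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def time_generations(generate: Callable[[str], object], prompts: Sequence[str]) -> dict:
    """Time one generate(prompt) call per prompt and summarize wall-clock seconds."""
    durations = []
    for prompt in prompts:
        start = time.perf_counter()
        # With a real CUDA pipeline, make sure `generate` synchronizes before returning.
        generate(prompt)
        durations.append(time.perf_counter() - start)
    return {
        "avg_s": mean(durations),
        "min_s": min(durations),
        "max_s": max(durations),
        "images": len(durations),
    }


# Stand-in workload; swap in a function that calls the pipeline for real measurements.
stats = time_generations(lambda prompt: time.sleep(0.01), ["prompt a", "prompt b"])
print(stats)
```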

### Original NVIDIA Example Caption

The original model repository provides [`assets/example_caption.json`](https://huggingface.co/nvidia/Cosmos3-Super-Text2Image/blob/main/assets/example_caption.json). The images below were generated locally from that same JSON caption with seed 1143, 1024x1024 resolution, 50 steps, and guidance scale 4.0.

| Variant | Pipeline load | Generation time | Peak sampled VRAM |
| --- | ---: | ---: | ---: |
| BF16 base pipeline | 35.41s | 18.01s | 141,098 MiB |
| FP8 transformer pipeline | 29.66s | 39.38s | 71,820 MiB |

BF16 reference output:

![BF16 output for NVIDIA example caption](examples/nvidia_example_caption_bf16.png)

FP8 transformer output:

![FP8 output for NVIDIA example caption](examples/nvidia_example_caption_fp8.png)

## Stress Prompt Outputs

These are the ten FP8 outputs from the handwritten JSON-caption stress prompt set used in the benchmark table above. The set stresses Cyrillic signage, exact text placement, reflections, small-object consistency, multi-plane composition, UI panels, and human anatomy.

| # | Stress focus | FP8 output |
| --- | --- | --- |
| 01 | Metro archive reading room | ![Metro archive reading room](examples/01_metro_archive_reading_room_fp8.png) |
| 02 | Arctic greenhouse night shift | ![Arctic greenhouse night shift](examples/02_arctic_greenhouse_night_shift_fp8.png) |
| 03 | Control room restoration | ![Control room restoration](examples/03_control_room_restoration_fp8.png) |
| 04 | Rain market cross section | ![Rain market cross section](examples/04_rain_market_cross_section_fp8.png) |
| 05 | Manuscript restoration table | ![Manuscript restoration table](examples/05_manuscript_restoration_table_fp8.png) |
| 06 | Robotic assembly line signage | ![Robotic assembly line signage](examples/06_robotic_assembly_line_signage_fp8.png) |
| 07 | Kitchen storm chess table | ![Kitchen storm chess table](examples/07_kitchen_storm_chess_table_fp8.png) |
| 08 | Orbital cockpit Cyrillic UI | ![Orbital cockpit Cyrillic UI](examples/08_orbital_cockpit_cyrillic_ui_fp8.png) |
| 09 | Flood command center | ![Flood command center](examples/09_flood_command_center_fp8.png) |
| 10 | Cyrillic newspaper press | ![Cyrillic newspaper press](examples/10_cyrillic_newspaper_press_fp8.png) |

## Notes

- The upstream card documents BF16 as the tested precision. Treat this FP8 transformer as experimental.
- The safety checker is not included in this repo; load it from the base model if your use case requires it.
- Text rendering, especially exact Cyrillic text, remains a difficult case for this model family. Quantization should be evaluated visually for your target prompt distribution.
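
As a rough numeric complement to visual inspection, you could compare a BF16 and an FP8 image generated from the same prompt and seed with a simple metric such as PSNR. This is a sketch under stated assumptions, not part of the evaluation reported here: it takes two equal-length sequences of 8-bit channel values (for example, flattened pixel data that you have already loaded), and `psnr` is a hypothetical helper name. PSNR does not capture text legibility, so it should not replace looking at the images.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[int], candidate: Sequence[int], max_value: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length value sequences."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and have the same length")
    # Mean squared error over all channel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10 * math.log10(max_value**2 / mse)


# Toy data standing in for flattened pixel values of two images.
bf16_pixels = [0, 128, 255, 64]
fp8_pixels = [0, 130, 250, 64]
score = psnr(bf16_pixels, fp8_pixels)
print(f"PSNR: {score:.2f} dB")
```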