---
language:
- en
license: other
library_name: diffusers
pipeline_tag: text-to-image
tags:
- text-to-image
- diffusers
- quanto
- int8
- z-image
- transformer-quantization
base_model:
- Tongyi-MAI/Z-Image
base_model_relation: quantized
---

# Z-Image INT8 (Quanto)

This repository provides an INT8-quantized variant of [Tongyi-MAI/Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image):
- **Only** the `transformer` is quantized with **Quanto weight-only INT8**.
- `text_encoder`, `vae`, `scheduler`, and `tokenizer` remain unchanged.
- Inference API stays compatible with `diffusers.ZImagePipeline`.

> Please follow the original upstream model license and usage terms. `license: other` means this repo inherits upstream licensing constraints.

## Model Details

- **Base model**: `Tongyi-MAI/Z-Image`
- **Quantization method**: `optimum-quanto` (weight-only INT8)
- **Quantized part**: `transformer`
- **Compute dtype**: `bfloat16`
- **Pipeline**: `diffusers.ZImagePipeline`
- **Negative prompt support**: Yes (same pipeline API as the base model)

## Platform Support

- ✅ Supported: Linux/Windows with NVIDIA CUDA
- ⚠️ Limited support: macOS Apple Silicon (MPS, usually much slower than CUDA)
- ❌ Not supported: macOS Intel

## Files

Key files in this repository:
- `model_index.json`
- `transformer/diffusion_pytorch_model.safetensors` (INT8-quantized weights)
- `text_encoder/*`, `vae/*`, `scheduler/*`, `tokenizer/*` (not quantized)
- `zimage_quanto_bench_results/*` (benchmark metrics and baseline-vs-int8 images)
- `test_outputs/*` (generated examples)

## Installation

Python 3.10+ is recommended.

```bash
# Create env (optional)
python -m venv .venv

# Windows
.venv\Scripts\activate

# Linux/macOS
# source .venv/bin/activate

python -m pip install --upgrade pip

# PyTorch (NVIDIA CUDA, example)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# PyTorch (macOS Apple Silicon, MPS)
# pip install torch

# Inference dependencies
pip install diffusers transformers accelerate safetensors sentencepiece optimum-quanto pillow
```

## Quick Start (Diffusers)

This repo already stores quantized weights, so you do **not** need to re-run quantization during loading.

```python
import torch

from diffusers import ZImagePipeline

model_id = "ixim/Z-Image-INT8"

if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16
elif torch.backends.mps.is_available():
    # Apple Silicon
    device = "mps"
    dtype = torch.bfloat16
else:
    # CPU fallback (functional but very slow for this model)
    device = "cpu"
    dtype = torch.float32

pipe = ZImagePipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
)

pipe.enable_attention_slicing()

if device == "cuda":
    pipe.enable_model_cpu_offload()
else:
    pipe = pipe.to(device)

prompt = "A cinematic portrait of a young woman, soft lighting, high detail"
negative_prompt = "blurry, sad, low quality, distorted face, extra limbs, artifacts"
# Use CPU generator for best cross-device compatibility (cpu/mps/cuda)
generator = torch.Generator(device="cpu").manual_seed(42)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=4.0,
    generator=generator,
).images[0]

image.save("zimage_int8_sample.png")
print("Saved: zimage_int8_sample.png")
```

## macOS Notes & Troubleshooting

- macOS Intel is no longer supported for this model in this repository.
- If you need macOS inference, use Apple Silicon (`mps`) only.
- On Apple Silicon, warnings like `CUDA not available` and `Disabling autocast` are expected in non-CUDA execution paths.
- Slower generation on Mac is expected compared with high-end NVIDIA GPUs. To improve speed on Apple Silicon:
    - Ensure the script uses `mps` (as in the example above), not `cpu`.
    - Start with `height=512`, `width=512`, and fewer steps (e.g., `20`-`28`) before scaling up.

## Additional Generated Samples (INT8)

The following two images were generated with this quantized model:

### 1) `en_portrait_1024x1024.png`

- **Prompt**: `A cinematic portrait of a young woman standing by the window, golden hour sunlight, shallow depth of field, film grain, ultra-detailed skin texture, photorealistic`

<div align="center"><img src="test_outputs/en_portrait_1024x1024.png" width="512" /></div>

### 2) `cn_scene_1024x1024.png`

- **Prompt**: `一只橘猫趴在堆满旧书的木桌上打盹,午后阳光透过窗帘洒进来,暖色调,胶片风格,细腻毛发纹理,超高清`
- **Translation**: An orange tabby cat dozing on a wooden desk piled with old books, afternoon sunlight streaming through the curtains, warm tones, film style, fine fur texture, ultra HD

<div align="center"><img src="test_outputs/cn_scene_1024x1024.png" width="512" /></div>

## Benchmark & Performance

Test environment:
- GPU: NVIDIA GeForce RTX 5090
- Framework: PyTorch 2.10.0+cu130
- Inference setting: 1024×1024, 50 steps, guidance=4.0, CPU offload enabled
- Cases: 5 prompts (`portrait_01`, `portrait_02`, `landscape_01`, `scene_01`, `night_01`)

### Aggregate Comparison (Baseline vs INT8)

| Metric | Baseline | INT8 | Delta |
|---|---:|---:|---:|
| Avg elapsed / image (s) | 49.0282 | 46.7867 | **-4.6%** |
| Avg sec / step | 0.980564 | 0.935733 | **-4.6%** |
| Avg peak CUDA alloc (GB) | 12.5195 | 7.7470 | **-38.1%** |


> Results may vary across hardware, drivers, and PyTorch/CUDA versions.

### Per-Case Results

| Case | Baseline (s) | INT8 (s) | Speedup |
|---|---:|---:|---:|
| portrait_01 | 56.9943 | 50.1124 | 1.14x |
| portrait_02 | 50.3810 | 46.0371 | 1.09x |
| landscape_01 | 46.0286 | 46.0192 | 1.00x |
| scene_01 | 45.9097 | 45.8291 | 1.00x |
| night_01 | 45.8275 | 45.9356 | 1.00x |
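The reported deltas and per-case speedups can be recomputed directly from the raw measurements above:

```python
# Aggregate: relative change in average time per image (INT8 vs. baseline).
baseline_avg, int8_avg = 49.0282, 46.7867
print(f"avg time delta: {(int8_avg - baseline_avg) / baseline_avg:+.1%}")  # -4.6%

# Per-case speedup = baseline seconds / INT8 seconds.
cases = {
    "portrait_01": (56.9943, 50.1124),
    "portrait_02": (50.3810, 46.0371),
    "landscape_01": (46.0286, 46.0192),
    "scene_01": (45.9097, 45.8291),
    "night_01": (45.8275, 45.9356),
}
for name, (base_s, int8_s) in cases.items():
    print(f"{name}: {base_s / int8_s:.2f}x")
```

Note that the night_01 case is marginally slower under INT8 (0.998x), which rounds to 1.00x in the table; the aggregate gain comes mostly from the two portrait cases.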

## Visual Comparison (Baseline vs INT8)

Left: Baseline. Right: INT8. (Same prompt/seed/steps.)

| Case | Base | INT8 |
|---|---|---|
| portrait_01 | ![](zimage_quanto_bench_results/images/baseline/portrait_01_seed46.png) | ![](zimage_quanto_bench_results/images/int8/portrait_01_seed46.png) |
| portrait_02 | ![](zimage_quanto_bench_results/images/baseline/portrait_02_seed111.png) | ![](zimage_quanto_bench_results/images/int8/portrait_02_seed111.png) |
| landscape_01 | ![](zimage_quanto_bench_results/images/baseline/landscape_01_seed123.png) | ![](zimage_quanto_bench_results/images/int8/landscape_01_seed123.png) |
| scene_01 | ![](zimage_quanto_bench_results/images/baseline/scene_01_seed777.png) | ![](zimage_quanto_bench_results/images/int8/scene_01_seed777.png) |
| night_01 | ![](zimage_quanto_bench_results/images/baseline/night_01_seed2026.png) | ![](zimage_quanto_bench_results/images/int8/night_01_seed2026.png) |

## Limitations

- This is **weight-only INT8** quantization; activation precision is unchanged.
- Minor visual differences may appear on some prompts.
- `enable_model_cpu_offload()` can change latency distribution across pipeline stages.
- For extreme resolutions / very long step counts, validate quality and stability first.

## Intended Use

Recommended for:
- Running Z-Image with lower VRAM usage.
- Improving throughput while keeping quality close to baseline.

Not recommended as-is for:
- Safety-critical decision workflows.
- High-risk generation use cases without additional review/guardrails.

## Citation

If you use this model, please cite/reference the upstream model and toolchain:
- Tongyi-MAI/Z-Image
- Hugging Face Diffusers
- optimum-quanto