---
language:
- en
license: other
library_name: diffusers
pipeline_tag: text-to-image
tags:
- text-to-image
- diffusers
- quanto
- int8
- z-image
- transformer-quantization
base_model:
- Tongyi-MAI/Z-Image
base_model_relation: quantized
---
# Z-Image INT8 (Quanto)
This repository provides an INT8-quantized variant of [Tongyi-MAI/Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image):
- **Only** the `transformer` is quantized with **Quanto weight-only INT8**.
- `text_encoder`, `vae`, `scheduler`, and `tokenizer` remain unchanged.
- Inference API stays compatible with `diffusers.ZImagePipeline`.
> Please follow the original upstream model license and usage terms. `license: other` means this repo inherits upstream licensing constraints.
## Model Details
- **Base model**: `Tongyi-MAI/Z-Image`
- **Quantization method**: `optimum-quanto` (weight-only INT8)
- **Quantized part**: `transformer`
- **Compute dtype**: `bfloat16`
- **Pipeline**: `diffusers.ZImagePipeline`
- **Negative prompt support**: Yes (same pipeline API as the base model)
## Platform Support
- ✅ Supported: Linux/Windows with NVIDIA CUDA
- ⚠️ Limited support: macOS Apple Silicon (MPS, usually much slower than CUDA)
- ❌ Not supported: macOS Intel
## Files
Key files in this repository:
- `model_index.json`
- `transformer/diffusion_pytorch_model.safetensors` (INT8-quantized weights)
- `text_encoder/*`, `vae/*`, `scheduler/*`, `tokenizer/*` (not quantized)
- `zimage_quanto_bench_results/*` (benchmark metrics and baseline-vs-int8 images)
- `test_outputs/*` (generated examples)
## Installation
Python 3.10+ is recommended.
```bash
# Create env (optional)
python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux/macOS
# source .venv/bin/activate
python -m pip install --upgrade pip
# PyTorch (NVIDIA CUDA, example)
pip install torch --index-url https://download.pytorch.org/whl/cu128
# PyTorch (macOS Apple Silicon, MPS)
# pip install torch
# Inference dependencies
pip install diffusers transformers accelerate safetensors sentencepiece optimum-quanto pillow
```
## Quick Start (Diffusers)
This repo already stores quantized weights, so you do **not** need to re-run quantization during loading.
```python
import torch
from diffusers import ZImagePipeline
model_id = "ixim/Z-Image-INT8"
if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16
elif torch.backends.mps.is_available():
    # Apple Silicon
    device = "mps"
    dtype = torch.bfloat16
else:
    # CPU fallback (functional but very slow for this model)
    device = "cpu"
    dtype = torch.float32

pipe = ZImagePipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
)
pipe.enable_attention_slicing()
if device == "cuda":
    pipe.enable_model_cpu_offload()
else:
    pipe = pipe.to(device)

prompt = "A cinematic portrait of a young woman, soft lighting, high detail"
negative_prompt = "blurry, sad, low quality, distorted face, extra limbs, artifacts"

# Use CPU generator for best cross-device compatibility (cpu/mps/cuda)
generator = torch.Generator(device="cpu").manual_seed(42)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=4.0,
    generator=generator,
).images[0]
image.save("zimage_int8_sample.png")
print("Saved: zimage_int8_sample.png")
```
## macOS Notes & Troubleshooting
- macOS Intel is no longer supported for this model in this repository.
- If you need macOS inference, use Apple Silicon (`mps`) only.
- On Apple Silicon, warnings like `CUDA not available` and `Disabling autocast` are expected in non-CUDA execution paths.
- Inference on a Mac is expected to be slower than on high-end NVIDIA GPUs. To improve speed on Apple Silicon:
  - Ensure the script uses `mps` (as in the example above), not `cpu`.
  - Start from `height=512`, `width=512` and fewer steps (e.g., 20-28) before scaling up.
## Additional Generated Samples (INT8)
The two images below were generated with this quantized model:
### 1) `en_portrait_1024x1024.png`
- **Prompt**: `A cinematic portrait of a young woman standing by the window, golden hour sunlight, shallow depth of field, film grain, ultra-detailed skin texture, photorealistic`
<div align="center"><img src="test_outputs/en_portrait_1024x1024.png" width="512" /></div>
### 2) `cn_scene_1024x1024.png`
- **Prompt**: `一只橘猫趴在堆满旧书的木桌上打盹,午后阳光透过窗帘洒进来,暖色调,胶片风格,细腻毛发纹理,超高清` (English: an orange cat dozing on a wooden desk piled with old books, afternoon sunlight streaming through the curtains, warm tones, film style, fine fur texture, ultra high definition)
<div align="center"><img src="test_outputs/cn_scene_1024x1024.png" width="512" /></div>
## Benchmark & Performance
Test environment:
- GPU: NVIDIA GeForce RTX 5090
- Framework: PyTorch 2.10.0+cu130
- Inference setting: 1024×1024, 50 steps, guidance=4.0, CPU offload enabled
- Cases: 5 prompts (`portrait_01`, `portrait_02`, `landscape_01`, `scene_01`, `night_01`)
### Aggregate Comparison (Baseline vs INT8)
| Metric | Baseline | INT8 | Delta |
|---|---:|---:|---:|
| Avg elapsed / image (s) | 49.0282 | 46.7867 | **-4.6%** |
| Avg sec / step | 0.980564 | 0.935733 | **-4.6%** |
| Avg peak CUDA alloc (GB) | 12.5195 | 7.7470 | **-38.1%** |
> Results may vary across hardware, drivers, and PyTorch/CUDA versions.
### Per-Case Results
| Case | Baseline (s) | INT8 (s) | Speedup |
|---|---:|---:|---:|
| portrait_01 | 56.9943 | 50.1124 | 1.14x |
| portrait_02 | 50.3810 | 46.0371 | 1.09x |
| landscape_01 | 46.0286 | 46.0192 | 1.00x |
| scene_01 | 45.9097 | 45.8291 | 1.00x |
| night_01 | 45.8275 | 45.9356 | 1.00x |
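The speedup column is simply baseline time divided by INT8 time, and the aggregate delta is the ratio of mean times. The snippet below recomputes both from the per-case numbers copied from the tables above:

```python
# Per-case wall-clock times (seconds), copied from the table above.
times = {
    "portrait_01":  (56.9943, 50.1124),
    "portrait_02":  (50.3810, 46.0371),
    "landscape_01": (46.0286, 46.0192),
    "scene_01":     (45.9097, 45.8291),
    "night_01":     (45.8275, 45.9356),
}

# Per-case speedup: baseline / INT8.
for case, (baseline, int8) in times.items():
    print(f"{case}: {baseline / int8:.2f}x")

# Aggregate delta: mean INT8 time relative to mean baseline time.
base_avg = sum(b for b, _ in times.values()) / len(times)
int8_avg = sum(q for _, q in times.values()) / len(times)
print(f"avg elapsed delta: {(int8_avg / base_avg - 1) * 100:.1f}%")
```

Note that the averaged -4.6% is driven almost entirely by the two portrait cases; the other three are within noise of 1.00x.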
## Visual Comparison (Baseline vs INT8)
Left: Baseline. Right: INT8. (Same prompt/seed/steps.)
| Case | Base | INT8 |
|---|---|---|
| portrait_01 |  |  |
| portrait_02 |  |  |
| landscape_01 |  |  |
| scene_01 |  |  |
| night_01 |  |  |
## Limitations
- This is **weight-only INT8** quantization; activation precision is unchanged.
- Minor visual differences may appear on some prompts.
- `enable_model_cpu_offload()` can change latency distribution across pipeline stages.
- For extreme resolutions / very long step counts, validate quality and stability first.
## Intended Use
Recommended for:
- Running Z-Image with lower VRAM usage.
- Improving throughput while keeping quality close to baseline.
Not recommended as-is for:
- Safety-critical decision workflows.
- High-risk generation use cases without additional review/guardrails.
## Citation
If you use this model, please cite/reference the upstream model and toolchain:
- Tongyi-MAI/Z-Image
- Hugging Face Diffusers
- optimum-quanto