---
language:
- en
license: other
library_name: diffusers
pipeline_tag: text-to-image
tags:
- text-to-image
- diffusers
- quanto
- int8
- z-image
- transformer-quantization
base_model:
- Tongyi-MAI/Z-Image
base_model_relation: quantized
---
# Z-Image INT8 (Quanto)
This repository provides an INT8-quantized variant of [Tongyi-MAI/Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image):
- **Only** the `transformer` is quantized with **Quanto weight-only INT8**.
- `text_encoder`, `vae`, `scheduler`, and `tokenizer` remain unchanged.
- Inference API stays compatible with `diffusers.ZImagePipeline`.
> Please follow the original upstream model license and usage terms. `license: other` means this repo inherits upstream licensing constraints.
## Model Details
- **Base model**: `Tongyi-MAI/Z-Image`
- **Quantization method**: `optimum-quanto` (weight-only INT8)
- **Quantized part**: `transformer`
- **Compute dtype**: `bfloat16`
- **Pipeline**: `diffusers.ZImagePipeline`
- **Negative prompt support**: Yes (same pipeline API as the base model)
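For reference, the quantization step presumably resembled the sketch below (an assumption: the exact export script is not included in this repository). It uses the `quantize`/`freeze` API from `optimum-quanto`; the heavy imports are kept inside the function so the sketch stays lightweight.

```python
def quantize_zimage_transformer(pipe):
    """Quantize only the pipeline's transformer with Quanto weight-only INT8.

    Sketch only: mirrors what this model card describes, not a verified
    export script.
    """
    # Imports kept local so the helper can be defined without torch installed.
    from optimum.quanto import quantize, freeze, qint8

    quantize(pipe.transformer, weights=qint8)  # weight-only INT8
    freeze(pipe.transformer)                   # materialize int8 weights in place
    return pipe

# Usage (requires torch + diffusers + optimum-quanto):
#   pipe = ZImagePipeline.from_pretrained("Tongyi-MAI/Z-Image",
#                                         torch_dtype=torch.bfloat16)
#   pipe = quantize_zimage_transformer(pipe)
```

Note that `text_encoder`, `vae`, `scheduler`, and `tokenizer` are left untouched, matching the description above.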
## Platform Support
- ✅ Supported: Linux/Windows with NVIDIA CUDA
- ⚠️ Limited support: macOS Apple Silicon (MPS, usually much slower than CUDA)
- ❌ Not supported: macOS Intel
## Files
Key files in this repository:
- `model_index.json`
- `transformer/diffusion_pytorch_model.safetensors` (INT8-quantized weights)
- `text_encoder/*`, `vae/*`, `scheduler/*`, `tokenizer/*` (not quantized)
- `zimage_quanto_bench_results/*` (benchmark metrics and baseline-vs-int8 images)
- `test_outputs/*` (generated examples)
## Installation
Python 3.10+ is recommended.
```bash
# Create env (optional)
python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux/macOS
# source .venv/bin/activate
python -m pip install --upgrade pip
# PyTorch (NVIDIA CUDA, example)
pip install torch --index-url https://download.pytorch.org/whl/cu128
# PyTorch (macOS Apple Silicon, MPS)
# pip install torch
# Inference dependencies
pip install diffusers transformers accelerate safetensors sentencepiece optimum-quanto pillow
```
## Quick Start (Diffusers)
This repo already stores quantized weights, so you do **not** need to re-run quantization during loading.
```python
import torch
from diffusers import ZImagePipeline

model_id = "ixim/Z-Image-INT8"

if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16
elif torch.backends.mps.is_available():
    # Apple Silicon
    device = "mps"
    dtype = torch.bfloat16
else:
    # CPU fallback (functional but very slow for this model)
    device = "cpu"
    dtype = torch.float32

pipe = ZImagePipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
)
pipe.enable_attention_slicing()

if device == "cuda":
    pipe.enable_model_cpu_offload()
else:
    pipe = pipe.to(device)

prompt = "A cinematic portrait of a young woman, soft lighting, high detail"
negative_prompt = "blurry, sad, low quality, distorted face, extra limbs, artifacts"

# Use a CPU generator for best cross-device reproducibility (cpu/mps/cuda)
generator = torch.Generator(device="cpu").manual_seed(42)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=4.0,
    generator=generator,
).images[0]
image.save("zimage_int8_sample.png")
print("Saved: zimage_int8_sample.png")
```
## macOS Notes & Troubleshooting
- macOS Intel is no longer supported for this model in this repository.
- If you need macOS inference, use Apple Silicon (`mps`) only.
- On Apple Silicon, warnings like `CUDA not available` and `Disabling autocast` are expected in non-CUDA execution paths.
- Slower speeds on macOS are expected compared with high-end NVIDIA GPUs. To improve speed on Apple Silicon:
  - Ensure the script uses `mps` (as in the example above), not `cpu`.
  - Start from `height=512`, `width=512`, and fewer steps (e.g., 20–28) before scaling up.
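The tuning advice above can be sketched as a small helper that picks conservative starting settings per device. This is purely illustrative (the function and its exact values are not part of the repository):

```python
def generation_defaults(device: str) -> dict:
    """Return hypothetical starting settings for a given torch device string."""
    if device == "cuda":
        return {"height": 1024, "width": 1024, "num_inference_steps": 28}
    # Apple Silicon (mps) and CPU: start small, then scale up if quality
    # and speed allow.
    return {"height": 512, "width": 512, "num_inference_steps": 20}

print(generation_defaults("mps"))
# e.g. pipe(prompt=prompt, **generation_defaults(device), generator=generator)
```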
## Additional Generated Samples (INT8)
These two images are generated with this quantized model:
### 1) `en_portrait_1024x1024.png`
- **Prompt**: `A cinematic portrait of a young woman standing by the window, golden hour sunlight, shallow depth of field, film grain, ultra-detailed skin texture, photorealistic`
<div align="center"><img src="test_outputs/en_portrait_1024x1024.png" width="512" /></div>
### 2) `cn_scene_1024x1024.png`
- **Prompt**: `一只橘猫趴在堆满旧书的木桌上打盹,午后阳光透过窗帘洒进来,暖色调,胶片风格,细腻毛发纹理,超高清` (English: an orange cat dozing on a wooden desk piled with old books, afternoon sunlight streaming through the curtains, warm tones, film style, fine fur texture, ultra high definition)
<div align="center"><img src="test_outputs/cn_scene_1024x1024.png" width="512" /></div>
## Benchmark & Performance
Test environment:
- GPU: NVIDIA GeForce RTX 5090
- Framework: PyTorch 2.10.0+cu130
- Inference setting: 1024×1024, 50 steps, guidance=4.0, CPU offload enabled
- Cases: 5 prompts (`portrait_01`, `portrait_02`, `landscape_01`, `scene_01`, `night_01`)
### Aggregate Comparison (Baseline vs INT8)
| Metric | Baseline | INT8 | Delta |
|---|---:|---:|---:|
| Avg elapsed / image (s) | 49.0282 | 46.7867 | **-4.6%** |
| Avg sec / step | 0.980564 | 0.935733 | **-4.6%** |
| Avg peak CUDA alloc (GB) | 12.5195 | 7.7470 | **-38.1%** |
> Results may vary across hardware, drivers, and PyTorch/CUDA versions.
### Per-Case Results
| Case | Baseline (s) | INT8 (s) | Speedup |
|---|---:|---:|---:|
| portrait_01 | 56.9943 | 50.1124 | 1.14x |
| portrait_02 | 50.3810 | 46.0371 | 1.09x |
| landscape_01 | 46.0286 | 46.0192 | 1.00x |
| scene_01 | 45.9097 | 45.8291 | 1.00x |
| night_01 | 45.8275 | 45.9356 | 1.00x |
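The deltas and speedups in the two tables above can be recomputed directly from the raw numbers (pure arithmetic, no model inference involved):

```python
# Aggregate deltas (values from the tables above)
baseline_avg_s, int8_avg_s = 49.0282, 46.7867
baseline_mem_gb, int8_mem_gb = 12.5195, 7.7470

time_delta = (int8_avg_s - baseline_avg_s) / baseline_avg_s * 100
mem_delta = (int8_mem_gb - baseline_mem_gb) / baseline_mem_gb * 100
print(f"time delta:   {time_delta:.1f}%")  # -4.6%
print(f"memory delta: {mem_delta:.1f}%")   # -38.1%

# Per-case speedups: baseline seconds / INT8 seconds
cases = {
    "portrait_01": (56.9943, 50.1124),
    "portrait_02": (50.3810, 46.0371),
    "landscape_01": (46.0286, 46.0192),
    "scene_01": (45.9097, 45.8291),
    "night_01": (45.8275, 45.9356),
}
for name, (base_s, int8_s) in cases.items():
    print(f"{name}: {base_s / int8_s:.2f}x")
```

As the per-case numbers show, the wall-clock speedup is modest and concentrated in the first runs; the main benefit of the INT8 transformer is the roughly 38% lower peak CUDA allocation.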
## Visual Comparison (Baseline vs INT8)
Left: Baseline. Right: INT8. (Same prompt/seed/steps.)
| Case | Base | INT8 |
|---|---|---|
| portrait_01 | ![](zimage_quanto_bench_results/images/baseline/portrait_01_seed46.png) | ![](zimage_quanto_bench_results/images/int8/portrait_01_seed46.png) |
| portrait_02 | ![](zimage_quanto_bench_results/images/baseline/portrait_02_seed111.png) | ![](zimage_quanto_bench_results/images/int8/portrait_02_seed111.png) |
| landscape_01 | ![](zimage_quanto_bench_results/images/baseline/landscape_01_seed123.png) | ![](zimage_quanto_bench_results/images/int8/landscape_01_seed123.png) |
| scene_01 | ![](zimage_quanto_bench_results/images/baseline/scene_01_seed777.png) | ![](zimage_quanto_bench_results/images/int8/scene_01_seed777.png) |
| night_01 | ![](zimage_quanto_bench_results/images/baseline/night_01_seed2026.png) | ![](zimage_quanto_bench_results/images/int8/night_01_seed2026.png) |
## Limitations
- This is **weight-only INT8** quantization; activation precision is unchanged.
- Minor visual differences may appear on some prompts.
- `enable_model_cpu_offload()` can change latency distribution across pipeline stages.
- For extreme resolutions / very long step counts, validate quality and stability first.
## Intended Use
Recommended for:
- Running Z-Image with lower VRAM usage.
- Improving throughput while keeping quality close to baseline.
Not recommended as-is for:
- Safety-critical decision workflows.
- High-risk generation use cases without additional review/guardrails.
## Citation
If you use this model, please cite/reference the upstream model and toolchain:
- Tongyi-MAI/Z-Image
- Hugging Face Diffusers
- optimum-quanto