---
language:
- en
license: other
library_name: diffusers
pipeline_tag: text-to-image
tags:
- text-to-image
- diffusers
- quanto
- int8
- z-image
- transformer-quantization
base_model:
- Tongyi-MAI/Z-Image
base_model_relation: quantized
---

# Z-Image INT8 (Quanto)

This repository provides an INT8-quantized variant of [Tongyi-MAI/Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image):
- **Only** the `transformer` is quantized with **Quanto weight-only INT8**.
- `text_encoder`, `vae`, `scheduler`, and `tokenizer` remain unchanged.
- Inference API stays compatible with `diffusers.ZImagePipeline`.

> Please follow the original upstream model license and usage terms. `license: other` means this repo inherits upstream licensing constraints.

## Model Details

- **Base model**: `Tongyi-MAI/Z-Image`
- **Quantization method**: `optimum-quanto` (weight-only INT8)
- **Quantized part**: `transformer`
- **Compute dtype**: `bfloat16`
- **Pipeline**: `diffusers.ZImagePipeline`
- **Negative prompt support**: Yes (same pipeline API as the base model)

## Platform Support

- ✅ Supported: Linux/Windows with NVIDIA CUDA
- ⚠️ Limited support: macOS Apple Silicon (MPS, usually much slower than CUDA)
- ❌ Not supported: macOS Intel

## Files

Key files in this repository:
- `model_index.json`
- `transformer/diffusion_pytorch_model.safetensors` (INT8-quantized weights)
- `text_encoder/*`, `vae/*`, `scheduler/*`, `tokenizer/*` (not quantized)
- `zimage_quanto_bench_results/*` (benchmark metrics and baseline-vs-int8 images)
- `test_outputs/*` (generated examples)

## Installation

Python 3.10+ is recommended.

```bash
# Create env (optional)
python -m venv .venv

# Windows
.venv\Scripts\activate

# Linux/macOS
# source .venv/bin/activate

python -m pip install --upgrade pip

# PyTorch (NVIDIA CUDA, example)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# PyTorch (macOS Apple Silicon, MPS)
# pip install torch

# Inference dependencies
pip install diffusers transformers accelerate safetensors sentencepiece optimum-quanto pillow
```

## Quick Start (Diffusers)

This repo already stores quantized weights, so you do **not** need to re-run quantization during loading.

```python
import torch

from diffusers import ZImagePipeline

model_id = "ixim/Z-Image-INT8"

if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16
elif torch.backends.mps.is_available():
    # Apple Silicon
    device = "mps"
    dtype = torch.bfloat16
else:
    # CPU fallback (functional but very slow for this model)
    device = "cpu"
    dtype = torch.float32

pipe = ZImagePipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
)

pipe.enable_attention_slicing()

if device == "cuda":
    pipe.enable_model_cpu_offload()
else:
    pipe = pipe.to(device)

prompt = "A cinematic portrait of a young woman, soft lighting, high detail"
negative_prompt = "blurry, sad, low quality, distorted face, extra limbs, artifacts"
# Use CPU generator for best cross-device compatibility (cpu/mps/cuda)
generator = torch.Generator(device="cpu").manual_seed(42)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=4.0,
    generator=generator,
).images[0]

image.save("zimage_int8_sample.png")
print("Saved: zimage_int8_sample.png")
```

## macOS Notes & Troubleshooting

- macOS Intel is no longer supported for this model in this repository.
- If you need macOS inference, use Apple Silicon (`mps`) only.
- On Apple Silicon, warnings like `CUDA not available` and `Disabling autocast` are expected in non-CUDA execution paths.
- Slower generation on Mac is expected compared with high-end NVIDIA GPUs. To improve speed on Apple Silicon:
    - Ensure the script uses `mps` (as in the example above), not `cpu`.
    - Start with `height=512`, `width=512`, and fewer steps (e.g., `20`-`28`) before scaling up.

## Additional Generated Samples (INT8)

The following two images were generated with this quantized model:

### 1) `en_portrait_1024x1024.png`

- **Prompt**: `A cinematic portrait of a young woman standing by the window, golden hour sunlight, shallow depth of field, film grain, ultra-detailed skin texture, photorealistic`

<div align="center"><img src="test_outputs/en_portrait_1024x1024.png" width="512" /></div>

### 2) `cn_scene_1024x1024.png`

- **Prompt**: `一只橘猫趴在堆满旧书的木桌上打盹,午后阳光透过窗帘洒进来,暖色调,胶片风格,细腻毛发纹理,超高清`
- **Translation**: An orange tabby cat dozing on a wooden desk piled with old books, afternoon sunlight streaming through the curtains, warm tones, film style, fine fur texture, ultra HD

<div align="center"><img src="test_outputs/cn_scene_1024x1024.png" width="512" /></div>

## Benchmark & Performance

Test environment:
- GPU: NVIDIA GeForce RTX 5090
- Framework: PyTorch 2.10.0+cu130
- Inference setting: 1024×1024, 50 steps, guidance=4.0, CPU offload enabled
- Cases: 5 prompts (`portrait_01`, `portrait_02`, `landscape_01`, `scene_01`, `night_01`)

### Aggregate Comparison (Baseline vs INT8)

| Metric | Baseline | INT8 | Delta |
|---|---:|---:|---:|
| Avg elapsed / image (s) | 49.0282 | 46.7867 | **-4.6%** |
| Avg sec / step | 0.980564 | 0.935733 | **-4.6%** |
| Avg peak CUDA alloc (GB) | 12.5195 | 7.7470 | **-38.1%** |


> Results may vary across hardware, drivers, and PyTorch/CUDA versions.

### Per-Case Results

| Case | Baseline (s) | INT8 (s) | Speedup |
|---|---:|---:|---:|
| portrait_01 | 56.9943 | 50.1124 | 1.14x |
| portrait_02 | 50.3810 | 46.0371 | 1.09x |
| landscape_01 | 46.0286 | 46.0192 | 1.00x |
| scene_01 | 45.9097 | 45.8291 | 1.00x |
| night_01 | 45.8275 | 45.9356 | 1.00x |
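The reported deltas and per-case speedups can be recomputed directly from the raw measurements above:

```python
# Aggregate: relative change in average time per image (INT8 vs. baseline).
baseline_avg, int8_avg = 49.0282, 46.7867
print(f"avg time delta: {(int8_avg - baseline_avg) / baseline_avg:+.1%}")  # -4.6%

# Per-case speedup = baseline seconds / INT8 seconds.
cases = {
    "portrait_01": (56.9943, 50.1124),
    "portrait_02": (50.3810, 46.0371),
    "landscape_01": (46.0286, 46.0192),
    "scene_01": (45.9097, 45.8291),
    "night_01": (45.8275, 45.9356),
}
for name, (base_s, int8_s) in cases.items():
    print(f"{name}: {base_s / int8_s:.2f}x")
```

Note that the night_01 case is marginally slower under INT8 (0.998x), which rounds to 1.00x in the table; the aggregate gain comes mostly from the two portrait cases.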

## Visual Comparison (Baseline vs INT8)

Left: Baseline. Right: INT8. (Same prompt/seed/steps.)

| Case | Base | INT8 |
|---|---|---|
| portrait_01 | ![](zimage_quanto_bench_results/images/baseline/portrait_01_seed46.png) | ![](zimage_quanto_bench_results/images/int8/portrait_01_seed46.png) |
| portrait_02 | ![](zimage_quanto_bench_results/images/baseline/portrait_02_seed111.png) | ![](zimage_quanto_bench_results/images/int8/portrait_02_seed111.png) |
| landscape_01 | ![](zimage_quanto_bench_results/images/baseline/landscape_01_seed123.png) | ![](zimage_quanto_bench_results/images/int8/landscape_01_seed123.png) |
| scene_01 | ![](zimage_quanto_bench_results/images/baseline/scene_01_seed777.png) | ![](zimage_quanto_bench_results/images/int8/scene_01_seed777.png) |
| night_01 | ![](zimage_quanto_bench_results/images/baseline/night_01_seed2026.png) | ![](zimage_quanto_bench_results/images/int8/night_01_seed2026.png) |

## Limitations

- This is **weight-only INT8** quantization; activation precision is unchanged.
- Minor visual differences may appear on some prompts.
- `enable_model_cpu_offload()` can change latency distribution across pipeline stages.
- For extreme resolutions / very long step counts, validate quality and stability first.

## Intended Use

Recommended for:
- Running Z-Image with lower VRAM usage.
- Improving throughput while keeping quality close to baseline.

Not recommended as-is for:
- Safety-critical decision workflows.
- High-risk generation use cases without additional review/guardrails.

## Citation

If you use this model, please cite/reference the upstream model and toolchain:
- Tongyi-MAI/Z-Image
- Hugging Face Diffusers
- optimum-quanto