---
language:
- en
license: other
library_name: diffusers
pipeline_tag: text-to-image
tags:
- text-to-image
- diffusers
- quanto
- int8
- z-image
- transformer-quantization
base_model:
- Tongyi-MAI/Z-Image
base_model_relation: quantized
---
# Z-Image INT8 (Quanto)

This repository provides an INT8-quantized variant of Tongyi-MAI/Z-Image:

- Only the `transformer` is quantized with Quanto weight-only INT8; the `text_encoder`, `vae`, `scheduler`, and `tokenizer` remain unchanged.
- The inference API stays compatible with `diffusers.ZImagePipeline`.

Please follow the original upstream model license and usage terms. `license: other` means this repo inherits upstream licensing constraints.
## Model Details

- Base model: `Tongyi-MAI/Z-Image`
- Quantization method: `optimum-quanto` (weight-only INT8)
- Quantized component: `transformer`
- Compute dtype: `bfloat16`
- Pipeline: `diffusers.ZImagePipeline`
- Negative prompt support: yes (same pipeline API as the base model)
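Weight-only INT8 means each `transformer` weight matrix is stored as 8-bit integers plus scales and expanded back to the compute dtype at inference time. The NumPy sketch below illustrates the general idea with per-output-channel symmetric quantization; it is an illustration of the technique, not Quanto's actual implementation.

```python
import numpy as np

def quantize_weight_int8(w: np.ndarray):
    """Per-output-channel symmetric INT8 quantization (illustrative only)."""
    # One scale per output row, chosen so the largest |value| maps to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # At inference the int8 weights are expanded back to the compute
    # dtype (bfloat16 in this repo; float32 here for NumPy).
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_weight_int8(w)
w_hat = dequantize(q, scale)

# Rounding error per element is bounded by half the row's scale.
print("max abs error:", np.abs(w - w_hat).max())
```

Storage drops to one byte per weight plus one scale per row, which is where the VRAM savings in the benchmark section come from.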
## Platform Support

- ✅ Supported: Linux/Windows with NVIDIA CUDA
- ⚠️ Limited support: macOS Apple Silicon (MPS, usually much slower than CUDA)
- ❌ Not supported: macOS Intel
## Files

Key files in this repository:

- `model_index.json`
- `transformer/diffusion_pytorch_model.safetensors` (INT8-quantized weights)
- `text_encoder/*`, `vae/*`, `scheduler/*`, `tokenizer/*` (not quantized)
- `zimage_quanto_bench_results/*` (benchmark metrics and baseline-vs-INT8 comparison images)
- `test_outputs/*` (generated examples)
## Installation

Python 3.10+ is recommended.

```bash
# Create a virtual environment (optional)
python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux/macOS
# source .venv/bin/activate

python -m pip install --upgrade pip

# PyTorch (NVIDIA CUDA, example)
pip install torch --index-url https://download.pytorch.org/whl/cu128
# PyTorch (macOS Apple Silicon, MPS)
# pip install torch

# Inference dependencies
pip install diffusers transformers accelerate safetensors sentencepiece optimum-quanto pillow
```
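After installing, a quick sanity check that every import the Quick Start needs can be found will save a confusing traceback later. This is an optional helper, not part of the repo; note that some pip names differ from their import names (`pillow` imports as `PIL`, `optimum-quanto` as `optimum.quanto`).

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if the module can be located without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when the parent of a dotted name (e.g. "optimum") is missing.
        return False

# Import names corresponding to the pip packages installed above.
REQUIRED = ["torch", "diffusers", "transformers", "accelerate",
            "safetensors", "sentencepiece", "optimum.quanto", "PIL"]

missing = [name for name in REQUIRED if not is_installed(name)]
print("Missing:", ", ".join(missing) if missing else "none")
```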
## Quick Start (Diffusers)

This repo already stores quantized weights, so you do not need to re-run quantization during loading.

```python
import torch
from diffusers import ZImagePipeline

model_id = "ixim/Z-Image-INT8"

if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16
elif torch.backends.mps.is_available():
    # Apple Silicon
    device = "mps"
    dtype = torch.bfloat16
else:
    # CPU fallback (functional but very slow for this model)
    device = "cpu"
    dtype = torch.float32

pipe = ZImagePipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
)
pipe.enable_attention_slicing()

if device == "cuda":
    pipe.enable_model_cpu_offload()
else:
    pipe = pipe.to(device)

prompt = "A cinematic portrait of a young woman, soft lighting, high detail"
negative_prompt = "blurry, sad, low quality, distorted face, extra limbs, artifacts"

# Use a CPU generator for best cross-device compatibility (cpu/mps/cuda)
generator = torch.Generator(device="cpu").manual_seed(42)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=4.0,
    generator=generator,
).images[0]

image.save("zimage_int8_sample.png")
print("Saved: zimage_int8_sample.png")
```
## macOS Notes & Troubleshooting

- macOS Intel is no longer supported for this model in this repository.
- If you need macOS inference, use Apple Silicon (`mps`) only.
- On Apple Silicon, warnings like `CUDA not available` and `Disabling autocast` are expected on non-CUDA execution paths.
- Slower generation on a Mac is expected compared with high-end NVIDIA GPUs. To improve speed on Apple Silicon:
  - Ensure the script uses `mps` (as in the example above), not `cpu`.
  - Start from `height=512, width=512` and fewer steps (e.g., 20-28) before scaling up.
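As a rough rule of thumb for the tip above, per-image latency scales approximately linearly with step count and with pixel count (attention cost grows faster than linearly at high resolutions, so real timings are usually somewhat worse). A hypothetical helper for back-of-the-envelope estimates, not a measured model:

```python
def relative_cost(steps: int, height: int, width: int,
                  base_steps: int = 28, base_hw: int = 1024) -> float:
    """Rough first-order latency estimate relative to 1024x1024 @ 28 steps.

    Assumes per-step cost scales ~linearly with pixel count; real scaling
    is worse at high resolutions because attention is superlinear.
    """
    return (steps / base_steps) * (height * width) / (base_hw * base_hw)

# 512x512 at 20 steps vs. the 1024x1024 / 28-step default:
print(f"{relative_cost(20, 512, 512):.3f}")  # 0.179 -> roughly 5-6x faster
```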
## Additional Generated Samples (INT8)

These two images were generated with this quantized model:

1) `en_portrait_1024x1024.png`
   - Prompt: *A cinematic portrait of a young woman standing by the window, golden hour sunlight, shallow depth of field, film grain, ultra-detailed skin texture, photorealistic*

2) `cn_scene_1024x1024.png`
   - Prompt: *一只橘猫趴在堆满旧书的木桌上打盹,午后阳光透过窗帘洒进来,暖色调,胶片风格,细腻毛发纹理,超高清* (an orange cat dozing on a wooden desk piled with old books, afternoon sunlight falling through the curtains, warm tones, film style, fine fur texture, ultra-high definition)
## Benchmark & Performance

Test environment:

- GPU: NVIDIA GeForce RTX 5090
- Framework: PyTorch 2.10.0+cu130
- Inference settings: 1024×1024, 50 steps, guidance=4.0, CPU offload enabled
- Cases: 5 prompts (`portrait_01`, `portrait_02`, `landscape_01`, `scene_01`, `night_01`)
### Aggregate Comparison (Baseline vs INT8)

| Metric | Baseline | INT8 | Delta |
|---|---|---|---|
| Avg elapsed / image (s) | 49.0282 | 46.7867 | -4.6% |
| Avg sec / step | 0.980564 | 0.935733 | -4.6% |
| Avg peak CUDA alloc (GB) | 12.5195 | 7.7470 | -38.1% |
Results may vary across hardware, drivers, and PyTorch/CUDA versions.
### Per-Case Results

| Case | Baseline (s) | INT8 (s) | Speedup |
|---|---|---|---|
| portrait_01 | 56.9943 | 50.1124 | 1.14x |
| portrait_02 | 50.3810 | 46.0371 | 1.09x |
| landscape_01 | 46.0286 | 46.0192 | 1.00x |
| scene_01 | 45.9097 | 45.8291 | 1.00x |
| night_01 | 45.8275 | 45.9356 | 1.00x |
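The aggregate numbers follow directly from the per-case timings. The snippet below re-derives the averages, the elapsed-time delta, and the per-case speedups from the figures in the tables above.

```python
# Per-case elapsed times (s) from the Per-Case Results table.
baseline = {"portrait_01": 56.9943, "portrait_02": 50.3810,
            "landscape_01": 46.0286, "scene_01": 45.9097, "night_01": 45.8275}
int8 = {"portrait_01": 50.1124, "portrait_02": 46.0371,
        "landscape_01": 46.0192, "scene_01": 45.8291, "night_01": 45.9356}

avg_base = sum(baseline.values()) / len(baseline)
avg_int8 = sum(int8.values()) / len(int8)
delta = (avg_int8 - avg_base) / avg_base * 100

print(f"avg baseline: {avg_base:.4f}s, avg int8: {avg_int8:.4f}s")
print(f"delta: {delta:.1f}%")  # -4.6%, matching the aggregate table

for case in baseline:
    print(f"{case}: {baseline[case] / int8[case]:.2f}x speedup")
```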
### Visual Comparison (Baseline vs INT8)

Left: baseline; right: INT8 (same prompt, seed, and step count).
## Limitations

- This is weight-only INT8 quantization; activation precision is unchanged.
- Minor visual differences may appear on some prompts.
- `enable_model_cpu_offload()` can change how latency is distributed across pipeline stages.
- For extreme resolutions or very long step counts, validate quality and stability first.
## Intended Use

Recommended for:
- Running Z-Image with lower VRAM usage.
- Improving throughput while keeping quality close to baseline.
Not recommended as-is for:
- Safety-critical decision workflows.
- High-risk generation use cases without additional review/guardrails.
## Citation

If you use this model, please cite/reference the upstream model and toolchain:
- Tongyi-MAI/Z-Image
- Hugging Face Diffusers
- optimum-quanto