---
language:
  - en
license: other
library_name: diffusers
pipeline_tag: text-to-image
tags:
  - text-to-image
  - diffusers
  - quanto
  - int8
  - z-image
  - transformer-quantization
base_model:
  - Tongyi-MAI/Z-Image
base_model_relation: quantized
---

# Z-Image INT8 (Quanto)

This repository provides an INT8-quantized variant of Tongyi-MAI/Z-Image:

- Only the `transformer` is quantized, using Quanto weight-only INT8.
- The `text_encoder`, `vae`, `scheduler`, and `tokenizer` remain unchanged.
- The inference API stays compatible with `diffusers.ZImagePipeline`.

Please follow the original upstream model's license and usage terms: `license: other` means this repo inherits the upstream licensing constraints.

## Model Details

- Base model: Tongyi-MAI/Z-Image
- Quantization method: optimum-quanto (weight-only INT8)
- Quantized component: `transformer`
- Compute dtype: bfloat16
- Pipeline: `diffusers.ZImagePipeline`
- Negative prompt support: yes (same pipeline API as the base model)
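
Weight-only INT8 means the transformer's linear weights are stored as 8-bit integers plus a scale, and are dequantized back to the compute dtype at inference time. The following is a minimal NumPy sketch of per-channel symmetric quantization to illustrate the idea; the actual Quanto implementation differs in details such as tensor layout and kernel dispatch:

```python
import numpy as np

def quantize_int8(w):
    """Per-output-channel symmetric INT8 quantization (weight-only sketch)."""
    # One scale per output row, chosen so the largest |value| maps to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # At inference, INT8 weights are rescaled back to the compute dtype.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.dtype, q.nbytes, w.nbytes)     # int8 storage is 4x smaller than float32
print(float(np.abs(w - w_hat).max()))  # small per-weight reconstruction error
```

This is why the quantized `transformer` shrinks its weight storage by roughly 4x versus float32 (2x versus bfloat16) while keeping outputs close to the baseline.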

## Platform Support

- ✅ Supported: Linux/Windows with NVIDIA CUDA
- ⚠️ Limited support: macOS on Apple Silicon (MPS; usually much slower than CUDA)
- ❌ Not supported: macOS on Intel

## Files

Key files in this repository:

- `model_index.json`
- `transformer/diffusion_pytorch_model.safetensors` (INT8-quantized weights)
- `text_encoder/*`, `vae/*`, `scheduler/*`, `tokenizer/*` (not quantized)
- `zimage_quanto_bench_results/*` (benchmark metrics and baseline-vs-INT8 comparison images)
- `test_outputs/*` (generated examples)

## Installation

Python 3.10+ is recommended.

```bash
# Create a virtual environment (optional)
python -m venv .venv

# Windows
.venv\Scripts\activate

# Linux/macOS
# source .venv/bin/activate

python -m pip install --upgrade pip

# PyTorch (NVIDIA CUDA, example index)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# PyTorch (macOS Apple Silicon, MPS)
# pip install torch

# Inference dependencies
pip install diffusers transformers accelerate safetensors sentencepiece optimum-quanto pillow
```

## Quick Start (Diffusers)

This repository already stores quantized weights, so you do not need to re-run quantization at load time.

```python
import torch
from diffusers import ZImagePipeline

model_id = "ixim/Z-Image-INT8"

if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16
elif torch.backends.mps.is_available():
    # Apple Silicon
    device = "mps"
    dtype = torch.bfloat16
else:
    # CPU fallback (functional but very slow for this model)
    device = "cpu"
    dtype = torch.float32

pipe = ZImagePipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
)

pipe.enable_attention_slicing()

if device == "cuda":
    # Offload idle submodules to CPU to reduce peak VRAM usage.
    pipe.enable_model_cpu_offload()
else:
    pipe = pipe.to(device)

prompt = "A cinematic portrait of a young woman, soft lighting, high detail"
negative_prompt = "blurry, sad, low quality, distorted face, extra limbs, artifacts"

# Use a CPU generator for best cross-device reproducibility (cpu/mps/cuda).
generator = torch.Generator(device="cpu").manual_seed(42)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=4.0,
    generator=generator,
).images[0]

image.save("zimage_int8_sample.png")
print("Saved: zimage_int8_sample.png")
```

## macOS Notes & Troubleshooting

- macOS on Intel is no longer supported for this model in this repository.
- For macOS inference, use Apple Silicon (`mps`) only.
- On Apple Silicon, warnings such as "CUDA not available" and "Disabling autocast" are expected on non-CUDA execution paths.
- Slower generation on a Mac is expected compared with high-end NVIDIA GPUs. To improve speed on Apple Silicon:
  - Make sure the script actually selects `mps` (as in the example above), not `cpu`.
  - Start from `height=512, width=512` and fewer steps (e.g., 20-28) before scaling up.
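
As a rough planning aid, total generation time scales approximately linearly with the step count (ignoring fixed costs such as text encoding and VAE decode). A small sketch using the INT8 per-step time from the CUDA benchmark in this README as a reference point; expect MPS per-step times to be substantially higher:

```python
# Per-step time observed for INT8 on an RTX 5090 at 1024x1024 (see the benchmark section).
# On Apple Silicon (MPS), the per-step time will be considerably larger.
sec_per_step = 0.935733

for steps in (20, 28, 50):
    # Linear estimate: ignores fixed encode/decode overhead.
    print(f"{steps:>2} steps ~ {sec_per_step * steps:5.1f} s")
```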

## Additional Generated Samples (INT8)

These two images were generated with this quantized model:

1. `en_portrait_1024x1024.png`
   - Prompt: A cinematic portrait of a young woman standing by the window, golden hour sunlight, shallow depth of field, film grain, ultra-detailed skin texture, photorealistic
2. `cn_scene_1024x1024.png`
   - Prompt (Chinese): 一只橘猫趴在堆满旧书的木桌上打盹,午后阳光透过窗帘洒进来,暖色调,胶片风格,细腻毛发纹理,超高清
   - English gloss: An orange tabby cat napping on a wooden desk piled with old books, afternoon sunlight streaming through the curtains, warm tones, film style, fine fur texture, ultra HD

## Benchmark & Performance

Test environment:

- GPU: NVIDIA GeForce RTX 5090
- Framework: PyTorch 2.10.0+cu130
- Inference settings: 1024×1024, 50 steps, guidance=4.0, CPU offload enabled
- Cases: 5 prompts (portrait_01, portrait_02, landscape_01, scene_01, night_01)

### Aggregate Comparison (Baseline vs INT8)

| Metric | Baseline | INT8 | Delta |
| --- | --- | --- | --- |
| Avg elapsed / image (s) | 49.0282 | 46.7867 | -4.6% |
| Avg sec / step | 0.980564 | 0.935733 | -4.6% |
| Avg peak CUDA alloc (GB) | 12.5195 | 7.7470 | -38.1% |

Results may vary across hardware, drivers, and PyTorch/CUDA versions.
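
The deltas above can be reproduced directly from the raw numbers in the table:

```python
# Recompute the aggregate deltas from the table values.
baseline_s, int8_s = 49.0282, 46.7867    # avg elapsed / image (s)
baseline_gb, int8_gb = 12.5195, 7.7470   # avg peak CUDA alloc (GB)

speed_delta = (int8_s - baseline_s) / baseline_s * 100
mem_delta = (int8_gb - baseline_gb) / baseline_gb * 100

print(f"speed:  {speed_delta:.1f}%")  # -4.6%
print(f"memory: {mem_delta:.1f}%")    # -38.1%
```

The memory saving (about 38%) is the main benefit; the speed gain is modest because weight-only quantization mostly reduces storage and transfer cost rather than compute.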

### Per-Case Results

| Case | Baseline (s) | INT8 (s) | Speedup |
| --- | --- | --- | --- |
| portrait_01 | 56.9943 | 50.1124 | 1.14x |
| portrait_02 | 50.3810 | 46.0371 | 1.09x |
| landscape_01 | 46.0286 | 46.0192 | 1.00x |
| scene_01 | 45.9097 | 45.8291 | 1.00x |
| night_01 | 45.8275 | 45.9356 | 1.00x |
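
The speedup column is simply the baseline time divided by the INT8 time; a quick check of the per-case numbers:

```python
# (baseline seconds, INT8 seconds) per benchmark case, from the table above.
cases = {
    "portrait_01": (56.9943, 50.1124),
    "portrait_02": (50.3810, 46.0371),
    "landscape_01": (46.0286, 46.0192),
    "scene_01": (45.9097, 45.8291),
    "night_01": (45.8275, 45.9356),
}

speedups = {name: base / int8 for name, (base, int8) in cases.items()}
for name, s in speedups.items():
    print(f"{name:<13} {s:.2f}x")
```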

### Visual Comparison (Baseline vs INT8)

Left: baseline. Right: INT8. (Same prompt/seed/steps.) Side-by-side comparison images for portrait_01, portrait_02, landscape_01, scene_01, and night_01 are available under `zimage_quanto_bench_results/`.

## Limitations

- This is weight-only INT8 quantization; activations keep their original precision.
- Minor visual differences from the baseline may appear for some prompts.
- `enable_model_cpu_offload()` can change the latency distribution across pipeline stages.
- For extreme resolutions or very long step counts, validate quality and stability before relying on the results.

## Intended Use

Recommended for:

- Running Z-Image with lower VRAM usage.
- Improving throughput while keeping quality close to the baseline.

Not recommended as-is for:

- Safety-critical decision workflows.
- High-risk generation use cases without additional review/guardrails.

## Citation

If you use this model, please cite/reference the upstream model and toolchain:

- Tongyi-MAI/Z-Image
- Hugging Face Diffusers
- optimum-quanto