---
language:
  - en
license: other
library_name: diffusers
pipeline_tag: text-to-image
tags:
  - text-to-image
  - diffusers
  - quanto
  - int8
  - z-image
  - transformer-quantization
base_model:
  - Tongyi-MAI/Z-Image
base_model_relation: quantized
---

# Z-Image INT8 (Quanto)

This repository provides an INT8-quantized variant of Tongyi-MAI/Z-Image:

- Only the `transformer` is quantized, using Quanto weight-only INT8.
- The `text_encoder`, `vae`, `scheduler`, and `tokenizer` remain unchanged.
- The inference API stays compatible with `diffusers.ZImagePipeline`.

Please follow the original upstream model's license and usage terms: `license: other` means this repo inherits the upstream licensing constraints.

## Model Details

- Base model: Tongyi-MAI/Z-Image
- Quantization method: optimum-quanto (weight-only INT8)
- Quantized component: `transformer`
- Compute dtype: bfloat16
- Pipeline: `diffusers.ZImagePipeline`
- Negative prompt support: yes (same pipeline API as the base model)
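
Weight-only INT8 means the transformer's linear weights are stored as 8-bit integers plus a scale, and are dequantized back to the compute dtype at inference time. The following is a minimal NumPy sketch of per-channel symmetric quantization to illustrate the idea; the actual Quanto implementation differs in details such as tensor layout and kernel dispatch:

```python
import numpy as np

def quantize_int8(w):
    """Per-output-channel symmetric INT8 quantization (weight-only sketch)."""
    # One scale per output row, chosen so the largest |value| maps to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # At inference, INT8 weights are rescaled back to the compute dtype.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.dtype, q.nbytes, w.nbytes)     # int8 storage is 4x smaller than float32
print(float(np.abs(w - w_hat).max()))  # small per-weight reconstruction error
```

This is why the quantized `transformer` shrinks its weight storage by roughly 4x versus float32 (2x versus bfloat16) while keeping outputs close to the baseline.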

## Platform Support

- ✅ Supported: Linux/Windows with NVIDIA CUDA
- ⚠️ Limited support: macOS on Apple Silicon (MPS; usually much slower than CUDA)
- ❌ Not supported: macOS on Intel

## Files

Key files in this repository:

- `model_index.json`
- `transformer/diffusion_pytorch_model.safetensors` (INT8-quantized weights)
- `text_encoder/*`, `vae/*`, `scheduler/*`, `tokenizer/*` (not quantized)
- `zimage_quanto_bench_results/*` (benchmark metrics and baseline-vs-INT8 comparison images)
- `test_outputs/*` (generated examples)

## Installation

Python 3.10+ is recommended.

```bash
# Create a virtual environment (optional)
python -m venv .venv

# Windows
.venv\Scripts\activate

# Linux/macOS
# source .venv/bin/activate

python -m pip install --upgrade pip

# PyTorch (NVIDIA CUDA, example index)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# PyTorch (macOS Apple Silicon, MPS)
# pip install torch

# Inference dependencies
pip install diffusers transformers accelerate safetensors sentencepiece optimum-quanto pillow
```

## Quick Start (Diffusers)

This repository already stores quantized weights, so you do not need to re-run quantization at load time.

```python
import torch
from diffusers import ZImagePipeline

model_id = "ixim/Z-Image-INT8"

if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16
elif torch.backends.mps.is_available():
    # Apple Silicon
    device = "mps"
    dtype = torch.bfloat16
else:
    # CPU fallback (functional but very slow for this model)
    device = "cpu"
    dtype = torch.float32

pipe = ZImagePipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
)

pipe.enable_attention_slicing()

if device == "cuda":
    # Offload idle submodules to CPU to reduce peak VRAM usage.
    pipe.enable_model_cpu_offload()
else:
    pipe = pipe.to(device)

prompt = "A cinematic portrait of a young woman, soft lighting, high detail"
negative_prompt = "blurry, sad, low quality, distorted face, extra limbs, artifacts"

# Use a CPU generator for best cross-device reproducibility (cpu/mps/cuda).
generator = torch.Generator(device="cpu").manual_seed(42)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=4.0,
    generator=generator,
).images[0]

image.save("zimage_int8_sample.png")
print("Saved: zimage_int8_sample.png")
```

## macOS Notes & Troubleshooting

- macOS on Intel is no longer supported for this model in this repository.
- For macOS inference, use Apple Silicon (`mps`) only.
- On Apple Silicon, warnings such as "CUDA not available" and "Disabling autocast" are expected on non-CUDA execution paths.
- Slower generation on a Mac is expected compared with high-end NVIDIA GPUs. To improve speed on Apple Silicon:
  - Make sure the script actually selects `mps` (as in the example above), not `cpu`.
  - Start from `height=512, width=512` and fewer steps (e.g., 20-28) before scaling up.
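
As a rough planning aid, total generation time scales approximately linearly with the step count (ignoring fixed costs such as text encoding and VAE decode). A small sketch using the INT8 per-step time from the CUDA benchmark in this README as a reference point; expect MPS per-step times to be substantially higher:

```python
# Per-step time observed for INT8 on an RTX 5090 at 1024x1024 (see the benchmark section).
# On Apple Silicon (MPS), the per-step time will be considerably larger.
sec_per_step = 0.935733

for steps in (20, 28, 50):
    # Linear estimate: ignores fixed encode/decode overhead.
    print(f"{steps:>2} steps ~ {sec_per_step * steps:5.1f} s")
```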

## Additional Generated Samples (INT8)

These two images were generated with this quantized model:

1. `en_portrait_1024x1024.png`
   - Prompt: A cinematic portrait of a young woman standing by the window, golden hour sunlight, shallow depth of field, film grain, ultra-detailed skin texture, photorealistic
2. `cn_scene_1024x1024.png`
   - Prompt (Chinese): 一只橘猫趴在堆满旧书的木桌上打盹,午后阳光透过窗帘洒进来,暖色调,胶片风格,细腻毛发纹理,超高清
   - English gloss: An orange tabby cat napping on a wooden desk piled with old books, afternoon sunlight streaming through the curtains, warm tones, film style, fine fur texture, ultra HD

## Benchmark & Performance

Test environment:

- GPU: NVIDIA GeForce RTX 5090
- Framework: PyTorch 2.10.0+cu130
- Inference settings: 1024×1024, 50 steps, guidance=4.0, CPU offload enabled
- Cases: 5 prompts (portrait_01, portrait_02, landscape_01, scene_01, night_01)

### Aggregate Comparison (Baseline vs INT8)

| Metric | Baseline | INT8 | Delta |
| --- | --- | --- | --- |
| Avg elapsed / image (s) | 49.0282 | 46.7867 | -4.6% |
| Avg sec / step | 0.980564 | 0.935733 | -4.6% |
| Avg peak CUDA alloc (GB) | 12.5195 | 7.7470 | -38.1% |

Results may vary across hardware, drivers, and PyTorch/CUDA versions.
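
The deltas above can be reproduced directly from the raw numbers in the table:

```python
# Recompute the aggregate deltas from the table values.
baseline_s, int8_s = 49.0282, 46.7867    # avg elapsed / image (s)
baseline_gb, int8_gb = 12.5195, 7.7470   # avg peak CUDA alloc (GB)

speed_delta = (int8_s - baseline_s) / baseline_s * 100
mem_delta = (int8_gb - baseline_gb) / baseline_gb * 100

print(f"speed:  {speed_delta:.1f}%")  # -4.6%
print(f"memory: {mem_delta:.1f}%")    # -38.1%
```

The memory saving (about 38%) is the main benefit; the speed gain is modest because weight-only quantization mostly reduces storage and transfer cost rather than compute.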

### Per-Case Results

| Case | Baseline (s) | INT8 (s) | Speedup |
| --- | --- | --- | --- |
| portrait_01 | 56.9943 | 50.1124 | 1.14x |
| portrait_02 | 50.3810 | 46.0371 | 1.09x |
| landscape_01 | 46.0286 | 46.0192 | 1.00x |
| scene_01 | 45.9097 | 45.8291 | 1.00x |
| night_01 | 45.8275 | 45.9356 | 1.00x |
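
The speedup column is simply the baseline time divided by the INT8 time; a quick check of the per-case numbers:

```python
# (baseline seconds, INT8 seconds) per benchmark case, from the table above.
cases = {
    "portrait_01": (56.9943, 50.1124),
    "portrait_02": (50.3810, 46.0371),
    "landscape_01": (46.0286, 46.0192),
    "scene_01": (45.9097, 45.8291),
    "night_01": (45.8275, 45.9356),
}

speedups = {name: base / int8 for name, (base, int8) in cases.items()}
for name, s in speedups.items():
    print(f"{name:<13} {s:.2f}x")
```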

### Visual Comparison (Baseline vs INT8)

Left: baseline. Right: INT8. (Same prompt/seed/steps.) Side-by-side comparison images for portrait_01, portrait_02, landscape_01, scene_01, and night_01 are available under `zimage_quanto_bench_results/`.

## Limitations

- This is weight-only INT8 quantization; activations keep their original precision.
- Minor visual differences from the baseline may appear for some prompts.
- `enable_model_cpu_offload()` can change the latency distribution across pipeline stages.
- For extreme resolutions or very long step counts, validate quality and stability before relying on the results.

## Intended Use

Recommended for:

- Running Z-Image with lower VRAM usage.
- Improving throughput while keeping quality close to the baseline.

Not recommended as-is for:

- Safety-critical decision workflows.
- High-risk generation use cases without additional review/guardrails.

## Citation

If you use this model, please cite/reference the upstream model and toolchain:

- Tongyi-MAI/Z-Image
- Hugging Face Diffusers
- optimum-quanto