---
language:
- en
license: other
library_name: diffusers
pipeline_tag: text-to-image
tags:
- text-to-image
- diffusers
- quanto
- int8
- z-image
- transformer-quantization
base_model:
- Tongyi-MAI/Z-Image
base_model_relation: quantized
---

# Z-Image INT8 (Quanto)

This repository provides an INT8-quantized variant of [Tongyi-MAI/Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image):

- **Only** the `transformer` is quantized with **Quanto weight-only INT8**.
- `text_encoder`, `vae`, `scheduler`, and `tokenizer` remain unchanged.
- The inference API stays compatible with `diffusers.ZImagePipeline`.

> Please follow the original upstream model license and usage terms. `license: other` means this repo inherits the upstream licensing constraints.

## Model Details

- **Base model**: `Tongyi-MAI/Z-Image`
- **Quantization method**: `optimum-quanto` (weight-only INT8)
- **Quantized component**: `transformer`
- **Compute dtype**: `bfloat16`
- **Pipeline**: `diffusers.ZImagePipeline`
- **Negative prompt support**: yes (same pipeline API as the base model)

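Weight-only INT8 stores each weight tensor as 8-bit integers plus floating-point scales, and dequantizes back to the compute dtype at matmul time. A toy per-row sketch of the idea (an illustration only, not Quanto's actual implementation):

```python
def quantize_row(row):
    """Map a row of float weights to int8 values in [-127, 127] plus one scale."""
    scale = max(abs(v) for v in row) / 127 or 1.0
    return [round(v / scale) for v in row], scale

def dequantize_row(q, scale):
    """Recover approximate float weights from the int8 values and the scale."""
    return [v * scale for v in q]

w = [0.5, -1.0, 0.25, 0.75]
q, s = quantize_row(w)
w_hat = dequantize_row(q, s)
# Per-weight quantization error is bounded by about half the scale step.
print(q, round(s, 6))
```

Because only the weights are quantized, activations and the surrounding pipeline math keep their original precision, which is why quality typically stays close to the baseline.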
## Platform Support

- ✅ Supported: Linux/Windows with an NVIDIA CUDA GPU
- ⚠️ Limited support: macOS Apple Silicon (MPS; usually much slower than CUDA)
- ❌ Not supported: macOS Intel

## Files

Key files in this repository:

- `model_index.json`
- `transformer/diffusion_pytorch_model.safetensors` (INT8-quantized weights)
- `text_encoder/*`, `vae/*`, `scheduler/*`, `tokenizer/*` (not quantized)
- `zimage_quanto_bench_results/*` (benchmark metrics and baseline-vs-INT8 images)
- `test_outputs/*` (generated examples)

## Installation

Python 3.10+ is recommended.

```bash
# Create a virtual environment (optional)
python -m venv .venv

# Windows
.venv\Scripts\activate

# Linux/macOS
# source .venv/bin/activate

python -m pip install --upgrade pip

# PyTorch (NVIDIA CUDA, example index)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# PyTorch (macOS Apple Silicon, MPS)
# pip install torch

# Inference dependencies
pip install diffusers transformers accelerate safetensors sentencepiece optimum-quanto pillow
```

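After installing, a quick sanity check confirms which accelerator PyTorch can see (the version string and availability flags will differ on your machine):

```python
import torch

print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
# The MPS backend is only functional on macOS builds of PyTorch
print("mps available:", getattr(torch.backends, "mps", None) is not None
      and torch.backends.mps.is_available())
```

If `cuda available` prints `False` on an NVIDIA machine, you most likely installed a CPU-only wheel; reinstall from the CUDA index URL shown above.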
## Quick Start (Diffusers)

This repo already stores quantized weights, so you do **not** need to re-run quantization at load time.

```python
import torch
from diffusers import ZImagePipeline

model_id = "ixim/Z-Image-INT8"

if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16
elif torch.backends.mps.is_available():
    # Apple Silicon
    device = "mps"
    dtype = torch.bfloat16
else:
    # CPU fallback (functional but very slow for this model)
    device = "cpu"
    dtype = torch.float32

pipe = ZImagePipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
)

pipe.enable_attention_slicing()

if device == "cuda":
    pipe.enable_model_cpu_offload()
else:
    pipe = pipe.to(device)

prompt = "A cinematic portrait of a young woman, soft lighting, high detail"
negative_prompt = "blurry, sad, low quality, distorted face, extra limbs, artifacts"
# A CPU generator gives the best cross-device reproducibility (cpu/mps/cuda)
generator = torch.Generator(device="cpu").manual_seed(42)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=4.0,
    generator=generator,
).images[0]

image.save("zimage_int8_sample.png")
print("Saved: zimage_int8_sample.png")
```

## macOS Notes & Troubleshooting

- macOS Intel is not supported for this model in this repository.
- If you need macOS inference, use Apple Silicon (`mps`) only.
- On Apple Silicon, warnings like `CUDA not available` and `Disabling autocast` are expected on non-CUDA execution paths.
- Slower generation on a Mac is expected compared with high-end NVIDIA GPUs. To improve speed on Apple Silicon:
  - Ensure the script uses `mps` (as in the example above), not `cpu`.
  - Start from `height=512`, `width=512` and fewer steps (e.g., 20-28) before scaling up.

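As a rough rule of thumb, per-image compute grows at least linearly with pixel count times step count, so small test runs are far cheaper than full-size ones. A minimal sketch of that estimate (a heuristic, not a measured benchmark):

```python
def relative_cost(height, width, steps, base=(1024, 1024, 28)):
    """Crude lower-bound estimate: compute scales with pixels x steps.

    Attention cost actually grows faster than linearly in pixel count,
    so real savings at smaller resolutions are usually even larger.
    """
    base_h, base_w, base_steps = base
    return (height * width * steps) / (base_h * base_w * base_steps)

# A 512x512, 20-step test run vs. a 1024x1024, 28-step final run:
print(f"{relative_cost(512, 512, 20):.2f}")  # → 0.18
```

In other words, a 512×512, 20-step trial costs under a fifth of a full 1024×1024, 28-step render, which makes it a cheap way to iterate on prompts before committing to full resolution.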
## Additional Generated Samples (INT8)

These two images were generated with this quantized model:

### 1) `en_portrait_1024x1024.png`

- **Prompt**: `A cinematic portrait of a young woman standing by the window, golden hour sunlight, shallow depth of field, film grain, ultra-detailed skin texture, photorealistic`

<div align="center"><img src="test_outputs/en_portrait_1024x1024.png" width="512" /></div>

### 2) `cn_scene_1024x1024.png`

- **Prompt**: `一只橘猫趴在堆满旧书的木桌上打盹,午后阳光透过窗帘洒进来,暖色调,胶片风格,细腻毛发纹理,超高清` (English: an orange tabby cat dozing on a wooden desk piled with old books, afternoon sunlight streaming through the curtains, warm tones, film style, fine fur texture, ultra high definition)

<div align="center"><img src="test_outputs/cn_scene_1024x1024.png" width="512" /></div>

## Benchmark & Performance

Test environment:

- GPU: NVIDIA GeForce RTX 5090
- Framework: PyTorch 2.10.0+cu130
- Inference setting: 1024×1024, 50 steps, guidance=4.0, CPU offload enabled
- Cases: 5 prompts (`portrait_01`, `portrait_02`, `landscape_01`, `scene_01`, `night_01`)

### Aggregate Comparison (Baseline vs INT8)

| Metric | Baseline | INT8 | Delta |
|---|---:|---:|---:|
| Avg elapsed / image (s) | 49.0282 | 46.7867 | **-4.6%** |
| Avg sec / step | 0.980564 | 0.935733 | **-4.6%** |
| Avg peak CUDA alloc (GB) | 12.5195 | 7.7470 | **-38.1%** |

> Results may vary across hardware, drivers, and PyTorch/CUDA versions.

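The delta column above follows directly from the raw averages:

```python
def pct_delta(baseline, quantized):
    """Relative change of the INT8 run vs. the baseline, in percent."""
    return (quantized - baseline) / baseline * 100

print(f"{pct_delta(49.0282, 46.7867):.1f}%")  # elapsed / image → -4.6%
print(f"{pct_delta(12.5195, 7.7470):.1f}%")   # peak CUDA alloc → -38.1%
```

The memory reduction is the headline result here; the speed gain is modest because the sampling loop is dominated by compute, not weight loading.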
### Per-Case Results

| Case | Baseline (s) | INT8 (s) | Speedup |
|---|---:|---:|---:|
| portrait_01 | 56.9943 | 50.1124 | 1.14x |
| portrait_02 | 50.3810 | 46.0371 | 1.09x |
| landscape_01 | 46.0286 | 46.0192 | 1.00x |
| scene_01 | 45.9097 | 45.8291 | 1.00x |
| night_01 | 45.8275 | 45.9356 | 1.00x |

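The speedup column is simply baseline time divided by INT8 time, rounded to two decimals (note that `night_01` is actually fractionally slower before rounding):

```python
cases = {
    "portrait_01": (56.9943, 50.1124),
    "portrait_02": (50.3810, 46.0371),
    "landscape_01": (46.0286, 46.0192),
    "scene_01": (45.9097, 45.8291),
    "night_01": (45.8275, 45.9356),
}

for name, (baseline_s, int8_s) in cases.items():
    print(f"{name}: {baseline_s / int8_s:.2f}x")
```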
## Visual Comparison (Baseline vs INT8)

Left: baseline. Right: INT8. (Same prompt, seed, and steps.)

| Case | Baseline | INT8 |
|---|---|---|
| portrait_01 | ![base](zimage_quanto_bench_results/baseline_portrait_01.png) | ![int8](zimage_quanto_bench_results/int8_portrait_01.png) |
| portrait_02 | ![base](zimage_quanto_bench_results/baseline_portrait_02.png) | ![int8](zimage_quanto_bench_results/int8_portrait_02.png) |
| landscape_01 | ![base](zimage_quanto_bench_results/baseline_landscape_01.png) | ![int8](zimage_quanto_bench_results/int8_landscape_01.png) |
| scene_01 | ![base](zimage_quanto_bench_results/baseline_scene_01.png) | ![int8](zimage_quanto_bench_results/int8_scene_01.png) |
| night_01 | ![base](zimage_quanto_bench_results/baseline_night_01.png) | ![int8](zimage_quanto_bench_results/int8_night_01.png) |

## Limitations

- This is **weight-only INT8** quantization; activation precision is unchanged.
- Minor visual differences from the baseline may appear on some prompts.
- `enable_model_cpu_offload()` can change how latency is distributed across pipeline stages.
- For extreme resolutions or very long step counts, validate quality and stability first.

## Intended Use

Recommended for:

- Running Z-Image with lower VRAM usage.
- Improving throughput while keeping quality close to the baseline.

Not recommended as-is for:

- Safety-critical decision workflows.
- High-risk generation use cases without additional review or guardrails.

## Citation

If you use this model, please cite or reference the upstream model and toolchain:

- Tongyi-MAI/Z-Image
- Hugging Face Diffusers
- optimum-quanto