---
language:
- en
license: other
library_name: diffusers
pipeline_tag: text-to-image
tags:
- text-to-image
- diffusers
- quanto
- int8
- z-image
- transformer-quantization
base_model:
- Tongyi-MAI/Z-Image
base_model_relation: quantized
---

# Z-Image INT8 (Quanto)

This repository provides an INT8-quantized variant of [Tongyi-MAI/Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image):

- **Only** the `transformer` is quantized with **Quanto weight-only INT8**.
- `text_encoder`, `vae`, `scheduler`, and `tokenizer` remain unchanged.
- The inference API stays compatible with `diffusers.ZImagePipeline`.

> Please follow the original upstream model license and usage terms. `license: other` means this repo inherits upstream licensing constraints.

## Model Details

- **Base model**: `Tongyi-MAI/Z-Image`
- **Quantization method**: `optimum-quanto` (weight-only INT8)
- **Quantized part**: `transformer`
- **Compute dtype**: `bfloat16`
- **Pipeline**: `diffusers.ZImagePipeline`
- **Negative prompt support**: Yes (same pipeline API as the base model)

## Platform Support

- ✅ Supported: Linux/Windows with NVIDIA CUDA
- ⚠️ Limited support: macOS Apple Silicon (MPS, usually much slower than CUDA)
- ❌ Not supported: macOS Intel

## Files

Key files in this repository:

- `model_index.json`
- `transformer/diffusion_pytorch_model.safetensors` (INT8-quantized weights)
- `text_encoder/*`, `vae/*`, `scheduler/*`, `tokenizer/*` (not quantized)
- `zimage_quanto_bench_results/*` (benchmark metrics and baseline-vs-INT8 images)
- `test_outputs/*` (generated examples)

## Installation

Python 3.10+ is recommended.
```bash
# Create env (optional)
python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux/macOS
# source .venv/bin/activate

python -m pip install --upgrade pip

# PyTorch (NVIDIA CUDA, example)
pip install torch --index-url https://download.pytorch.org/whl/cu128
# PyTorch (macOS Apple Silicon, MPS)
# pip install torch

# Inference dependencies
pip install diffusers transformers accelerate safetensors sentencepiece optimum-quanto pillow
```

## Quick Start (Diffusers)

This repo already stores quantized weights, so you do **not** need to re-run quantization during loading.

```python
import torch
from diffusers import ZImagePipeline

model_id = "ixim/Z-Image-INT8"

if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16
elif torch.backends.mps.is_available():
    # Apple Silicon
    device = "mps"
    dtype = torch.bfloat16
else:
    # CPU fallback (functional but very slow for this model)
    device = "cpu"
    dtype = torch.float32

pipe = ZImagePipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
)
pipe.enable_attention_slicing()

if device == "cuda":
    pipe.enable_model_cpu_offload()
else:
    pipe = pipe.to(device)

prompt = "A cinematic portrait of a young woman, soft lighting, high detail"
negative_prompt = "blurry, sad, low quality, distorted face, extra limbs, artifacts"

# Use a CPU generator for best cross-device reproducibility (cpu/mps/cuda)
generator = torch.Generator(device="cpu").manual_seed(42)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=4.0,
    generator=generator,
).images[0]

image.save("zimage_int8_sample.png")
print("Saved: zimage_int8_sample.png")
```

## macOS Notes & Troubleshooting

- macOS Intel is no longer supported for this model in this repository.
- If you need macOS inference, use Apple Silicon (`mps`) only.
- On Apple Silicon, warnings like `CUDA not available` and `Disabling autocast` are expected in non-CUDA execution paths.
- Slow speed on a Mac is expected compared with high-end NVIDIA GPUs. To improve speed on Apple Silicon:
  - Ensure the script uses `mps` (as in the example above), not `cpu`.
  - Start from `height=512`, `width=512`, and fewer steps (e.g., 20-28) before scaling up.

## Additional Generated Samples (INT8)

These two images were generated with this quantized model:

### 1) `en_portrait_1024x1024.png`

- **Prompt**: `A cinematic portrait of a young woman standing by the window, golden hour sunlight, shallow depth of field, film grain, ultra-detailed skin texture, photorealistic`
### 2) `cn_scene_1024x1024.png`

- **Prompt**: `一只橘猫趴在堆满旧书的木桌上打盹,午后阳光透过窗帘洒进来,暖色调,胶片风格,细腻毛发纹理,超高清`
- **Translation**: "An orange tabby cat dozing on a wooden desk piled with old books, afternoon sunlight streaming in through the curtains, warm tones, film style, finely detailed fur texture, ultra high definition"
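For intuition, weight-only INT8 stores each linear weight matrix as `int8` values plus a per-channel scale, dequantizing back to a floating-point approximation at compute time. The sketch below illustrates the general scheme in plain PyTorch; it is a simplified illustration with hypothetical function names, not Quanto's actual implementation:

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    """Symmetric per-output-channel INT8 quantization of a 2-D weight matrix.

    Illustrative only; Quanto's internals differ in detail.
    """
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0  # one scale per output row
    scale = scale.clamp(min=1e-12)                     # guard against all-zero rows
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_weight(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover a floating-point approximation for the matmul at compute time."""
    return q.to(torch.float32) * scale

w = torch.randn(256, 512)
q, scale = quantize_weight_int8(w)
w_hat = dequantize_weight(q, scale)

# int8 storage is 4x smaller than fp32 (2x smaller than bf16), at the cost of
# a bounded rounding error of at most half a quantization step per element.
print("int8 bytes:", q.numel() * q.element_size())
print("max abs error:", (w - w_hat).abs().max().item())
```

This is why the quantized repo mainly saves memory: only the stored weights shrink, while activations and the compute dtype (`bfloat16` here) are unchanged.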
## Benchmark & Performance

Test environment:

- GPU: NVIDIA GeForce RTX 5090
- Framework: PyTorch 2.10.0+cu130
- Inference setting: 1024×1024, 50 steps, guidance=4.0, CPU offload enabled
- Cases: 5 prompts (`portrait_01`, `portrait_02`, `landscape_01`, `scene_01`, `night_01`)

### Aggregate Comparison (Baseline vs INT8)

| Metric | Baseline | INT8 | Delta |
|---|---:|---:|---:|
| Avg elapsed / image (s) | 49.0282 | 46.7867 | **-4.6%** |
| Avg sec / step | 0.980564 | 0.935733 | **-4.6%** |
| Avg peak CUDA alloc (GB) | 12.5195 | 7.7470 | **-38.1%** |

> Results may vary across hardware, drivers, and PyTorch/CUDA versions.

### Per-Case Results

| Case | Baseline (s) | INT8 (s) | Speedup |
|---|---:|---:|---:|
| portrait_01 | 56.9943 | 50.1124 | 1.14x |
| portrait_02 | 50.3810 | 46.0371 | 1.09x |
| landscape_01 | 46.0286 | 46.0192 | 1.00x |
| scene_01 | 45.9097 | 45.8291 | 1.00x |
| night_01 | 45.8275 | 45.9356 | 1.00x |

## Visual Comparison (Baseline vs INT8)

Left: Baseline. Right: INT8. (Same prompt/seed/steps.)

| Case | Base | INT8 |
|---|---|---|
| portrait_01 | ![](zimage_quanto_bench_results/images/baseline/portrait_01_seed46.png) | ![](zimage_quanto_bench_results/images/int8/portrait_01_seed46.png) |
| portrait_02 | ![](zimage_quanto_bench_results/images/baseline/portrait_02_seed111.png) | ![](zimage_quanto_bench_results/images/int8/portrait_02_seed111.png) |
| landscape_01 | ![](zimage_quanto_bench_results/images/baseline/landscape_01_seed123.png) | ![](zimage_quanto_bench_results/images/int8/landscape_01_seed123.png) |
| scene_01 | ![](zimage_quanto_bench_results/images/baseline/scene_01_seed777.png) | ![](zimage_quanto_bench_results/images/int8/scene_01_seed777.png) |
| night_01 | ![](zimage_quanto_bench_results/images/baseline/night_01_seed2026.png) | ![](zimage_quanto_bench_results/images/int8/night_01_seed2026.png) |

## Limitations

- This is **weight-only INT8** quantization; activation precision is unchanged.
- Minor visual differences may appear for some prompts.
- `enable_model_cpu_offload()` can change the latency distribution across pipeline stages.
- For extreme resolutions or very long step counts, validate quality and stability first.

## Intended Use

Recommended for:

- Running Z-Image with lower VRAM usage.
- Improving throughput while keeping quality close to the baseline.

Not recommended as-is for:

- Safety-critical decision workflows.
- High-risk generation use cases without additional review/guardrails.

## Citation

If you use this model, please cite/reference the upstream model and toolchain:

- Tongyi-MAI/Z-Image
- Hugging Face Diffusers
- optimum-quanto
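## Reproducing the Benchmark (Sketch)

The per-case latency and peak-memory numbers above come from timing each generation and reading CUDA's peak-allocation counter. A minimal sketch of such a measurement loop follows; `pipe` is assumed to be a loaded `ZImagePipeline` as in the Quick Start, and `benchmark_case` is an illustrative helper name, not part of this repo's tooling:

```python
import time
import torch

def benchmark_case(pipe, prompt: str, seed: int, steps: int = 50):
    """Measure wall-clock time and peak CUDA allocation for one generation.

    `pipe` is assumed to be a loaded ZImagePipeline (see Quick Start).
    Returns (image, elapsed_s, sec_per_step, peak_alloc_gb).
    """
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()  # make sure prior GPU work is done

    generator = torch.Generator(device="cpu").manual_seed(seed)
    start = time.perf_counter()
    image = pipe(
        prompt=prompt,
        num_inference_steps=steps,
        guidance_scale=4.0,
        generator=generator,
    ).images[0]
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # include all queued GPU work in the timing
    elapsed = time.perf_counter() - start

    peak_gb = (
        torch.cuda.max_memory_allocated() / 1024**3
        if torch.cuda.is_available()
        else float("nan")
    )
    return image, elapsed, elapsed / steps, peak_gb
```

Averaging the returned `elapsed` and `peak_gb` over the five prompts gives aggregate numbers of the kind reported above; exact values will differ across hardware, drivers, and PyTorch/CUDA versions.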