ixim committed
Commit c84bbf4 · verified · 1 parent: 41ceaa1

Update README.md

Files changed (1): README.md (+213, -191)
README.md CHANGED
---
language:
- en
license: other
library_name: diffusers
pipeline_tag: text-to-image
tags:
- text-to-image
- diffusers
- quanto
- int8
- z-image
- transformer-quantization
base_model:
- Tongyi-MAI/Z-Image
---

# Z-Image INT8 (Quanto)

This repository provides an INT8-quantized variant of [Tongyi-MAI/Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image):
- **Only** the `transformer` is quantized with **Quanto weight-only INT8**.
- `text_encoder`, `vae`, `scheduler`, and `tokenizer` remain unchanged.
- The inference API stays compatible with `diffusers.ZImagePipeline`.

> Please follow the original upstream model license and usage terms. `license: other` means this repo inherits upstream licensing constraints.

## Model Details

- **Base model**: `Tongyi-MAI/Z-Image`
- **Quantization method**: `optimum-quanto` (weight-only INT8)
- **Quantized part**: `transformer`
- **Compute dtype**: `bfloat16`
- **Pipeline**: `diffusers.ZImagePipeline`
- **Negative prompt support**: Yes (same pipeline API as the base model)
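
Weight-only INT8 keeps each weight tensor as 8-bit integers plus a floating-point scale, dequantizing at compute time. As a toy illustration of the idea (per-tensor symmetric quantization in plain Python; this is *not* the actual `optimum-quanto` implementation, which has its own calibration and packing):

```python
def quantize_int8(weights):
    """Per-tensor symmetric weight-only INT8 quantization (toy sketch)."""
    scale = max(abs(w) for w in weights) / 127.0  # map max |w| to 127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [qi * scale for qi in q]

weights = [0.4, -1.27, 0.003, 0.9, -0.05]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_err = max(abs(w - a) for w, a in zip(weights, approx))
assert max_err <= scale / 2 + 1e-9
print(f"scale={scale:.5f}, max_err={max_err:.5f}")
```

Because the per-weight rounding error is bounded by `scale / 2`, weight-only INT8 typically stays visually close to the bf16 baseline.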

## Files

Key files in this repository:
- `model_index.json`
- `transformer/diffusion_pytorch_model.safetensors` (INT8-quantized weights)
- `text_encoder/*`, `vae/*`, `scheduler/*`, `tokenizer/*` (not quantized)
- `zimage_quanto_bench_results/*` (benchmark metrics and baseline-vs-INT8 images)
- `test_outputs/*` (generated examples)

## Installation

Python 3.10+ is recommended.

```bash
# Create a virtual environment (optional)
python -m venv .venv

# Windows
.venv\Scripts\activate

# Linux/macOS
# source .venv/bin/activate

python -m pip install --upgrade pip

# PyTorch (NVIDIA CUDA, example)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# PyTorch (macOS / CPU-only, example)
# pip install torch

# Inference dependencies
pip install diffusers transformers accelerate safetensors sentencepiece optimum-quanto pillow

# Recommended minimum versions (helps avoid backend compatibility issues)
pip install -U "torch>=2.4" "diffusers>=0.36.0" "accelerate>=0.33"
```
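
To confirm the version pins above are satisfied before running inference, a stdlib-only sketch can compare installed releases (a hypothetical helper: it looks only at the numeric release segment and ignores pre-release tags, so use `packaging.version` for anything stricter):

```python
from importlib.metadata import version, PackageNotFoundError

def release_tuple(v: str) -> tuple:
    """Leading numeric release segment only, e.g. "2.10.0+cu130" -> (2, 10, 0)."""
    parts = []
    for piece in v.split("+")[0].split("."):
        num = ""
        for ch in piece:
            if ch.isdigit():
                num += ch
            else:
                break
        if not num:
            break
        parts.append(int(num))
    return tuple(parts)

def meets_minimum(pkg: str, minimum: str) -> bool:
    """True if the installed package's release is at least `minimum`."""
    try:
        return release_tuple(version(pkg)) >= release_tuple(minimum)
    except PackageNotFoundError:
        return False

for pkg, minimum in [("torch", "2.4"), ("diffusers", "0.36.0"), ("accelerate", "0.33")]:
    status = "OK" if meets_minimum(pkg, minimum) else f"missing or < {minimum}"
    print(f"{pkg}: {status}")
```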

## Quick Start (Diffusers)

This repo already stores quantized weights, so you do **not** need to re-run quantization during loading.

```python
import torch
from diffusers import ZImagePipeline

model_id = "ixim/Z-Image-INT8"

if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16
elif torch.backends.mps.is_available():
    # Apple Silicon
    device = "mps"
    dtype = torch.float16
else:
    # Intel Mac / CPU-only
    device = "cpu"
    dtype = torch.float32

pipe = ZImagePipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
)

if device == "cuda":
    pipe.enable_model_cpu_offload()
else:
    pipe = pipe.to(device)

prompt = "A cinematic portrait of a young woman, soft lighting, high detail"
negative_prompt = "blurry, low quality, distorted face, extra limbs, artifacts"
# Use a CPU generator for best cross-device compatibility (cpu/mps/cuda)
generator = torch.Generator(device="cpu").manual_seed(42)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=4.0,
    generator=generator,
).images[0]

image.save("zimage_int8_sample.png")
print("Saved: zimage_int8_sample.png")
```
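
Weight-only INT8 stores one byte per transformer weight instead of two for `bfloat16`, so the weight-memory saving on the quantized component is about 50% by construction. A back-of-the-envelope sketch (`num_params` is a placeholder, not the actual Z-Image transformer parameter count):

```python
def weight_bytes(num_params: int, bytes_per_weight: int) -> int:
    """Storage for the weight tensors alone, ignoring activations and buffers."""
    return num_params * bytes_per_weight

num_params = 6_000_000_000  # placeholder, not Z-Image's actual size

bf16 = weight_bytes(num_params, 2)  # bfloat16: 2 bytes per weight
int8 = weight_bytes(num_params, 1)  # INT8:     1 byte per weight

print(f"bf16 weights: {bf16 / 1024**3:.2f} GiB")
print(f"int8 weights: {int8 / 1024**3:.2f} GiB")
print(f"weight-only saving: {1 - int8 / bf16:.0%}")
```

Measured peak allocations shrink by less than 50% (see the benchmark section) because activations and the unquantized components still use their original precision.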

## macOS Notes & Troubleshooting

- `AttributeError: module 'torch' has no attribute 'xpu'` is usually a backend/version compatibility issue in the local environment, not a model issue.
- Fix it by upgrading to recent versions:
  - `pip install -U "torch>=2.4" "diffusers>=0.36.0" "accelerate>=0.33"`
- On Apple Silicon, warnings like `CUDA not available` and `Disabling autocast` are expected on non-CUDA execution paths.
- Generation on a Mac is expected to be slower than on a high-end NVIDIA GPU. To improve speed on Apple Silicon:
  - Make sure the script selects `mps` (as in the example above), not `cpu`.
  - Start at `height=512`, `width=512` with fewer steps (e.g., 20-28) before scaling up.

## Additional Generated Samples (INT8)

These two images were generated with this quantized model:

### 1) `en_portrait_1024x1024.png`

- **Prompt**: `A cinematic portrait of a young woman standing by the window, golden hour sunlight, shallow depth of field, film grain, ultra-detailed skin texture, photorealistic`

<div align="center"><img src="test_outputs/en_portrait_1024x1024.png" width="512" /></div>

### 2) `cn_scene_1024x1024.png`

- **Prompt**: `一只橘猫趴在堆满旧书的木桌上打盹,午后阳光透过窗帘洒进来,暖色调,胶片风格,细腻毛发纹理,超高清` (English: an orange tabby cat dozing on a wooden desk piled with old books, afternoon sunlight filtering through the curtains, warm tones, film style, finely detailed fur, ultra high definition)

<div align="center"><img src="test_outputs/cn_scene_1024x1024.png" width="512" /></div>

## Benchmark & Performance

Test environment:
- GPU: NVIDIA GeForce RTX 5090
- Framework: PyTorch 2.10.0+cu130
- Inference settings: 1024×1024, 28 steps, guidance=4.0, CPU offload enabled
- Cases: 4 prompts (`portrait_01`, `portrait_02`, `scene_01`, `night_01`)

### Aggregate Comparison (Baseline vs INT8)

| Metric | Baseline | INT8 | Delta |
|---|---:|---:|---:|
| Avg elapsed / image (s) | 51.7766 | 39.5662 | **-23.6%** |
| Avg sec / step | 1.8492 | 1.4131 | **-23.6%** |
| Avg peak CUDA alloc (GB) | 12.5195 | 7.7470 | **-38.1%** |

> Results may vary across hardware, drivers, and PyTorch/CUDA versions.

### Per-Case Results

| Case | Baseline (s) | INT8 (s) | Speedup |
|---|---:|---:|---:|
| portrait_01 | 99.9223 | 60.6768 | 1.65x |
| portrait_02 | 37.4116 | 32.8863 | 1.14x |
| scene_01 | 34.9946 | 32.2035 | 1.09x |
| night_01 | 34.7780 | 32.4981 | 1.07x |
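
As a sanity check, the reported deltas and speedups follow directly from the raw table values:

```python
def pct_delta(baseline: float, quantized: float) -> float:
    """Percentage change vs baseline; negative means the INT8 run is cheaper."""
    return round((quantized / baseline - 1) * 100, 1)

def speedup(baseline: float, quantized: float) -> float:
    """Wall-clock speedup factor of INT8 over baseline."""
    return round(baseline / quantized, 2)

print(pct_delta(51.7766, 39.5662))  # -23.6  (avg elapsed / image)
print(pct_delta(12.5195, 7.7470))   # -38.1  (avg peak CUDA alloc)
print(speedup(99.9223, 60.6768))    # 1.65   (portrait_01)
```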

## Visual Comparison (Baseline vs INT8)

Left: Baseline. Right: INT8. (Same prompt, seed, and steps.)

| Case | Baseline | INT8 |
|---|---|---|
| portrait_01 | ![](zimage_quanto_bench_results/images/baseline/portrait_01_seed46.png) | ![](zimage_quanto_bench_results/images/int8/portrait_01_seed46.png) |
| portrait_02 | ![](zimage_quanto_bench_results/images/baseline/portrait_02_seed123.png) | ![](zimage_quanto_bench_results/images/int8/portrait_02_seed123.png) |
| scene_01 | ![](zimage_quanto_bench_results/images/baseline/scene_01_seed777.png) | ![](zimage_quanto_bench_results/images/int8/scene_01_seed777.png) |
| night_01 | ![](zimage_quanto_bench_results/images/baseline/night_01_seed2026.png) | ![](zimage_quanto_bench_results/images/int8/night_01_seed2026.png) |

## Limitations

- This is **weight-only INT8** quantization; activation precision is unchanged.
- Minor visual differences may appear on some prompts.
- `enable_model_cpu_offload()` can change the latency distribution across pipeline stages.
- At extreme resolutions or very long step counts, validate quality and stability before relying on the output.

## Intended Use

Recommended for:
- Running Z-Image with lower VRAM usage.
- Improving throughput while keeping quality close to baseline.

Not recommended as-is for:
- Safety-critical decision workflows.
- High-risk generation use cases without additional review/guardrails.

## Citation

If you use this model, please cite/reference the upstream model and toolchain:
- Tongyi-MAI/Z-Image
- Hugging Face Diffusers
- optimum-quanto