---
language:
- en
license: other
library_name: diffusers
pipeline_tag: text-to-image
tags:
- text-to-image
- diffusers
- quanto
- int8
- z-image
- transformer-quantization
base_model:
- Tongyi-MAI/Z-Image
base_model_relation: quantized
---

# Z-Image INT8 (Quanto)

This repository provides an INT8-quantized variant of [Tongyi-MAI/Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image):

- **Only** the `transformer` is quantized with **Quanto weight-only INT8**.
- `text_encoder`, `vae`, `scheduler`, and `tokenizer` remain unchanged.
- The inference API stays compatible with `diffusers.ZImagePipeline`.

> Please follow the original upstream model license and usage terms. `license: other` means this repo inherits the upstream licensing constraints.

## Model Details

- **Base model**: `Tongyi-MAI/Z-Image`
- **Quantization method**: `optimum-quanto` (weight-only INT8)
- **Quantized component**: `transformer`
- **Compute dtype**: `bfloat16`
- **Pipeline**: `diffusers.ZImagePipeline`
- **Negative prompt support**: yes (same pipeline API as the base model)

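Weight-only INT8 stores each weight tensor as 8-bit integers plus floating-point scales, and dequantizes back to the compute dtype at matmul time. A toy per-row sketch of the idea (an illustration only, not Quanto's actual implementation):

```python
def quantize_row(row):
    """Map a row of float weights to int8 values in [-127, 127] plus one scale."""
    scale = max(abs(v) for v in row) / 127 or 1.0
    return [round(v / scale) for v in row], scale

def dequantize_row(q, scale):
    """Recover approximate float weights from the int8 values and the scale."""
    return [v * scale for v in q]

w = [0.5, -1.0, 0.25, 0.75]
q, s = quantize_row(w)
w_hat = dequantize_row(q, s)
# Per-weight quantization error is bounded by about half the scale step.
print(q, round(s, 6))
```

Because only the weights are quantized, activations and the surrounding pipeline math keep their original precision, which is why quality typically stays close to the baseline.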
## Platform Support

- ✅ Supported: Linux/Windows with an NVIDIA CUDA GPU
- ⚠️ Limited support: macOS Apple Silicon (MPS; usually much slower than CUDA)
- ❌ Not supported: macOS Intel

## Files

Key files in this repository:

- `model_index.json`
- `transformer/diffusion_pytorch_model.safetensors` (INT8-quantized weights)
- `text_encoder/*`, `vae/*`, `scheduler/*`, `tokenizer/*` (not quantized)
- `zimage_quanto_bench_results/*` (benchmark metrics and baseline-vs-INT8 images)
- `test_outputs/*` (generated examples)

## Installation

Python 3.10+ is recommended.

```bash
# Create a virtual environment (optional)
python -m venv .venv

# Windows
.venv\Scripts\activate

# Linux/macOS
# source .venv/bin/activate

python -m pip install --upgrade pip

# PyTorch (NVIDIA CUDA, example index)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# PyTorch (macOS Apple Silicon, MPS)
# pip install torch

# Inference dependencies
pip install diffusers transformers accelerate safetensors sentencepiece optimum-quanto pillow
```

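After installing, a quick sanity check confirms which accelerator PyTorch can see (the version string and availability flags will differ on your machine):

```python
import torch

print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
# The MPS backend is only functional on macOS builds of PyTorch
print("mps available:", getattr(torch.backends, "mps", None) is not None
      and torch.backends.mps.is_available())
```

If `cuda available` prints `False` on an NVIDIA machine, you most likely installed a CPU-only wheel; reinstall from the CUDA index URL shown above.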
## Quick Start (Diffusers)

This repo already stores quantized weights, so you do **not** need to re-run quantization at load time.

```python
import torch
from diffusers import ZImagePipeline

model_id = "ixim/Z-Image-INT8"

if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16
elif torch.backends.mps.is_available():
    # Apple Silicon
    device = "mps"
    dtype = torch.bfloat16
else:
    # CPU fallback (functional but very slow for this model)
    device = "cpu"
    dtype = torch.float32

pipe = ZImagePipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
)

pipe.enable_attention_slicing()

if device == "cuda":
    pipe.enable_model_cpu_offload()
else:
    pipe = pipe.to(device)

prompt = "A cinematic portrait of a young woman, soft lighting, high detail"
negative_prompt = "blurry, sad, low quality, distorted face, extra limbs, artifacts"
# A CPU generator gives the best cross-device reproducibility (cpu/mps/cuda)
generator = torch.Generator(device="cpu").manual_seed(42)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=4.0,
    generator=generator,
).images[0]

image.save("zimage_int8_sample.png")
print("Saved: zimage_int8_sample.png")
```

## macOS Notes & Troubleshooting

- macOS Intel is not supported for this model in this repository.
- If you need macOS inference, use Apple Silicon (`mps`) only.
- On Apple Silicon, warnings like `CUDA not available` and `Disabling autocast` are expected on non-CUDA execution paths.
- Slower generation on a Mac is expected compared with high-end NVIDIA GPUs. To improve speed on Apple Silicon:
  - Ensure the script uses `mps` (as in the example above), not `cpu`.
  - Start from `height=512`, `width=512` and fewer steps (e.g., 20-28) before scaling up.

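As a rough rule of thumb, per-image compute grows at least linearly with pixel count times step count, so small test runs are far cheaper than full-size ones. A minimal sketch of that estimate (a heuristic, not a measured benchmark):

```python
def relative_cost(height, width, steps, base=(1024, 1024, 28)):
    """Crude lower-bound estimate: compute scales with pixels x steps.

    Attention cost actually grows faster than linearly in pixel count,
    so real savings at smaller resolutions are usually even larger.
    """
    base_h, base_w, base_steps = base
    return (height * width * steps) / (base_h * base_w * base_steps)

# A 512x512, 20-step test run vs. a 1024x1024, 28-step final run:
print(f"{relative_cost(512, 512, 20):.2f}")  # → 0.18
```

In other words, a 512×512, 20-step trial costs under a fifth of a full 1024×1024, 28-step render, which makes it a cheap way to iterate on prompts before committing to full resolution.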
## Additional Generated Samples (INT8)

These two images were generated with this quantized model:

### 1) `en_portrait_1024x1024.png`

- **Prompt**: `A cinematic portrait of a young woman standing by the window, golden hour sunlight, shallow depth of field, film grain, ultra-detailed skin texture, photorealistic`

<div align="center"><img src="test_outputs/en_portrait_1024x1024.png" width="512" /></div>

### 2) `cn_scene_1024x1024.png`

- **Prompt**: `一只橘猫趴在堆满旧书的木桌上打盹,午后阳光透过窗帘洒进来,暖色调,胶片风格,细腻毛发纹理,超高清` (English: an orange tabby cat dozing on a wooden desk piled with old books, afternoon sunlight streaming through the curtains, warm tones, film style, fine fur texture, ultra high definition)

<div align="center"><img src="test_outputs/cn_scene_1024x1024.png" width="512" /></div>

## Benchmark & Performance

Test environment:

- GPU: NVIDIA GeForce RTX 5090
- Framework: PyTorch 2.10.0+cu130
- Inference setting: 1024×1024, 50 steps, guidance=4.0, CPU offload enabled
- Cases: 5 prompts (`portrait_01`, `portrait_02`, `landscape_01`, `scene_01`, `night_01`)

### Aggregate Comparison (Baseline vs INT8)

| Metric | Baseline | INT8 | Delta |
|---|---:|---:|---:|
| Avg elapsed / image (s) | 49.0282 | 46.7867 | **-4.6%** |
| Avg sec / step | 0.980564 | 0.935733 | **-4.6%** |
| Avg peak CUDA alloc (GB) | 12.5195 | 7.7470 | **-38.1%** |

> Results may vary across hardware, drivers, and PyTorch/CUDA versions.

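The delta column above follows directly from the raw averages:

```python
def pct_delta(baseline, quantized):
    """Relative change of the INT8 run vs. the baseline, in percent."""
    return (quantized - baseline) / baseline * 100

print(f"{pct_delta(49.0282, 46.7867):.1f}%")  # elapsed / image → -4.6%
print(f"{pct_delta(12.5195, 7.7470):.1f}%")   # peak CUDA alloc → -38.1%
```

The memory reduction is the headline result here; the speed gain is modest because the sampling loop is dominated by compute, not weight loading.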
### Per-Case Results

| Case | Baseline (s) | INT8 (s) | Speedup |
|---|---:|---:|---:|
| portrait_01 | 56.9943 | 50.1124 | 1.14x |
| portrait_02 | 50.3810 | 46.0371 | 1.09x |
| landscape_01 | 46.0286 | 46.0192 | 1.00x |
| scene_01 | 45.9097 | 45.8291 | 1.00x |
| night_01 | 45.8275 | 45.9356 | 1.00x |

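The speedup column is simply baseline time divided by INT8 time, rounded to two decimals (note that `night_01` is actually fractionally slower before rounding):

```python
cases = {
    "portrait_01": (56.9943, 50.1124),
    "portrait_02": (50.3810, 46.0371),
    "landscape_01": (46.0286, 46.0192),
    "scene_01": (45.9097, 45.8291),
    "night_01": (45.8275, 45.9356),
}

for name, (baseline_s, int8_s) in cases.items():
    print(f"{name}: {baseline_s / int8_s:.2f}x")
```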
## Visual Comparison (Baseline vs INT8)

Left: baseline. Right: INT8. (Same prompt, seed, and steps.)

| Case | Baseline | INT8 |
|---|---|---|
| portrait_01 | ![base](zimage_quanto_bench_results/baseline_portrait_01.png) | ![int8](zimage_quanto_bench_results/int8_portrait_01.png) |
| portrait_02 | ![base](zimage_quanto_bench_results/baseline_portrait_02.png) | ![int8](zimage_quanto_bench_results/int8_portrait_02.png) |
| landscape_01 | ![base](zimage_quanto_bench_results/baseline_landscape_01.png) | ![int8](zimage_quanto_bench_results/int8_landscape_01.png) |
| scene_01 | ![base](zimage_quanto_bench_results/baseline_scene_01.png) | ![int8](zimage_quanto_bench_results/int8_scene_01.png) |
| night_01 | ![base](zimage_quanto_bench_results/baseline_night_01.png) | ![int8](zimage_quanto_bench_results/int8_night_01.png) |

## Limitations

- This is **weight-only INT8** quantization; activation precision is unchanged.
- Minor visual differences from the baseline may appear on some prompts.
- `enable_model_cpu_offload()` can change how latency is distributed across pipeline stages.
- For extreme resolutions or very long step counts, validate quality and stability first.

## Intended Use

Recommended for:

- Running Z-Image with lower VRAM usage.
- Improving throughput while keeping quality close to the baseline.

Not recommended as-is for:

- Safety-critical decision workflows.
- High-risk generation use cases without additional review or guardrails.

## Citation

If you use this model, please cite or reference the upstream model and toolchain:

- Tongyi-MAI/Z-Image
- Hugging Face Diffusers
- optimum-quanto