ixim committed
Commit c84bbf4 · verified · 1 parent: 41ceaa1

Update README.md

Files changed (1): README.md (+213, -191)
README.md CHANGED
---
language:
- en
license: other
library_name: diffusers
pipeline_tag: text-to-image
tags:
- text-to-image
- diffusers
- quanto
- int8
- z-image
- transformer-quantization
base_model:
- Tongyi-MAI/Z-Image
---

# Z-Image INT8 (Quanto)

This repository provides an INT8-quantized variant of [Tongyi-MAI/Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image):
- **Only** the `transformer` is quantized with **Quanto weight-only INT8**.
- `text_encoder`, `vae`, `scheduler`, and `tokenizer` remain unchanged.
- The inference API stays compatible with `diffusers.ZImagePipeline`.

> Please follow the original upstream model license and usage terms. `license: other` means this repo inherits upstream licensing constraints.

## Model Details

- **Base model**: `Tongyi-MAI/Z-Image`
- **Quantization method**: `optimum-quanto` (weight-only INT8)
- **Quantized part**: `transformer`
- **Compute dtype**: `bfloat16`
- **Pipeline**: `diffusers.ZImagePipeline`
- **Negative prompt support**: Yes (same pipeline API as the base model)
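
Weight-only INT8 keeps each weight tensor as 8-bit integers plus a floating-point scale, dequantizing at compute time. As a toy illustration of the idea (per-tensor symmetric quantization in plain Python; this is *not* the actual `optimum-quanto` implementation, which has its own calibration and packing):

```python
def quantize_int8(weights):
    """Per-tensor symmetric weight-only INT8 quantization (toy sketch)."""
    scale = max(abs(w) for w in weights) / 127.0  # map max |w| to 127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [qi * scale for qi in q]

weights = [0.4, -1.27, 0.003, 0.9, -0.05]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_err = max(abs(w - a) for w, a in zip(weights, approx))
assert max_err <= scale / 2 + 1e-9
print(f"scale={scale:.5f}, max_err={max_err:.5f}")
```

Because the per-weight rounding error is bounded by `scale / 2`, weight-only INT8 typically stays visually close to the bf16 baseline.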

## Files

Key files in this repository:
- `model_index.json`
- `transformer/diffusion_pytorch_model.safetensors` (INT8-quantized weights)
- `text_encoder/*`, `vae/*`, `scheduler/*`, `tokenizer/*` (not quantized)
- `zimage_quanto_bench_results/*` (benchmark metrics and baseline-vs-INT8 images)
- `test_outputs/*` (generated examples)

## Installation

Python 3.10+ is recommended.

```bash
# Create a virtual environment (optional)
python -m venv .venv

# Windows
.venv\Scripts\activate

# Linux/macOS
# source .venv/bin/activate

python -m pip install --upgrade pip

# PyTorch (NVIDIA CUDA, example)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# PyTorch (macOS / CPU-only, example)
# pip install torch

# Inference dependencies
pip install diffusers transformers accelerate safetensors sentencepiece optimum-quanto pillow

# Recommended minimum versions (helps avoid backend compatibility issues)
pip install -U "torch>=2.4" "diffusers>=0.36.0" "accelerate>=0.33"
```
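
To confirm the version pins above are satisfied before running inference, a stdlib-only sketch can compare installed releases (a hypothetical helper: it looks only at the numeric release segment and ignores pre-release tags, so use `packaging.version` for anything stricter):

```python
from importlib.metadata import version, PackageNotFoundError

def release_tuple(v: str) -> tuple:
    """Leading numeric release segment only, e.g. "2.10.0+cu130" -> (2, 10, 0)."""
    parts = []
    for piece in v.split("+")[0].split("."):
        num = ""
        for ch in piece:
            if ch.isdigit():
                num += ch
            else:
                break
        if not num:
            break
        parts.append(int(num))
    return tuple(parts)

def meets_minimum(pkg: str, minimum: str) -> bool:
    """True if the installed package's release is at least `minimum`."""
    try:
        return release_tuple(version(pkg)) >= release_tuple(minimum)
    except PackageNotFoundError:
        return False

for pkg, minimum in [("torch", "2.4"), ("diffusers", "0.36.0"), ("accelerate", "0.33")]:
    status = "OK" if meets_minimum(pkg, minimum) else f"missing or < {minimum}"
    print(f"{pkg}: {status}")
```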

## Quick Start (Diffusers)

This repo already stores quantized weights, so you do **not** need to re-run quantization during loading.

```python
import torch
from diffusers import ZImagePipeline

model_id = "ixim/Z-Image-INT8"

if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16
elif torch.backends.mps.is_available():
    # Apple Silicon
    device = "mps"
    dtype = torch.float16
else:
    # Intel Mac / CPU-only
    device = "cpu"
    dtype = torch.float32

pipe = ZImagePipeline.from_pretrained(
    model_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
)

if device == "cuda":
    pipe.enable_model_cpu_offload()
else:
    pipe = pipe.to(device)

prompt = "A cinematic portrait of a young woman, soft lighting, high detail"
negative_prompt = "blurry, low quality, distorted face, extra limbs, artifacts"
# Use a CPU generator for best cross-device compatibility (cpu/mps/cuda)
generator = torch.Generator(device="cpu").manual_seed(42)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=4.0,
    generator=generator,
).images[0]

image.save("zimage_int8_sample.png")
print("Saved: zimage_int8_sample.png")
```
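
Weight-only INT8 stores one byte per transformer weight instead of two for `bfloat16`, so the weight-memory saving on the quantized component is about 50% by construction. A back-of-the-envelope sketch (`num_params` is a placeholder, not the actual Z-Image transformer parameter count):

```python
def weight_bytes(num_params: int, bytes_per_weight: int) -> int:
    """Storage for the weight tensors alone, ignoring activations and buffers."""
    return num_params * bytes_per_weight

num_params = 6_000_000_000  # placeholder, not Z-Image's actual size

bf16 = weight_bytes(num_params, 2)  # bfloat16: 2 bytes per weight
int8 = weight_bytes(num_params, 1)  # INT8:     1 byte per weight

print(f"bf16 weights: {bf16 / 1024**3:.2f} GiB")
print(f"int8 weights: {int8 / 1024**3:.2f} GiB")
print(f"weight-only saving: {1 - int8 / bf16:.0%}")
```

Measured peak allocations shrink by less than 50% (see the benchmark section) because activations and the unquantized components still use their original precision.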

## macOS Notes & Troubleshooting

- `AttributeError: module 'torch' has no attribute 'xpu'` is usually a backend/version compatibility issue in the local environment, not a model issue.
- Fix it by upgrading to recent versions:
  - `pip install -U "torch>=2.4" "diffusers>=0.36.0" "accelerate>=0.33"`
- On Apple Silicon, warnings like `CUDA not available` and `Disabling autocast` are expected on non-CUDA execution paths.
- Generation on a Mac is expected to be slower than on a high-end NVIDIA GPU. To improve speed on Apple Silicon:
  - Make sure the script selects `mps` (as in the example above), not `cpu`.
  - Start at `height=512`, `width=512` with fewer steps (e.g., 20-28) before scaling up.

## Additional Generated Samples (INT8)

These two images were generated with this quantized model:

### 1) `en_portrait_1024x1024.png`

- **Prompt**: `A cinematic portrait of a young woman standing by the window, golden hour sunlight, shallow depth of field, film grain, ultra-detailed skin texture, photorealistic`

<div align="center"><img src="test_outputs/en_portrait_1024x1024.png" width="512" /></div>

### 2) `cn_scene_1024x1024.png`

- **Prompt**: `一只橘猫趴在堆满旧书的木桌上打盹,午后阳光透过窗帘洒进来,暖色调,胶片风格,细腻毛发纹理,超高清` (English: an orange tabby cat dozing on a wooden desk piled with old books, afternoon sunlight filtering through the curtains, warm tones, film style, finely detailed fur, ultra high definition)

<div align="center"><img src="test_outputs/cn_scene_1024x1024.png" width="512" /></div>

## Benchmark & Performance

Test environment:
- GPU: NVIDIA GeForce RTX 5090
- Framework: PyTorch 2.10.0+cu130
- Inference settings: 1024×1024, 28 steps, guidance=4.0, CPU offload enabled
- Cases: 4 prompts (`portrait_01`, `portrait_02`, `scene_01`, `night_01`)

### Aggregate Comparison (Baseline vs INT8)

| Metric | Baseline | INT8 | Delta |
|---|---:|---:|---:|
| Avg elapsed / image (s) | 51.7766 | 39.5662 | **-23.6%** |
| Avg sec / step | 1.8492 | 1.4131 | **-23.6%** |
| Avg peak CUDA alloc (GB) | 12.5195 | 7.7470 | **-38.1%** |

> Results may vary across hardware, drivers, and PyTorch/CUDA versions.

### Per-Case Results

| Case | Baseline (s) | INT8 (s) | Speedup |
|---|---:|---:|---:|
| portrait_01 | 99.9223 | 60.6768 | 1.65x |
| portrait_02 | 37.4116 | 32.8863 | 1.14x |
| scene_01 | 34.9946 | 32.2035 | 1.09x |
| night_01 | 34.7780 | 32.4981 | 1.07x |
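
As a sanity check, the reported deltas and speedups follow directly from the raw table values:

```python
def pct_delta(baseline: float, quantized: float) -> float:
    """Percentage change vs baseline; negative means the INT8 run is cheaper."""
    return round((quantized / baseline - 1) * 100, 1)

def speedup(baseline: float, quantized: float) -> float:
    """Wall-clock speedup factor of INT8 over baseline."""
    return round(baseline / quantized, 2)

print(pct_delta(51.7766, 39.5662))  # -23.6  (avg elapsed / image)
print(pct_delta(12.5195, 7.7470))   # -38.1  (avg peak CUDA alloc)
print(speedup(99.9223, 60.6768))    # 1.65   (portrait_01)
```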

## Visual Comparison (Baseline vs INT8)

Left: Baseline. Right: INT8. (Same prompt, seed, and steps.)

| Case | Baseline | INT8 |
|---|---|---|
| portrait_01 | ![](zimage_quanto_bench_results/images/baseline/portrait_01_seed46.png) | ![](zimage_quanto_bench_results/images/int8/portrait_01_seed46.png) |
| portrait_02 | ![](zimage_quanto_bench_results/images/baseline/portrait_02_seed123.png) | ![](zimage_quanto_bench_results/images/int8/portrait_02_seed123.png) |
| scene_01 | ![](zimage_quanto_bench_results/images/baseline/scene_01_seed777.png) | ![](zimage_quanto_bench_results/images/int8/scene_01_seed777.png) |
| night_01 | ![](zimage_quanto_bench_results/images/baseline/night_01_seed2026.png) | ![](zimage_quanto_bench_results/images/int8/night_01_seed2026.png) |

## Limitations

- This is **weight-only INT8** quantization; activation precision is unchanged.
- Minor visual differences may appear on some prompts.
- `enable_model_cpu_offload()` can change the latency distribution across pipeline stages.
- At extreme resolutions or very long step counts, validate quality and stability before relying on the output.

## Intended Use

Recommended for:
- Running Z-Image with lower VRAM usage.
- Improving throughput while keeping quality close to baseline.

Not recommended as-is for:
- Safety-critical decision workflows.
- High-risk generation use cases without additional review/guardrails.

## Citation

If you use this model, please cite/reference the upstream model and toolchain:
- Tongyi-MAI/Z-Image
- Hugging Face Diffusers
- optimum-quanto