yongqiang committed
Commit 7c12e87 · 1 Parent(s): 5f3ba5a

update readme and add a new config
README.md CHANGED
---
license: bsd-3-clause
---

# Z-Image-Turbo on AXERA AX650N

This project provides a complete implementation for deploying the Z-Image-Turbo diffusion model on the AXERA AX650N NPU. Z-Image-Turbo is a high-performance text-to-image generation model that uses diffusion techniques to produce high-quality images with fast inference.

## Table of Contents

- [Overview](#overview)
- [Requirements](#requirements)
- [Project Structure](#project-structure)
- [Model Components](#model-components)
  - [1. Transformer Module](#1-transformer-module)
  - [2. VAE Decoder Module](#2-vae-decoder-module)
- [Complete Inference Pipeline](#complete-inference-pipeline)
- [Advanced Usage](#advanced-usage)
- [Technical Support](#technical-support)

## Overview

The Z-Image-Turbo model consists of three main components:

1. **Text Encoder**: Converts text prompts into embeddings
2. **Transformer**: Core diffusion model that processes latent representations
3. **VAE (Variational Autoencoder)**: Encodes/decodes between pixel space and latent space

### Deployment Strategy

The deployment architecture is tailored to the AXERA AX650N with the following design decisions:

- **Text Encoder**: Currently runs in PyTorch for simplicity and faster development iteration. This component uses the Qwen3 model and can be converted to axmodel format in a future release for end-to-end NPU acceleration.
- **Transformer**: Fully converted to axmodel format and runs on the NPU through model partitioning and subgraph optimization.
- **VAE**: Both encoder and decoder are converted to axmodel format, enabling fast image encoding and decoding on the NPU.
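Conceptually, one generation run wires the three components together as below. This is an illustrative dataflow sketch with stand-in stub functions, not the project's actual API:

```python
# Illustrative dataflow sketch: stand-in stubs, not the real launcher API.

def encode_text(prompt: str) -> list[float]:
    # Stand-in for the PyTorch Qwen3 text encoder.
    return [float(len(prompt))]

def denoise(latent: list[float], text_emb: list[float], steps: int = 9) -> list[float]:
    # Stand-in for the NPU transformer: iteratively refines the latent.
    for _ in range(steps):
        latent = [0.5 * x + 0.1 * e for x, e in zip(latent, text_emb)]
    return latent

def vae_decode(latent: list[float]) -> list[float]:
    # Stand-in for the NPU VAE decoder: maps latent values to pixel range.
    return [min(max(x, 0.0), 1.0) for x in latent]

image = vae_decode(denoise([1.0], encode_text("a mountain sunrise")))
print(len(image))  # one "pixel" in this toy sketch
```

The real pipeline moves tensors rather than Python lists, but the text → latent → pixel flow is the same.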

## Requirements

This project requires the following Python environment and dependencies:

```sh
Python        3.9.20
torch         2.7.0
torchvision   0.22.0
transformers  4.53.1
diffusers     0.32.1
```

**Additional Dependencies:**

- ONNX Runtime (for ONNX model inference and validation)
- onnxslim (for ONNX model optimization)
- numpy (for numerical operations and calibration data handling)
- Pulsar2 toolchain (for AXERA AX650N model compilation)

**Hardware Requirements:**

- AXERA AX650N development board for deployment
- x86/ARM Linux host for model conversion and compilation
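A quick, stdlib-only sanity check for the interpreter version (package versions are best verified with `pip list`):

```python
import sys

# The project was validated on Python 3.9; warn on older interpreters.
ok = sys.version_info >= (3, 9)
print("Python version OK" if ok else "Python >= 3.9 required")
```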

## Project Structure

```sh
Z-Image-Turbo/
├── original_onnx/           # Exported ONNX models (original format)
│   ├── vae_decoder_simp_slim.onnx
│   ├── vae_encoder_simp_slim.onnx
│   └── z_image_transformer_body_only_simp_slim.onnx
├── text_encoder_axmodel/    # Text encoder models in axmodel format
│   ├── model.embed_tokens.weight.npy
│   ├── qwen3_p128_l0_together.axmodel
│   ├── qwen3_p128_l1_together.axmodel
│   └── ... (36 layer models for Qwen3)
├── transformer_axmodel/     # Transformer subgraph models in axmodel format
│   ├── auto_00_model_layers_29_Add_4_output_0_to_sample_auto.axmodel
│   ├── cfg_00_timestep_to_model_t_embedder_mlp_mlp_2_Gemm_output_0_config.axmodel
│   └── ... (compiled subgraph models)
├── transformer_onnx/        # Transformer models in ONNX format
├── vae_model/               # VAE models (ONNX and axmodel formats)
├── VideoX-Fun/              # Main conversion and inference code
└── README.md                # This documentation
```

## Model Components

### 1. Transformer Module

The transformer module is the core of the diffusion process: it iteratively refines latent representations to generate high-quality images from noise. Because of the model's size and complexity, we use a subgraph partitioning strategy to deploy it on the AX650N NPU.

#### Step 1: Export to ONNX Format

First, export the transformer model to ONNX format (without ControlNet support):

```sh
python scripts/z_image/export_transformer_body_onnx.py \
    --output onnx-models-512x512/z_image_transformer_body_only_512x512.onnx \
    --height 512 --width 512 --sequence-length 128 \
    --latent-downsample-factor 8 \
    --dtype fp32 \
    --skip-slim
```

**Parameters:**
- `--output`: Output path for the ONNX model
- `--height`, `--width`: Target image dimensions (512x512)
- `--sequence-length`: Maximum sequence length for text embeddings (128 tokens)
- `--latent-downsample-factor`: VAE downsample factor (8x)
- `--dtype`: Data type (fp32 for highest accuracy)
- `--skip-slim`: Skip ONNX simplification (optional)

> **Note:** If you omit `--skip-slim`, the model is simplified automatically and the output is named `z_image_transformer_body_only_512x512_simp_slim.onnx`.

#### Step 2: Collect Calibration Data

Collect a calibration dataset from the original model for quantization. This step generates representative input data used during the quantization process:

```sh
python ./examples/z_image_fun/collect_onnx_inputs.py \
    --model_name models/Diffusion_Transformer/Z-Image-Turbo/ \
    --output_dir transformer_body_only_512x512_simp_slim/calibration \
    --height 512 --width 512 \
    --max_sequence_length 128
```

This command runs the model with various prompts and diffusion steps, capturing the actual input distributions the model will encounter during inference.
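The calibration files used later in this guide follow a `prompt`/`step` naming pattern (e.g. `transformer_inputs_prompt000_step00.npy`). A small sketch of that enumeration; the zero-padding widths are assumptions inferred from the sample path:

```python
# Enumerate calibration filenames for a few prompts and denoising steps.
# Padding widths (3 for prompt, 2 for step) are inferred from the sample path.
def calib_filenames(num_prompts: int, num_steps: int) -> list[str]:
    return [
        f"transformer_inputs_prompt{p:03d}_step{s:02d}.npy"
        for p in range(num_prompts)
        for s in range(num_steps)
    ]

names = calib_filenames(num_prompts=2, num_steps=2)
print(names[0])  # transformer_inputs_prompt000_step00.npy
```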

#### Step 3: Split ONNX Model into Subgraphs

Split the monolithic ONNX model into multiple subgraphs for better memory management and compilation optimization:

```sh
python ./scripts/split_onnx_by_subconfig.py \
    --model ./onnx-models-512x512/z_image_transformer_body_only_512x512_simp_slim.onnx \
    --config ./pulsar2_configs/transformers_subgraph_512x512.json \
    --output-dir ./transformers_body_only_512_512_split_onnx \
    --verify \
    --input-data ./transformer_body_only_512x512_simp_slim/calibration/transformer_inputs_prompt000_step00.npy \
    --providers CPUExecutionProvider
```

The subgraph configuration file (`transformers_subgraph_512x512.json`) defines the splitting strategy: how the model is partitioned into smaller, manageable pieces that fit within the NPU's constraints.
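The `--verify` flag checks that chaining the subgraphs reproduces the full model's outputs. The underlying comparison is essentially an element-wise tolerance check, sketched here in plain Python (the actual script presumably uses numpy's equivalent):

```python
# Element-wise closeness check, in the spirit of numpy.allclose.
def allclose(a: list[float], b: list[float],
             rtol: float = 1e-5, atol: float = 1e-6) -> bool:
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

full_model_out = [0.1234567, -2.5, 3.0]
chained_subgraph_out = [0.1234568, -2.5, 3.0]
print(allclose(full_model_out, chained_subgraph_out))  # True
```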

#### Step 4: Collect Subgraph Calibration Data

After splitting, collect calibration data for each individual subgraph:

```sh
python examples/z_image_fun/collect_subgraph_inputs.py \
    --onnx ./onnx-models-512x512/z_image_transformer_body_only_512x512_simp_slim.onnx \
    --subgraph-config ./pulsar2_configs/transformers_subgraph_512x512.json \
    --output-dir ./transformer_body_only_512x512_simp_slim/subgraph-calib \
    --tar-list-file ./transformer_body_only_512x512_simp_slim/subgraph-calib/paths.txt \
    --skip-existing
```

To collect additional calibration data at a different resolution (for example, 1728x992):

```sh
python examples/z_image_fun/collect_subgraph_inputs.py \
    --onnx ./onnx-models-1728x992/z_image_transformer_body_only_1728x992_simp_slim.onnx \
    --subgraph-config ./pulsar2_configs/transformers_subgraph_1728x992.json \
    --output-dir ./transformer_body_only_1728x992_simp_slim/subgraph-calib \
    --tar-list-file ./transformer_body_only_1728x992_simp_slim/subgraph-calib/paths.txt \
    --sample-size 1728 992 \
    --max-seq-len 256
```

#### Step 5: Generate Compilation Configuration Files

Automatically generate an individual compilation configuration file for each subgraph:

```sh
python ./scripts/generate_subgraph_configs.py \
    --tar-list-file ./transformer_body_only_512x512_simp_slim/subgraph-calib/paths.txt \
    --output-config-dir pulsar2_configs/subgraphs_512x512
```

This step creates a tailored configuration file per subgraph, specifying quantization settings, calibration data paths, and compilation options.

> **Important:** After generating the sub-ONNX files, apply ONNX simplification (`onnxslim`) to each subgraph for optimal performance.
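One way to batch that simplification step is to generate one `onnxslim <in> <out>` command per split ONNX file. This sketch only builds and prints the commands (a dry run); the `_slim` output suffix is an assumption, adjust it to your own naming:

```python
from pathlib import Path

def slim_commands(onnx_dir: str) -> list[str]:
    # Build one `onnxslim <input> <output>` command per subgraph ONNX file.
    # The `_slim` output suffix is an assumption; adjust to taste.
    return [
        f"onnxslim {f} {f.with_name(f.stem + '_slim.onnx')}"
        for f in sorted(Path(onnx_dir).glob("*.onnx"))
    ]

# Dry run: print the commands instead of executing them.
for cmd in slim_commands("./transformers_body_only_512_512_split_onnx"):
    print(cmd)
```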

#### Step 6: Compile All Subgraphs

Compile all subgraphs using the Pulsar2 toolchain:

```sh
./compile_all_subgraphs.sh \
    --onnx-dir ./transformers_body_only_512_512_split_onnx \
    --config-dir pulsar2_configs/subgraphs_512x512 \
    --output-base-dir ./compiled_transformers_body_only_512x512/out_all \
    --final-output-dir ./compiled_transformers_body_only_512x512/out_final
```

**Output Directories:**
- `out_all`: Compilation logs and intermediate files for all subgraphs
- `out_final`: Only the successfully compiled axmodel files, ready for deployment

The compilation process converts each ONNX subgraph into an optimized axmodel that runs efficiently on the AX650N NPU.

### 2. VAE Decoder Module

The Variational Autoencoder (VAE) converts between the latent-space representation and pixel space. The decoder takes the denoised latent from the transformer and produces the final RGB image.

#### Step 1: Export VAE to ONNX Format

Export both the VAE encoder and decoder to ONNX format:

```sh
python scripts/z_image_fun/export_vae_onnx.py \
    --model-root models/Diffusion_Transformer/Z-Image-Turbo/ \
    --height 512 --width 512 \
    --encoder-output onnx-models-512x512/vae_encoder.onnx \
    --decoder-output onnx-models-512x512/vae_decoder.onnx \
    --dtype fp32 \
    --save-calib-inputs \
    --calib-dir onnx-calibration-512x512 \
    --skip-ort-check
```

**Parameters:**
- `--model-root`: Path to the Z-Image-Turbo model
- `--encoder-output`, `--decoder-output`: Output paths for the encoder and decoder ONNX models
- `--save-calib-inputs`: Save calibration inputs for quantization
- `--calib-dir`: Directory for the calibration data
- `--skip-ort-check`: Skip ONNX Runtime validation (useful when ORT has compatibility issues)

#### Step 2: Create Compilation Configuration

Create a configuration file for the VAE decoder compilation, e.g. `pulsar2_configs/vae_decoder.json`.

This configuration should specify:
- Input/output tensor names and shapes
- Quantization strategy (e.g., int8, mixed precision)
- Calibration data paths
- Hardware target (AX650)
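A minimal sketch of what such a configuration might look like, modeled on the transformer subgraph config added in this commit; the calibration tarball path, calibration size, and U16 precision are assumptions to adapt to your setup:

```json
{
  "model_type": "ONNX",
  "npu_mode": "NPU3",
  "quant": {
    "input_configs": [
      {
        "tensor_name": "DEFAULT",
        "calibration_dataset": "./onnx-calibration-512x512/vae_decoder.tar",
        "calibration_size": 4,
        "calibration_format": "NumpyObject"
      }
    ],
    "calibration_method": "MinMax",
    "layer_configs": [
      {
        "start_tensor_names": ["DEFAULT"],
        "end_tensor_names": ["DEFAULT"],
        "data_type": "U16"
      }
    ]
  }
}
```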
235
+ #### Step 3: Compile VAE Decoder
236
+
237
+ Compile the ONNX model to axmodel format using Pulsar2:
238
+
239
+ ```sh
240
+ pulsar2 build \
241
+ --output_dir ./compiled_output_vae_decoder \
242
+ --config pulsar2_configs/vae_decoder.json \
243
+ --npu_mode NPU3 \
244
+ --input onnx-models/vae_decoder_simp_slim.onnx \
245
+ --target_hardware AX650
246
+ ```
247
+
248
+ **Parameters:**
249
+ - `--output_dir`: Output directory for compiled models
250
+ - `--config`: Path to the compilation configuration file
251
+ - `--npu_mode`: NPU mode (NPU3 for maximum performance on AX650N)
252
+ - `--target_hardware`: Target hardware platform (AX650)
253
+
254
+ The compiled VAE decoder will be saved in the output directory and can be deployed to the AX650N board.
255
+
256
+ ## Complete Inference Pipeline
257
+
258
+ After compiling all components, you can run the complete text-to-image inference pipeline on the AXERA AX650N development board.
259
+
260
+ ### Running on the Development Board
261
+
262
+ 1. Transfer all compiled axmodel files to the development board
263
+ 2. Ensure all dependencies are installed
264
+ 3. Run the inference script:
265
+
266
+ ```sh
267
+ python3 examples/z_image_fun/launcher_axmodel.py \
268
+ --transformer-config pulsar2_configs/transformers_subgraph.json \
269
+ --transformer-subgraph-dir ../transformer_axmodel \
270
+ --vae-axmodel ../vae_model/vae_decoder.axmodel
271
+ ```
272
+
273
+ **Parameters:**
274
+ - `--transformer-config`: Configuration file that defines the subgraph structure
275
+ - `--transformer-subgraph-dir`: Directory containing all compiled transformer subgraph axmodels
276
+ - `--vae-axmodel`: Path to the compiled VAE decoder axmodel
277
+
278
+ The launcher script will:
279
+ 1. Load the text encoder (PyTorch)
280
+ 2. Process input prompts into embeddings
281
+ 3. Run the transformer subgraphs sequentially on NPU
282
+ 4. Decode the latent representation using VAE decoder on NPU
283
+ 5. Output the final generated image
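Running the subgraphs sequentially amounts to threading one activation tensor through an ordered list of subgraph sessions. An illustrative stdlib-only sketch with stand-in callables (not the actual launcher code):

```python
from typing import Callable

# Stand-ins for compiled subgraph sessions; each consumes the previous
# subgraph's output, mirroring the cfg_00 -> cfg_01 -> ... -> cfg_32 chain.
def make_subgraph(scale: float) -> Callable[[list[float]], list[float]]:
    return lambda x: [scale * v for v in x]

subgraphs = [make_subgraph(1.0 + i / 100) for i in range(33)]

def run_transformer_step(latent: list[float]) -> list[float]:
    # Thread the activation through every subgraph in order.
    for session in subgraphs:
        latent = session(latent)
    return latent

out = run_transformer_step([1.0])
print(len(out))  # still a single-element "latent" in this toy sketch
```

In the real launcher each "session" is an axmodel loaded on the NPU, and this chain runs once per denoising step.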

### Example Output

Here is an example of the inference process running on the AX650N development board:

```sh
root@ax650 Z-Image-Turbo/VideoX-Fun $ python3 examples/z_image_fun/launcher_axmodel.py \
    --transformer-config pulsar2_configs/transformers_subgraph.json \
    --transformer-subgraph-dir ../transformer_axmodel \
    --vae-axmodel ../vae_model/vae_decoder.axmodel

[INFO] Available providers: ['AxEngineExecutionProvider']
/root/yongqiang/push_hugging_face/Z-Image-Turbo/VideoX-Fun/videox_fun/dist/wan_xfuser.py:22: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  @amp.autocast(enabled=False)
...
/root/yongqiang/push_hugging_face/Z-Image-Turbo/VideoX-Fun/videox_fun/models/wan_audio_injector.py:114: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  @amp.autocast(enabled=False)
/root/yongqiang/push_hugging_face/Z-Image-Turbo/VideoX-Fun/videox_fun/models/wan_transformer3d_s2v.py:55: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  @amp.autocast(enabled=False)
2026-01-15 15:55:55.577 | INFO | __main__:main:425 - Prompt in use: sunrise over alpine mountains, low clouds in valleys, god rays, ultra-detailed landscape
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|████████| 3/3 [00:01<00:00,  2.26it/s]
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Chip type: ChipType.MC50
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.12.0s
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 5c5e711b-dirty
AX Denoising:   0%|          | 0/9 [00:00<?, ?it/s][INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 5c5e711b-dirty
2026-01-15 15:58:44.111 | INFO | __main__:_get_session:301 - Loading subgraph session: cfg_00 from cfg_00_timestep_to_model_t_embedder_mlp_mlp_2_Gemm_output_0_config.axmodel
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 5c5e711b-dirty
2026-01-15 15:58:48.882 | INFO | __main__:_get_session:301 - Loading subgraph session: cfg_01 from cfg_01_prompt_embeds_to_model_Slice_1_output_0_config.axmodel
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty 5c5e711b-dirty
...
2026-01-15 16:00:08.612 | INFO | __main__:_get_session:301 - Loading subgraph session: cfg_30 from cfg_30_model_layers_26_Add_4_output_0_to_model_layers_27_Add_4_output_0_config.axmodel
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1 5c5e711b
2026-01-15 16:00:11.179 | INFO | __main__:_get_session:301 - Loading subgraph session: cfg_31 from cfg_31_model_layers_27_Add_4_output_0_to_model_layers_28_Add_4_output_0_config.axmodel
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1 5c5e711b
2026-01-15 16:00:13.868 | INFO | __main__:_get_session:301 - Loading subgraph session: cfg_32 from cfg_32_model_layers_28_Add_4_output_0_to_model_layers_29_Add_4_output_0_config.axmodel
AX Denoising:  22%|██▏       | 2/9 [01:36<04:45, 40.84s/it]AX Denoising: 100%|██████████| 9/9 [02:20<00:00, 15.60s/it]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1 5c5e711b
2026-01-15 16:01:06.972 | INFO | __main__:main:537 - AXModel inference finished; result saved to /root/yongqiang/push_hugging_face/Z-Image-Turbo/VideoX-Fun/samples/z-image-t2i-axmodel/z_image_axmodel_2.png
```

This run demonstrates the complete pipeline working on the hardware, including:
- Model loading and initialization (~3 minutes for all 33 subgraphs)
- Denoising iterations (9 steps, ~2 minutes 20 seconds total)
- Final image generation and saving

### Known Limitations

**Quantization Accuracy**: Due to quantization precision limits, axmodel inference results differ somewhat from the original ONNX model outputs. This is a trade-off between inference speed and numerical precision when deploying on NPU hardware. Future work may include:
- Fine-tuning quantization parameters to improve accuracy
- Exploring mixed-precision quantization strategies
- Calibrating with more diverse datasets
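To quantify the drift between ONNX and axmodel outputs, a simple per-tensor metric such as cosine similarity is often used; a stdlib-only sketch (the sample values below are made up for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # 1.0 means identical direction; values noticeably below 1.0
    # signal quantization degradation worth investigating.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

onnx_out = [0.50, -1.20, 0.33]      # hypothetical float reference
axmodel_out = [0.49, -1.21, 0.35]   # hypothetical quantized output
print(round(cosine_similarity(onnx_out, axmodel_out), 4))
```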

## Advanced Usage

### Frontend-Only Export for Graph Analysis

For debugging and graph analysis, you can export only the frontend graph without compiling:

```sh
ENABLE_COMPILER=0 DUMP_FRONTEND_GRAPH=1 \
pulsar2 build \
    --output_dir ./compiled_output_trans_body_only_frontend \
    --config pulsar2_configs/config_controlnet.json \
    --npu_mode NPU3 \
    --input ../original_onnx/z_image_transformer_body_only_simp_slim.onnx \
    --target_hardware AX650
```

This is useful for:
- Analyzing the graph structure before compilation
- Debugging subgraph partitioning strategies
- Verifying model transformations
374
+ ### Compile from Quantized ONNX
375
+
376
+ If you already have a quantized ONNX model, you can compile it directly:
377
+
378
+ ```sh
379
+ pulsar2 build \
380
+ --input compiled_output_trans_body_only_use_calibration/quant/quant_axmodel.onnx \
381
+ --model_type QuantAxModel \
382
+ --output_dir compiled_subgraph_from_quant_onnx \
383
+ --output_name transformers.axmodel \
384
+ --config pulsar2_configs/transformers_subgraph.json \
385
+ --target_hardware AX650 \
386
+ --npu_mode NPU3
387
+ ```

## Technical Support

If you encounter issues or have questions about the implementation:

- **GitHub Issues**: [Create an issue](https://github.com/AXERA-TECH) for bug reports and feature requests
- **QQ Group**: 139953715 (Chinese community support)

## License

This project is licensed under the BSD-3-Clause License. See the LICENSE file for details.

---

**Note:** This implementation is optimized for AXERA AX650N hardware. Performance and compatibility may vary on other platforms.
VideoX-Fun/pulsar2_configs/transformers_subgraph_512x512.json ADDED

```json
{
  "model_type": "ONNX",
  "npu_mode": "NPU3",
  "quant": {
    "input_configs": [
      {
        "tensor_name": "DEFAULT",
        "calibration_dataset": "./onnx-calibration-no-controlnet/transformer.tar",
        "calibration_size": 4,
        "calibration_format": "NumpyObject"
      }
    ],
    "calibration_method": "MinMax",
    "precision_analysis": true,
    "precision_analysis_method": "EndToEnd",
    "layer_configs": [
      {
        "start_tensor_names": ["DEFAULT"],
        "end_tensor_names": ["DEFAULT"],
        "data_type": "U16"
      }
    ]
  },
  "input_processors": [
    {
      "tensor_name": "DEFAULT",
      "tensor_format": "AutoColorSpace",
      "tensor_layout": "NCHW"
    }
  ],
  "compiler": {
    "check": 0,
    "sub_configs": [
      {
        "start_tensor_names": ["timestep"],
        "end_tensor_names": ["/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["prompt_embeds"],
        "end_tensor_names": ["/model/Slice_1_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["latent_model_input", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/Slice_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/Slice_1_output_0", "/model/Slice_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.0/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.0/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.1/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.1/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.2/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.2/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.3/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.3/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.4/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.4/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.5/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.5/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.6/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.6/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.7/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.7/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.8/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.8/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.9/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.9/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.10/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.10/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.11/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.11/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.12/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.12/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.13/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.13/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.14/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.14/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.15/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.15/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.16/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.16/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.17/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.17/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.18/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.18/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.19/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.19/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.20/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.20/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.21/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.21/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.22/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.22/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.23/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.23/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.24/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.24/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.25/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.25/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.26/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.26/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.27/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.27/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.28/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      },
      {
        "start_tensor_names": ["/model/layers.28/Add_4_output_0", "/model/t_embedder/mlp/mlp.2/Gemm_output_0"],
        "end_tensor_names": ["/model/layers.29/Add_4_output_0"],
        "check_mode": "CheckPerLayer"
      }
    ]
  }
}
```