InsecureErasure committed on
Commit a99f414 · verified · 1 Parent(s): 72048db

Update README.md

Files changed (1)
  1. README.md +140 -0
README.md CHANGED
@@ -1,3 +1,143 @@
1
  ---
2
  license: apache-2.0
3
  ---
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - en
5
+ - zh
6
+ base_model:
7
+ - Tongyi-MAI/Z-Image-Turbo
8
+ base_model_relation: quantized
9
+ pipeline_tag: text-to-image
10
+ library_name: diffusers
11
+ tags:
12
+ - comfyui
13
+ - quantization
14
+ - nvfp4
15
+ - txt2img
16
  ---
17
+
18
+ # Z-Image Turbo: NVFP4 Mixed-Precision
19
+
20
+ Surgical mixed-precision quantization of [Z-Image Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) (6B S3-DiT), generated with [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant).
21
+
22
+ **Formats**: NVFP4 (baseline) + MXFP8 (sensitive layers) + BF16 (critical layers).
23
+ **Size**: 4.84 GB (−58% vs BF16).
24
+ **Inference**: ComfyUI + [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200).
25
+
26
+ Also available: [MXFP8 uniform quantization](https://huggingface.co/InsecureErasure/Z-Image-Turbo-MXFP8) (6.23 GB, near-lossless, simpler).
27
+
28
+ ---
29
+
30
+ ## Strategy
31
+
32
+ Uses per-layer sensitivity analysis via [`quant_probe`](https://github.com/insecure-erasure/quant_probe) and the DiT quantization literature (PTQ4DiT, ViDiT-Q, SemanticDialect, SVDQuant) to maximize quality-per-byte:
33
+
34
+ - **~190 tensors → NVFP4** (4-bit E2M1): baseline for most attention + FF weights
35
+ - **~100 tensors → MXFP8** (8-bit E4M3 + E8M0): attention outputs, gate projections (w1), mid-block adaLN
36
+ - **~20 tensors → BF16**: last QKV, late adaLN modulations, refiner outputs
37
+ - **~110 tensors → BF16**: norms, biases, embeddings (auto-excluded by `--zimage`)
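
As a rough guide to what each format costs on disk, the sketch below computes effective bits per weight including block-scale overhead. It assumes the standard block layouts (NVFP4: 16 E2M1 values sharing one 8-bit E4M3 scale; MXFP8: 32 E4M3 values sharing one 8-bit E8M0 scale) and ignores per-tensor scales and file metadata. The names `FORMATS`, `bits_per_weight`, and `tensor_bytes` are illustrative and not part of `convert_to_quant`.

```python
# Effective storage cost per weight, including block-scale overhead.
# Assumption: standard NVFP4 / MXFP8 block layouts; per-tensor scales
# and safetensors metadata are ignored.
FORMATS = {
    "nvfp4": {"value_bits": 4, "scale_bits": 8, "block": 16},  # E2M1 values, E4M3 scale
    "mxfp8": {"value_bits": 8, "scale_bits": 8, "block": 32},  # E4M3 values, E8M0 scale
    "bf16": {"value_bits": 16, "scale_bits": 0, "block": 1},   # no block scale
}


def bits_per_weight(fmt: str) -> float:
    """Value bits plus the block scale amortized over the block."""
    spec = FORMATS[fmt]
    return spec["value_bits"] + spec["scale_bits"] / spec["block"]


def tensor_bytes(num_weights: int, fmt: str) -> float:
    """Approximate storage for a tensor with `num_weights` elements."""
    return num_weights * bits_per_weight(fmt) / 8


if __name__ == "__main__":
    for name in FORMATS:
        print(f"{name}: {bits_per_weight(name):.2f} bits/weight")
```

Under these assumptions, promoting a tensor from NVFP4 to MXFP8 costs a little under twice the storage, while keeping it in BF16 costs roughly 3.6 times as much. That is why BF16 is reserved for only about 20 tensors.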
38
+
39
+ ### MXFP8-protected layers
40
+
41
+ | Category | Blocks | Layers |
42
+ |---|---|---|
43
+ | Early attention outputs | 0, 1 | `attention.out` |
44
+ | Selected QKV projections | 10, 16, 26, 27, 28 | `attention.qkv` |
45
+ | Attention outputs | 3, 6, 9, 11–14, 19, 20, 26–29 | `attention.out` |
46
+ | Gate projections (w1) | 3–29 | `feed_forward.w1` |
47
+ | Mid-block modulations | 16–21 | `adaLN_modulation.0` |
48
+
49
+ ### BF16-protected layers
50
+
51
+ | Category | Layers | Reason |
52
+ |---|---|---|
53
+ | Last QKV | `layers.29.attention.qkv` | Feeds directly into `final_layer`, so there is no downstream compensation |
54
+ | Late modulations | `layers.(22–29).adaLN_modulation.0` | Controls scale/shift of features near output |
55
+ | Refiner attention outputs | `context_refiner.(0\|1).attention.out` | Only 2 refiner blocks, so outputs have outsized impact |
56
+ | Selected refiner FF | `context_refiner.1.w2`, `noise_refiner.1.{qkv,out,w2}` | Critical single-block projections |
57
+ | Refiner up-projections | `noise_refiner.(0\|1).w3` | Noise refiner w3 expands features → direct output |
58
+
59
+ ### Refiner sub-graphs
60
+
61
+ | Sub-graph | Block 0 | Block 1 |
62
+ |---|---|---|
63
+ | `context_refiner` | qkv + w1 + w2 + w3 MXFP8, out BF16 | qkv + w1 + w3 MXFP8, out + w2 BF16 |
64
+ | `noise_refiner` | qkv + out + w1 + w2 MXFP8, w3 BF16 | qkv + out + w2 + w3 BF16, w1 MXFP8 |
65
+
66
+ ---
67
+
68
+ ## Generation
69
+
70
+ ```bash
71
+ #!/bin/bash
72
+ # NVFP4 baseline + MXFP8 for sensitive layers + BF16 at critical points.
73
+ # Refiners: block 0 mostly MXFP8 (context attention.out and noise w3 stay BF16); block 1 outputs kept in BF16.
74
+ # Last QKV (layer 29), late adaLN (22-29), and refiner outputs in BF16.
75
+ # Main-trunk w1 (gate) projections in layers 3-29 in MXFP8.
76
+ convert_to_quant -i "$1" \
77
+ --nvfp4 --zimage --comfy_quant --save-quant-metadata \
78
+ --custom-type mxfp8 \
79
+ --custom-layers "layers\.(10|16|26)\.attention\.qkv\.weight|layers\.(27|28)\.attention\.qkv\.weight|layers\.(0|1)\.attention\.out\.weight|layers\.(3|6|9|11|12|13|14|19|20|26)\.attention\.out\.weight|layers\.(27|28|29)\.attention\.out\.weight|layers\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26)\.feed_forward\.w1\.weight|layers\.(27|28|29)\.feed_forward\.w1\.weight|layers\.(16|17|18|19|20|21)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.qkv\.weight|context_refiner\.(0|1)\.feed_forward\.w1\.weight|context_refiner\.(0|1)\.feed_forward\.w2\.weight|context_refiner\.(0|1)\.feed_forward\.w3\.weight|noise_refiner\.(0)\.attention\.(qkv|out)\.weight|noise_refiner\.(0)\.feed_forward\.(w1|w2)\.weight|noise_refiner\.(1)\.feed_forward\.w1\.weight" \
80
+ --exclude-layers "layers\.(29)\.attention\.qkv\.weight|layers\.(22|23|24|25|26)\.adaLN_modulation\.0\.weight|layers\.(27|28|29)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.out\.weight|context_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(1)\.attention\.qkv\.weight|noise_refiner\.(1)\.attention\.out\.weight|noise_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(0|1)\.feed_forward\.w3\.weight" \
81
+ --num-iter 6000 --top-p 0.35 --calib-samples 8192 --manual-seed 42 \
82
+ --scale-optimization iterative --scale-refinement-rounds 2 \
83
+ --extract-lora --lora-rank 32 \
84
+ -o "${1%%.safetensors}-nvfp4.safetensors"
85
+ ```
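
The two long regular expressions in the command are easy to misread, so the sketch below restates them in condensed form and resolves a tensor name to its format. It assumes that `--exclude-layers` takes priority over `--custom-layers` (the tables above imply this, since `context_refiner.1.feed_forward.w2` matches both patterns yet is listed as BF16) and that patterns are applied with a search-style match. How `convert_to_quant` actually evaluates them is not confirmed here. `MXFP8_PATTERN`, `BF16_PATTERN`, and `assigned_format` are hypothetical names.

```python
import re

# Condensed equivalent of the --custom-layers pattern (assumed MXFP8 targets).
MXFP8_PATTERN = re.compile(
    r"layers\.(10|16|26|27|28)\.attention\.qkv\.weight"
    r"|layers\.(0|1|3|6|9|11|12|13|14|19|20|26|27|28|29)\.attention\.out\.weight"
    r"|layers\.([3-9]|1[0-9]|2[0-9])\.feed_forward\.w1\.weight"
    r"|layers\.(1[6-9]|2[01])\.adaLN_modulation\.0\.weight"
    r"|context_refiner\.(0|1)\.(attention\.qkv|feed_forward\.w[123])\.weight"
    r"|noise_refiner\.0\.(attention\.(qkv|out)|feed_forward\.w[12])\.weight"
    r"|noise_refiner\.1\.feed_forward\.w1\.weight"
)

# Condensed equivalent of the --exclude-layers pattern (kept in BF16).
BF16_PATTERN = re.compile(
    r"layers\.29\.attention\.qkv\.weight"
    r"|layers\.2[2-9]\.adaLN_modulation\.0\.weight"
    r"|context_refiner\.(0|1)\.attention\.out\.weight"
    r"|context_refiner\.1\.feed_forward\.w2\.weight"
    r"|noise_refiner\.1\.(attention\.(qkv|out)|feed_forward\.w2)\.weight"
    r"|noise_refiner\.(0|1)\.feed_forward\.w3\.weight"
)


def assigned_format(tensor_name: str) -> str:
    """Resolve a linear-layer weight name to a format.

    Assumption: exclusions win over custom layers, and anything unmatched
    falls back to the NVFP4 baseline. Norms, biases, and embeddings that
    --zimage excludes automatically are not modeled here.
    """
    if BF16_PATTERN.search(tensor_name):
        return "bf16"
    if MXFP8_PATTERN.search(tensor_name):
        return "mxfp8"
    return "nvfp4"


if __name__ == "__main__":
    for name in (
        "layers.5.attention.qkv.weight",
        "layers.5.feed_forward.w1.weight",
        "layers.29.attention.qkv.weight",
        "context_refiner.1.feed_forward.w2.weight",
    ):
        print(name, "->", assigned_format(name))
```

Running a model's tensor names through a helper like this is a quick way to check that an edited pattern still produces the layout described in the tables.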
86
+
87
+ ### Included files
88
+
89
+ | File | Description |
90
+ |---|---|
91
+ | `z_image_turbo_nvfp4_mixed.safetensors` | Quantized weights |
92
+ | `z_image_turbo_nvfp4_mixed_lora.safetensors` | Error-correction LoRA (rank 32) |
93
+
94
+ Use the LoRA at **1.5–2.0** strength in ComfyUI for maximum fidelity.
95
+
96
+ ---
97
+
98
+ ## Requirements
99
+
100
+ - **Inference**: CUDA 13.0+, PyTorch 2.8+, [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200)
101
+ - **Generation**: `convert_to_quant >= 1.2.6`, `comfy-kitchen`
102
+
103
+ ---
104
+
105
+ ## Comparison
106
+
107
+ | | NVFP4 Mixed (this) | [MXFP8 Uniform](https://huggingface.co/InsecureErasure/Z-Image-Turbo-MXFP8) | [Official NVFP4](https://huggingface.co/Comfy-Org/z_image_turbo) |
108
+ |---|---|---|---|
109
+ | **Size** | 4.84 GB | 6.23 GB | 4.51 GB |
110
+ | **Base format** | NVFP4 (4-bit) | MXFP8 (8-bit) | NVFP4 (4-bit) |
111
+ | **Custom layers** | ~100 tensors → MXFP8 | None | None |
112
+ | **BF16 exclusions** | ~20 surgical | 8 patterns | Refiners fully BF16 |
113
+ | **Learned rounding** | ✅ 6000 iter | ❌ `--simple` | ❌ |
114
+ | **LoRA** | ✅ rank 32 | ❌ | ❌ |
115
+ | **Refiner block 0** | MXFP8 | MXFP8 | BF16 |
116
+ | **Late adaLN (22–29)** | BF16 | BF16 | NVFP4 ⚠️ |
117
+ | **Last QKV (layer 29)** | BF16 | BF16 | NVFP4 ⚠️ |
118
+ | **Quantization time¹** | ~60–90 min | ~5–10 min | N/A |
119
+
120
+ ¹ Estimated on RTX 5060 (Blackwell) with `comfy-kitchen` CUDA kernels.
121
+
122
+ ---
123
+
124
+ ## Methodology
125
+
126
+ Layer sensitivity was analyzed using [`quant_probe`](https://github.com/insecure-erasure/quant_probe), which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend `*KEEP*`, `FP8`, or `NVFP4`.
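
To make the three statistics concrete, here is a minimal pure-Python sketch of how they could be computed for one 2-D weight matrix. The definitions are assumptions, not `quant_probe`'s confirmed formulas: excess kurtosis as the fourth standardized moment minus 3, dynamic range as the largest over the smallest non-zero magnitude, and aspect ratio as rows over columns. `tensor_stats` is a hypothetical helper.

```python
def tensor_stats(rows: list[list[float]]) -> dict[str, float]:
    """Per-tensor statistics for a non-constant 2-D weight matrix.

    Assumed definitions (not taken from quant_probe's source):
      excess_kurtosis = m4 / m2**2 - 3   (0 for a Gaussian, large with outliers)
      dynamic_range   = max|w| / min non-zero |w|
      aspect_ratio    = rows / columns
    """
    values = [v for row in rows for v in row]
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (population)
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    magnitudes = [abs(v) for v in values if v != 0.0]
    return {
        "excess_kurtosis": m4 / (m2 * m2) - 3.0,
        "dynamic_range": max(magnitudes) / min(magnitudes),
        "aspect_ratio": len(rows) / len(rows[0]),
    }


if __name__ == "__main__":
    flat = [[1.0, -1.0], [1.0, -1.0]]
    spiky = [[1.0, -1.0, 1.0, -1.0], [1.0, -1.0, 10.0, -10.0]]
    print(tensor_stats(flat))
    print(tensor_stats(spiky))
```

A tensor with a few large outliers (`spiky`) scores higher on both kurtosis and dynamic range than a flat one. Tensors of this kind tend to lose the most under 4-bit block quantization, which is why they are candidates for MXFP8 or BF16.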
127
+
128
+ Recommendations were cross-referenced against the DiT quantization literature:
129
+
130
+ - **PTQ4DiT** (NeurIPS 2024): salient channels in QKV + FFN, last blocks most affected
131
+ - **ViDiT-Q** (ICLR 2025): metric-decoupled sensitivity, with self-attention dominating visual quality
132
+ - **HTG** (2025): channel-dependent outliers, severe in later blocks
133
+ - **SemanticDialect** (2026): block-wise mixed-format validated for video DiTs
134
+ - **SVDQuant** (ICLR 2025): low-rank branch absorbs 4-bit error, validated NVFP4
135
+
136
+ ---
137
+
138
+ ## Credits
139
+
140
+ - Layer sensitivity analysis via [`quant_probe`](https://github.com/insecure-erasure/quant_probe)
141
+ - Quantization engine: [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant) by silveroxides
142
+ - Z-Image Turbo model by [Tongyi-MAI](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)
143
+ - ComfyUI integration via [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen)