---
license: apache-2.0
language:
  - en
  - zh
base_model:
  - Tongyi-MAI/Z-Image-Turbo
base_model_relation: quantized
pipeline_tag: text-to-image
library_name: diffusers
tags:
  - comfyui
  - quantization
  - nvfp4
  - txt2img
---

# Z-Image Turbo - NVFP4 Mixed-Precision

Surgical mixed-precision quantization of [Z-Image Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) (6B S3-DiT), generated with [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant).

**Formats**: NVFP4 (baseline) + MXFP8 (sensitive layers) + BF16 (critical layers).  
**Size**: 4.84 GB (-58% vs BF16).  
**Inference**: ComfyUI + [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200).

Also available: [MXFP8 uniform quantization](https://huggingface.co/InsecureErasure/Z-Image-Turbo-MXFP8) (6.23 GB, near-lossless).

![BF16 vs NVFP4](images/BF16-NVFP4-comp.png)
![NVFP4 vs NVFP4 plus rank 32 LoRA](images/NVFP4-LoRA-comp.png)

* **Prompt:**
```
A bust portrait of a woman in her mid-twenties with messy dark hair tied in a loose bun, wearing a worn denim jacket over a gray hoodie.
She is leaning her elbows on a washing machine, her chin resting on her folded hands. Behind her, a row of industrial dryers against a tiled wall,
with one dryer door hanging open. Above the dryers, a handwritten sign taped to the wall says 'OUT OF ORDER' in black marker,
with a small smiley face drawn on it. To her left, a plastic basket overflows with unfolded clothes. To her right, a vending machine glows green,
displaying 'SOAP $1.50' on a small digital screen. The light is cool and buzzing, like fluorescent tubes overhead. She looks tired but amused
with a faint smirk.
```
* **Sampler/Scheduler:** Euler/Simple
* **Steps:** 9
* **CFG:** 1.0
* **Shift:** 3.0
* **Seed:** 920698660737993
* **Resolution:** 1024 x 1536

## Strategy

Format assignments come from per-layer sensitivity analysis with [`quant_probe`](https://github.com/insecure-erasure/quant_probe), cross-checked against the DiT quantization literature (PTQ4DiT, ViDiT-Q, HTG, SemanticDialect, SVDQuant), to maximize quality per byte:

- **~190 tensors → NVFP4** (4-bit E2M1): baseline for most attention + FF weights
- **~100 tensors → MXFP8** (8-bit E4M3 + E8M0): attention outputs, gate projections (w1), mid-block adaLN
- **~20 tensors → BF16**: last QKV, late adaLN modulations, refiner outputs
- **~110 tensors → BF16**: norms, biases, embeddings (auto-excluded by `--zimage`)
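
As a rough sanity check on the size budget, the storage cost per weight of each format can be estimated from its block layout. This is a minimal sketch that assumes the standard block sizes (NVFP4: 16 elements sharing one 8-bit E4M3 scale; MXFP8: 32 elements sharing one 8-bit E8M0 scale) and ignores per-tensor scales and file metadata. The function name is illustrative.

```python
def bits_per_weight(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Effective bits per weight for a block-scaled format.

    Each block of `block_size` elements carries one shared scale, so the
    scale cost is amortized across the block.
    """
    return element_bits + scale_bits / block_size


FORMATS = {
    # Assumed layouts: see the lead-in above.
    "NVFP4": bits_per_weight(4, 8, 16),  # 4.5 bits per weight
    "MXFP8": bits_per_weight(8, 8, 32),  # 8.25 bits per weight
    "BF16": 16.0,                        # unquantized reference
}

for name, bits in FORMATS.items():
    print(f"{name}: {bits:.2f} bits/weight ({bits / 16:.0%} of BF16)")
```

Under these assumptions, each tensor promoted from NVFP4 to MXFP8 costs a little under twice the storage, which is why the promotion list is kept selective.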

### MXFP8-protected layers

| Category | Blocks | Layers |
|---|---|---|
| Early attention outputs | 0, 1 | `attention.out` |
| Selected QKV projections | 10, 16, 26, 27, 28 | `attention.qkv` |
| Attention outputs | 3, 6, 9, 11–14, 19, 20, 26–29 | `attention.out` |
| Gate projections (w1) | 3–29 | `feed_forward.w1` |
| Mid-block modulations | 16–21 | `adaLN_modulation.0` |

### BF16-protected layers

| Category | Layers | Reason |
|---|---|---|
| Last QKV | `layers.29.attention.qkv` | Feeds directly into `final_layer`, so there is no downstream compensation |
| Late modulations | `layers.(22–29).adaLN_modulation.0` | Controls scale/shift of features near output |
| Refiner attention outputs | `context_refiner.(0\|1).attention.out` | Only 2 refiner blocks, so outputs have outsized impact |
| Selected refiner FF | `context_refiner.1.w2`, `noise_refiner.1.{qkv,out,w2}` | Critical single-block projections |
| Refiner up-projections | `noise_refiner.(0\|1).w3` | Noise refiner w3 expands features → direct output |

### Refiner sub-graphs

| Sub-graph | Block 0 | Block 1 |
|---|---|---|
| `context_refiner` | qkv + w1 + w2 + w3 MXFP8, out BF16 | qkv + w1 + w3 MXFP8, out + w2 BF16 |
| `noise_refiner` | qkv + out + w1 + w2 MXFP8, w3 BF16 | qkv + out + w2 + w3 BF16, w1 MXFP8 |

## Generation

```bash
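#!/bin/bash
# Usage: pass the source BF16 .safetensors checkpoint as the first argument.
# NVFP4 baseline + MXFP8 for sensitive layers + BF16 at critical points.
# Refiners: block 0 mostly MXFP8 (context_refiner attention.out and
# noise_refiner w3 stay in BF16); block 1 outputs kept in BF16.
# Last QKV (layer 29), late adaLN (22-29), and refiner outputs in BF16.
# Main-trunk w1 (gate) projections in blocks 3-29 in MXFP8.
convert_to_quant -i "$1" \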
  --nvfp4 --zimage --comfy_quant --save-quant-metadata \
  --custom-type mxfp8 \
  --custom-layers "layers\.(10|16|26)\.attention\.qkv\.weight|layers\.(27|28)\.attention\.qkv\.weight|layers\.(0|1)\.attention\.out\.weight|layers\.(3|6|9|11|12|13|14|19|20|26)\.attention\.out\.weight|layers\.(27|28|29)\.attention\.out\.weight|layers\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26)\.feed_forward\.w1\.weight|layers\.(27|28|29)\.feed_forward\.w1\.weight|layers\.(16|17|18|19|20|21)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.qkv\.weight|context_refiner\.(0|1)\.feed_forward\.w1\.weight|context_refiner\.(0|1)\.feed_forward\.w2\.weight|context_refiner\.(0|1)\.feed_forward\.w3\.weight|noise_refiner\.(0)\.attention\.(qkv|out)\.weight|noise_refiner\.(0)\.feed_forward\.(w1|w2)\.weight|noise_refiner\.(1)\.feed_forward\.w1\.weight" \
  --exclude-layers "layers\.(29)\.attention\.qkv\.weight|layers\.(22|23|24|25|26)\.adaLN_modulation\.0\.weight|layers\.(27|28|29)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.out\.weight|context_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(1)\.attention\.qkv\.weight|noise_refiner\.(1)\.attention\.out\.weight|noise_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(0|1)\.feed_forward\.w3\.weight" \
  --num-iter 6000 --top-p 0.35 --calib-samples 8192 \
  --scale-optimization iterative --scale-refinement-rounds 2 \
  --extract-lora --lora-rank 32 \
  -o "${1%%.safetensors}-nvfp4.safetensors"
```
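
Because the two regex arguments are long, it can help to check how individual tensor names resolve before starting a long conversion. The sketch below copies the sub-patterns from the command above and rests on two assumptions that are not confirmed here: that `convert_to_quant` applies the patterns with search semantics, and that `--exclude-layers` takes precedence over `--custom-layers` (the BF16-protected table lists `context_refiner.1.w2` as BF16 even though it also matches an MXFP8 pattern). It only covers weight tensors that `--zimage` does not auto-exclude. `assign_format` is an illustrative helper, not part of the tool.

```python
import re

# Sub-patterns copied from --custom-layers (MXFP8).
CUSTOM_MXFP8 = "|".join([
    r"layers\.(10|16|26)\.attention\.qkv\.weight",
    r"layers\.(27|28)\.attention\.qkv\.weight",
    r"layers\.(0|1)\.attention\.out\.weight",
    r"layers\.(3|6|9|11|12|13|14|19|20|26)\.attention\.out\.weight",
    r"layers\.(27|28|29)\.attention\.out\.weight",
    r"layers\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26)\.feed_forward\.w1\.weight",
    r"layers\.(27|28|29)\.feed_forward\.w1\.weight",
    r"layers\.(16|17|18|19|20|21)\.adaLN_modulation\.0\.weight",
    r"context_refiner\.(0|1)\.attention\.qkv\.weight",
    r"context_refiner\.(0|1)\.feed_forward\.w1\.weight",
    r"context_refiner\.(0|1)\.feed_forward\.w2\.weight",
    r"context_refiner\.(0|1)\.feed_forward\.w3\.weight",
    r"noise_refiner\.(0)\.attention\.(qkv|out)\.weight",
    r"noise_refiner\.(0)\.feed_forward\.(w1|w2)\.weight",
    r"noise_refiner\.(1)\.feed_forward\.w1\.weight",
])

# Sub-patterns copied from --exclude-layers (kept in BF16).
EXCLUDE_BF16 = "|".join([
    r"layers\.(29)\.attention\.qkv\.weight",
    r"layers\.(22|23|24|25|26)\.adaLN_modulation\.0\.weight",
    r"layers\.(27|28|29)\.adaLN_modulation\.0\.weight",
    r"context_refiner\.(0|1)\.attention\.out\.weight",
    r"context_refiner\.(1)\.feed_forward\.w2\.weight",
    r"noise_refiner\.(1)\.attention\.qkv\.weight",
    r"noise_refiner\.(1)\.attention\.out\.weight",
    r"noise_refiner\.(1)\.feed_forward\.w2\.weight",
    r"noise_refiner\.(0|1)\.feed_forward\.w3\.weight",
])


def assign_format(name: str) -> str:
    """Resolve a weight tensor name to its target format."""
    # Assumption: exclusions win over custom-type matches.
    if re.search(EXCLUDE_BF16, name):
        return "BF16"
    if re.search(CUSTOM_MXFP8, name):
        return "MXFP8"
    return "NVFP4"  # baseline from --nvfp4


for tensor in [
    "layers.2.feed_forward.w1.weight",
    "layers.5.feed_forward.w1.weight",
    "layers.29.attention.qkv.weight",
    "context_refiner.1.feed_forward.w2.weight",
]:
    print(f"{tensor} -> {assign_format(tensor)}")
```

Running the same function over the key list of the source checkpoint gives per-format tensor counts that can be compared with the figures in the Strategy section.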

### Included files

| File | Description |
|---|---|
| `z_image_turbo_nvfp4.safetensors` | Quantized weights |
| `z_image_turbo_nvfp4_lora.safetensors` | Error-correction LoRA (rank 32) |

Load the error-correction LoRA alongside the quantized weights in ComfyUI and adjust its strength to improve fidelity.

## Requirements

- **Inference**: CUDA 13.0+, PyTorch 2.10+, [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200)
- **Generation**: `convert_to_quant >= 1.2.6`, `comfy-kitchen`

## Comparison

| | NVFP4 Mixed (this) | MXFP8 Uniform | Official NVFP4 |
| --- | --- | --- | --- |
| Size | 4.84 GB | 6.23 GB | 4.51 GB |
| Base format | NVFP4 (4-bit) | MXFP8 (8-bit) | NVFP4 (4-bit) |
| Custom layers | ~100 tensors → MXFP8 | None | None |
| BF16 exclusions | ~20 tensors | 8 patterns | Refiners fully BF16 |
| Learned rounding | ✅ 6000 iter | ❌ `--simple` | ❌ |
| LoRA | ✅ rank 32 | ❌ | ❌ |
| Refiner block 0 | MXFP8 | MXFP8 | BF16 |
| Late adaLN (22–29) | BF16 | BF16 | NVFP4 ⚠️ |
| Last QKV (layer 29) | BF16 | BF16 | NVFP4 ⚠️ |
| Quantization time¹ | ~60–90 min | ~5–10 min | N/A |

¹ Estimated on RTX 5060 (Blackwell) with `comfy-kitchen` CUDA kernels.

## Methodology

Layer sensitivity was analyzed using [`quant_probe`](https://github.com/insecure-erasure/quant_probe), which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend `*KEEP*`, `FP8`, or `NVFP4`.
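
The three statistics named above can be computed directly from a weight matrix. The sketch below is not `quant_probe`'s implementation. The definitions (population excess kurtosis, dynamic range as log2 of the largest to the smallest nonzero magnitude, aspect ratio as rows over columns) and the function name are assumptions chosen for illustration.

```python
import math


def tensor_stats(rows):
    """Compute sensitivity statistics for a 2-D weight matrix.

    `rows` is a list of equal-length lists of floats.
    """
    flat = [v for row in rows for v in row]
    n = len(flat)
    mean = sum(flat) / n
    m2 = sum((v - mean) ** 2 for v in flat) / n  # population variance
    m4 = sum((v - mean) ** 4 for v in flat) / n  # fourth central moment
    magnitudes = [abs(v) for v in flat if v != 0.0]
    return {
        # Heavy tails (outlier weights) push this above 0.
        "excess_kurtosis": m4 / (m2 ** 2) - 3.0,
        # Bits of range needed to cover the nonzero magnitudes.
        "dynamic_range": math.log2(max(magnitudes) / min(magnitudes)),
        "aspect_ratio": len(rows) / len(rows[0]),
    }


# A flat distribution versus the same matrix with one outlier weight.
flat_weights = [[1.0, -1.0, 1.0, -1.0], [1.0, -1.0, 1.0, -1.0]]
outlier_weights = [[1.0, -1.0, 1.0, -1.0], [1.0, -1.0, 1.0, -8.0]]

print(tensor_stats(flat_weights))
print(tensor_stats(outlier_weights))
```

A single large weight raises both the kurtosis and the dynamic range, which is the kind of tensor a 4-bit format with a small block-shared scale represents poorly.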

Recommendations were cross-referenced against the DiT quantization literature:

- **PTQ4DiT** (NeurIPS 2024): salient channels in QKV + FFN, last blocks most affected
- **ViDiT-Q** (ICLR 2025): metric-decoupled sensitivity; self-attention dominates visual quality
- **HTG** (2025): channel-dependent outliers, severe in later blocks
- **SemanticDialect** (2026): block-wise mixed-format validated for video DiTs
- **SVDQuant** (ICLR 2025): low-rank branch absorbs 4-bit error, validated NVFP4

## Credits

- Quantization engine: [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant) by silveroxides
- Z-Image Turbo model by [Tongyi-MAI](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)
- ComfyUI integration via [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen)
- Layer sensitivity analysis via [`quant_probe`](https://github.com/insecure-erasure/quant_probe)