	---
	license: apache-2.0
	language:
	- en
	- zh
	base_model:
	- Tongyi-MAI/Z-Image-Turbo
	base_model_relation: quantized
	pipeline_tag: text-to-image
	library_name: diffusers
	tags:
	- comfyui
	- quantization
	- nvfp4
	- txt2img
	---

	# Z-Image Turbo - NVFP4 Mixed-Precision

	Surgical mixed-precision quantization of [Z-Image Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) (6B S3-DiT), generated with [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant).

	Formats: NVFP4 (baseline) + MXFP8 (sensitive layers) + BF16 (critical layers).
	Size: 4.84 GB (-58% vs BF16).
	Inference: ComfyUI + [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200).

	Also available: [MXFP8 uniform quantization](https://huggingface.co/InsecureErasure/Z-Image-Turbo-MXFP8) (6.23 GB, near-lossless).

	![BF16 vs NVFP4](images/BF16-NVFP4-comp.png)
	![NVFP4 vs NVFP4 plus rank 32 LoRA](images/NVFP4-LoRA-comp.png)

	* Prompt:
	```
	A bust portrait of a woman in her mid-twenties with messy dark hair tied in a loose bun, wearing a worn denim jacket over a gray hoodie.
	She is leaning her elbows on a washing machine, her chin resting on her folded hands. Behind her, a row of industrial dryers against a tiled wall,
	with one dryer door hanging open. Above the dryers, a handwritten sign taped to the wall says 'OUT OF ORDER' in black marker,
	with a small smiley face drawn on it. To her left, a plastic basket overflows with unfolded clothes. To her right, a vending machine glows green,
	displaying 'SOAP $1.50' on a small digital screen. The light is cool and buzzing, like fluorescent tubes overhead. She looks tired but amused
	with a faint smirk.
	```
	* Sampler/Scheduler: Euler/Simple
	* Steps: 9
	* CFG: 1.0
	* Shift: 3.0
	* Seed: 920698660737993
	* Resolution: 1024 x 1536

	## Strategy

	Formats are assigned per tensor, based on sensitivity analysis with [`quant_probe`](https://github.com/insecure-erasure/quant_probe) cross-checked against the DiT quantization literature (PTQ4DiT, ViDiT-Q, HTG, SemanticDialect, SVDQuant), to maximize quality per byte:

	- ~190 tensors → NVFP4 (4-bit E2M1): baseline for most attention + FF weights
	- ~100 tensors → MXFP8 (8-bit E4M3 + E8M0): attention outputs, gate projections (w1), mid-block adaLN
	- ~20 tensors → BF16: last QKV, late adaLN modulations, refiner outputs
	- ~110 tensors → BF16: norms, biases, embeddings (auto-excluded by `--zimage`)
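
To make the size trade-off between the three formats concrete, the sketch below computes the effective storage cost per weight, including the shared block scales. The block sizes are assumptions taken from the public format definitions (NVFP4: one E4M3 scale per 16 elements; MXFP8: one E8M0 scale per 32 elements), not from `convert_to_quant` itself. Per-tensor scales and file metadata are ignored.

```python
# Effective bits per weight for each format, counting the amortized block scale.
# Assumed block layouts: NVFP4 = 16 elements per E4M3 scale,
# MXFP8 = 32 elements per E8M0 scale (per the public format definitions).

def bits_per_weight(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Element bits plus the cost of one shared scale spread over a block."""
    return element_bits + scale_bits / block_size

FORMATS = {
    "NVFP4": bits_per_weight(4, 8, 16),  # E2M1 values + 8-bit scale per 16 weights
    "MXFP8": bits_per_weight(8, 8, 32),  # E4M3 values + 8-bit scale per 32 weights
    "BF16": 16.0,                        # plain 16-bit storage, no block scale
}

for name, bits in FORMATS.items():
    print(f"{name}: {bits:.2f} bits/weight ({bits / 16:.1%} of BF16)")
```

Under these assumptions, promoting a tensor from NVFP4 to MXFP8 roughly doubles its footprint, which is why the MXFP8 and BF16 tiers are reserved for the layers listed below.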

	### MXFP8-protected layers

	| Category | Blocks | Layers |
	|---|---|---|
	| Early attention outputs | 0, 1 | `attention.out` |
	| Selected QKV projections | 10, 16, 26, 27, 28 | `attention.qkv` |
	| Attention outputs | 3, 6, 9, 11–14, 19, 20, 26–29 | `attention.out` |
	| Gate projections (w1) | 3–29 | `feed_forward.w1` |
	| Mid-block modulations | 16–21 | `adaLN_modulation.0` |

	### BF16-protected layers

	| Category | Layers | Reason |
	|---|---|---|
	| Last QKV | `layers.29.attention.qkv` | Feeds directly into `final_layer`, so no later block can compensate for its error |
	| Late modulations | `layers.(22–29).adaLN_modulation.0` | Control scale/shift of features near the output |
	| Refiner attention outputs | `context_refiner.(0\|1).attention.out` | Only 2 refiner blocks, so their outputs have outsized impact |
	| Selected refiner projections | `context_refiner.1.w2`, `noise_refiner.1.{qkv,out,w2}` | Critical single-block projections |
	| Refiner up-projections | `noise_refiner.(0\|1).w3` | Noise refiner w3 expands features that feed the output directly |

	### Refiner sub-graphs

	| Sub-graph | Block 0 | Block 1 |
	|---|---|---|
	| `context_refiner` | qkv + w1 + w2 + w3 MXFP8, out BF16 | qkv + w1 + w3 MXFP8, out + w2 BF16 |
	| `noise_refiner` | qkv + out + w1 + w2 MXFP8, w3 BF16 | qkv + out + w2 + w3 BF16, w1 MXFP8 |

	## Generation

	```bash
	#!/bin/bash
	# NVFP4 baseline + MXFP8 for sensitive layers + BF16 at critical points.
	# Refiners: MXFP8 except context attention.out, noise w3, and the block-1
	# projections listed in --exclude-layers, which stay in BF16.
	# Last QKV (layer 29) and late adaLN (22-29) in BF16.
	# Main-trunk w1 (gate) projections for blocks 3-29 in MXFP8.
	convert_to_quant -i "$1" \
	--nvfp4 --zimage --comfy_quant --save-quant-metadata \
	--custom-type mxfp8 \
	--custom-layers "layers\.(10|16|26)\.attention\.qkv\.weight|layers\.(27|28)\.attention\.qkv\.weight|layers\.(0|1)\.attention\.out\.weight|layers\.(3|6|9|11|12|13|14|19|20|26)\.attention\.out\.weight|layers\.(27|28|29)\.attention\.out\.weight|layers\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26)\.feed_forward\.w1\.weight|layers\.(27|28|29)\.feed_forward\.w1\.weight|layers\.(16|17|18|19|20|21)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.qkv\.weight|context_refiner\.(0|1)\.feed_forward\.w1\.weight|context_refiner\.(0|1)\.feed_forward\.w2\.weight|context_refiner\.(0|1)\.feed_forward\.w3\.weight|noise_refiner\.(0)\.attention\.(qkv|out)\.weight|noise_refiner\.(0)\.feed_forward\.(w1|w2)\.weight|noise_refiner\.(1)\.feed_forward\.w1\.weight" \
	--exclude-layers "layers\.(29)\.attention\.qkv\.weight|layers\.(22|23|24|25|26)\.adaLN_modulation\.0\.weight|layers\.(27|28|29)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.out\.weight|context_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(1)\.attention\.qkv\.weight|noise_refiner\.(1)\.attention\.out\.weight|noise_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(0|1)\.feed_forward\.w3\.weight" \
	--num-iter 6000 --top-p 0.35 --calib-samples 8192 \
	--scale-optimization iterative --scale-refinement-rounds 2 \
	--extract-lora --lora-rank 32 \
	-o "${1%%.safetensors}-nvfp4.safetensors"
	```
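
Before a long quantization run, it can help to check which tier a given tensor name would land in. The sketch below is illustrative only. It uses a shortened subset of the two patterns from the script. It assumes the tool applies them with regex search semantics and that `--exclude-layers` takes precedence over `--custom-layers`; neither assumption is confirmed here. The tensor names in the usage lines are inferred from the patterns themselves.

```python
import re

# Illustrative subsets of the --custom-layers and --exclude-layers patterns.
CUSTOM_MXFP8 = (
    r"layers\.(0|1)\.attention\.out\.weight"
    r"|layers\.(16|17|18|19|20|21)\.adaLN_modulation\.0\.weight"
)
EXCLUDE_BF16 = (
    r"layers\.(29)\.attention\.qkv\.weight"
    r"|layers\.(27|28|29)\.adaLN_modulation\.0\.weight"
)

def classify(name: str) -> str:
    """Assumed precedence: exclusions first, then custom type, else the NVFP4 baseline."""
    if re.search(EXCLUDE_BF16, name):
        return "BF16"
    if re.search(CUSTOM_MXFP8, name):
        return "MXFP8"
    return "NVFP4"

for name in (
    "layers.1.attention.out.weight",
    "layers.10.attention.out.weight",
    "layers.18.adaLN_modulation.0.weight",
    "layers.29.attention.qkv.weight",
):
    print(f"{name} -> {classify(name)}")
```

Because each alternative is followed by an escaped dot, `layers\.(0|1)\.` does not accidentally match block 10. That is the main thing worth checking when editing the block lists by hand.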

	### Included files

	| File | Description |
	|---|---|
	| `z_image_turbo_nvfp4.safetensors` | Quantized weights |
	| `z_image_turbo_nvfp4_lora.safetensors` | Error-correction LoRA (rank 32) |

	In ComfyUI, load the LoRA alongside the quantized weights and adjust its strength to recover fidelity lost to quantization.
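
To see how tensors in either file are stored without loading any weights, the `.safetensors` header can be read directly. The format starts with an 8-byte little-endian length, followed by a JSON header of that size. The sketch below uses only the standard library and tallies the stored dtypes. How those dtypes map back to NVFP4 or MXFP8 depends on the layout `convert_to_quant` writes, which is not assumed here.

```python
import json
import struct
from collections import Counter

def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file (no tensor data is read)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        return json.loads(f.read(header_len).decode("utf-8"))

def dtype_histogram(header: dict) -> Counter:
    """Count tensors per stored dtype, skipping the optional __metadata__ entry."""
    return Counter(
        info["dtype"] for name, info in header.items() if name != "__metadata__"
    )

# Usage (path is the file listed in the table above):
# header = read_safetensors_header("z_image_turbo_nvfp4.safetensors")
# print(dtype_histogram(header).most_common())
```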

	## Requirements

	- Inference: CUDA 13.0+, PyTorch 2.10+, [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200)
	- Generation: `convert_to_quant >= 1.2.6`, `comfy-kitchen`

	## Comparison

	| | NVFP4 Mixed (this) | MXFP8 Uniform | Official NVFP4 |
	| --- | --- | --- | --- |
	| Size | 4.84 GB | 6.23 GB | 4.51 GB |
	| Base format | NVFP4 (4-bit) | MXFP8 (8-bit) | NVFP4 (4-bit) |
	| Custom layers | ~100 tensors → MXFP8 | None | None |
	| BF16 exclusions | ~20 tensors | 8 patterns | Refiners fully BF16 |
	| Learned rounding | ✅ 6000 iter | ❌ --simple | ❌ |
	| LoRA | ✅ rank 32 | ❌ | ❌ |
	| Refiner block 0 | MXFP8 | MXFP8 | BF16 |
	| Late adaLN (22–29) | BF16 | BF16 | NVFP4 ⚠️ |
	| Last QKV (layer 29) | BF16 | BF16 | NVFP4 ⚠️ |
	| Quantization time¹ | ~60–90 min | ~5–10 min | N/A |

	¹ Estimated on RTX 5060 (Blackwell) with `comfy-kitchen` CUDA kernels.

	## Methodology

	Layer sensitivity was analyzed using [`quant_probe`](https://github.com/insecure-erasure/quant_probe), which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend `KEEP`, `FP8`, or `NVFP4`.
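
As a rough illustration of the statistics named above, the sketch below computes excess kurtosis and dynamic range for a flat list of weights, then scores one tensor against a population of tensors. It is not `quant_probe`'s implementation. The dynamic-range definition (largest over smallest non-zero magnitude, in dB) and the z-score used for "scoring against the model's own distribution" are assumptions, and aspect ratio and the `KEEP`/`FP8`/`NVFP4` thresholds are left out.

```python
import math
import random
import statistics

def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (about 0 for a Gaussian)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0

def dynamic_range_db(values, eps=1e-12):
    """Assumed definition: 20*log10(max |w| / min non-zero |w|)."""
    mags = [abs(v) for v in values if abs(v) > eps]
    return 20.0 * math.log10(max(mags) / min(mags))

def relative_score(value, population):
    """Z-score of one tensor's statistic against all tensors in the model."""
    return (value - statistics.mean(population)) / statistics.pstdev(population)

# Demo on synthetic data: a few large outliers sharply raise excess kurtosis,
# which is the kind of tensor a 4-bit format handles poorly.
random.seed(0)
well_behaved = [random.gauss(0.0, 1.0) for _ in range(20000)]
with_outliers = well_behaved + [25.0] * 20
print(f"gaussian:      {excess_kurtosis(well_behaved):.2f}")
print(f"with outliers: {excess_kurtosis(with_outliers):.2f}")
```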

	Recommendations were cross-referenced against the DiT quantization literature:

	- PTQ4DiT (NeurIPS 2024) — salient channels in QKV + FFN, last blocks most affected
	- ViDiT-Q (ICLR 2025) — metric-decoupled sensitivity: self-attention dominates visual quality
	- HTG (2025) — channel-dependent outliers, severe in later blocks
	- SemanticDialect (2026) — block-wise mixed-format validated for video DiTs
	- SVDQuant (ICLR 2025) — low-rank branch absorbs 4-bit error, validated NVFP4

	## Credits

	- Quantization engine: [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant) by silveroxides
	- Z-Image Turbo model by [Tongyi-MAI](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)
	- ComfyUI integration via [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen)
	- Layer sensitivity analysis via [`quant_probe`](https://github.com/insecure-erasure/quant_probe)