	---
	license: apache-2.0
	language:
	- en
	- zh
	base_model:
	- Tongyi-MAI/Z-Image-Turbo
	base_model_relation: quantized
	pipeline_tag: text-to-image
	library_name: diffusers
	tags:
	- comfyui
	- quantization
	- mxfp8
	- txt2img
	---


	# Z-Image Turbo MXFP8

	Mixed 8-bit microscaling quantization of [Z-Image Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) (6B S3-DiT), generated with [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant).

	* Format: MXFP8 (8-bit E4M3 + E8M0 block scales) with minimal BF16 exclusions.
	* Size: 6.23 GB (−46% vs BF16).
	* Inference: ComfyUI + [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200).

	![ZiT-MXFP8-01.png](images/ZiT-MXFP8-01.png)
	![ZiT-MXFP8-02.png](images/ZiT-MXFP8-02.png)


	## Key design decisions

	With 8-bit E4M3 elements and E8M0 microscaling (one shared scale per 32-element block), each weight has 256 possible encodings, 16× as many as NVFP4's 4-bit grid. The DiT quantization literature (PTQ4DiT, ViDiT-Q, SemanticDialect) and our own `quant_probe` analysis converge on the same conclusion:

	The format itself is near-lossless. Learned rounding, LoRA error correction, and scale optimization, all critical at 4-bit, provide diminishing returns here. The recipe therefore keeps a handful of architecturally critical layers in BF16 and quantizes everything else to MXFP8.

	- `--simple`: skips learned rounding. Bias correction (always active) handles systematic error, and the remaining rounding noise at 8-bit is below perceptibility.
	- No low-rank (LoRA) correction branch: the residual quantization error at 8-bit is <0.1% MSE.
	- A short exclusion list: 9 regex alternatives in the command under Generation (8 if the two adaLN ranges are counted as one), covering only the layers that `quant_probe` and the literature flag as critical.
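
To make the format concrete, here is a minimal pure-Python sketch of MXFP8 block quantization: each block of 32 values shares one power-of-two (E8M0) scale, and each element is rounded to the nearest E4M3 value. It assumes the OCP Microscaling convention for picking the shared exponent and saturating rounding. It is an illustration only, not the code path used by `convert_to_quant` or `comfy-kitchen`, and the Gaussian test weights are synthetic.

```python
import math
import random

E4M3_MAX = 448.0     # largest finite E4M3 value
E4M3_MIN_EXP = -6    # exponent of the smallest normal E4M3 value
E4M3_EMAX = 8        # exponent of the largest E4M3 binade (256..448)
BLOCK = 32           # elements sharing one E8M0 scale


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3 value (round-half-even, saturating)."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**e with 0.5 <= m < 1, so floor(log2(a)) == e - 1.
    exp = max(math.frexp(a)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits -> 8 steps per binade
    q = min(round(a / step) * step, E4M3_MAX)
    return math.copysign(q, x)


def quantize_mxfp8_block(block):
    """Quantize and dequantize one block with a shared power-of-two scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # E8M0 stores only an exponent, so the scale is always a power of two.
    shared_exp = (math.frexp(amax)[1] - 1) - E4M3_EMAX
    shared_exp = max(-127, min(127, shared_exp))  # E8M0 exponent range
    scale = 2.0 ** shared_exp
    return [round_to_e4m3(v / scale) * scale for v in block]


def relative_mse(weights):
    """Squared quantization error divided by the squared weight norm."""
    deq = []
    for i in range(0, len(weights), BLOCK):
        deq.extend(quantize_mxfp8_block(weights[i:i + BLOCK]))
    err = sum((w - d) ** 2 for w, d in zip(weights, deq))
    return err / sum(w * w for w in weights)


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(BLOCK * 256)]  # synthetic
rel_mse = relative_mse(weights)
print(f"relative MSE on synthetic Gaussian weights: {rel_mse:.5%}")
```

Real checkpoints have heavier tails than a Gaussian, so the number printed here is only a rough reference point for the error figures quoted above.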

	BF16-excluded layers:

	| Category | Layers | Reason |
	|---|---|---|
	| Last QKV | `layers.29.attention.qkv` | Feeds directly into `final_layer`, so no downstream block can compensate |
	| Late modulations | `layers.(22–29).adaLN_modulation.0` | Control the scale/shift of features near the output |
	| Refiner attention outputs | `context_refiner.(0\|1).attention.out` | Only 2 refiner blocks, so their outputs have outsized impact |
	| Selected refiner projections | `context_refiner.1.feed_forward.w2`, `noise_refiner.1.attention.{qkv,out}`, `noise_refiner.1.feed_forward.w2` | Critical single-block projections |
	| Refiner up-projections | `noise_refiner.(0\|1).feed_forward.w3` | Noise refiner `w3` expands features that feed directly into the output |

	All other weight tensors (attention projections, feed-forward layers, early/mid-block modulations, and the remaining refiner weights) use MXFP8.
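
The reported file size can be sanity-checked with a little arithmetic. The sketch below assumes that every MXFP8 weight costs 8 element bits plus one 8-bit scale per 32 elements, and that file metadata and non-weight tensors are negligible. The BF16 fraction it derives is a rough estimate from the rounded figures in this README, not a measured value.

```python
ELEMENT_BITS = 8   # one E4M3 element
SCALE_BITS = 8     # one E8M0 scale per block
BLOCK = 32
BF16_BITS = 16

# Effective storage cost of an MXFP8 weight, including its share of the scale.
mxfp8_bits = ELEMENT_BITS + SCALE_BITS / BLOCK

# Upper bound on the saving if every weight were quantized.
max_saving = 1 - mxfp8_bits / BF16_BITS

# Figures quoted in this README (rounded, so everything below is approximate).
reported_size_gb = 6.23
reported_saving = 0.46

implied_bf16_gb = reported_size_gb / (1 - reported_saving)

# Solve f * 16 + (1 - f) * 8.25 = 16 * (1 - saving) for the BF16 fraction f.
avg_bits = BF16_BITS * (1 - reported_saving)
bf16_fraction = (avg_bits - mxfp8_bits) / (BF16_BITS - mxfp8_bits)

print(f"MXFP8 bits per weight:    {mxfp8_bits}")
print(f"max saving vs BF16:       {max_saving:.1%}")
print(f"implied BF16 checkpoint:  {implied_bf16_gb:.2f} GB")
print(f"estimated BF16 fraction:  {bf16_fraction:.1%}")
```

The gap between the all-MXFP8 bound and the reported saving is what the BF16 exclusions cost in file size.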

	## Generation

	```bash
	#!/bin/bash
	# MXFP8 8-bit microscaling: near-lossless, no learned rounding needed.
	# Late adaLN (22-29), last QKV (layer 29), and selected refiner projections stay in BF16.
	# $1 is the input .safetensors checkpoint; the output is written next to it.
	convert_to_quant -i "$1" \
	--mxfp8 --zimage --comfy_quant --save-quant-metadata \
	--simple --low-memory \
	--calib-samples 8192 \
	--exclude-layers "layers\.(29)\.attention\.qkv\.weight|layers\.(22|23|24|25|26)\.adaLN_modulation\.0\.weight|layers\.(27|28|29)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.out\.weight|context_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(1)\.attention\.qkv\.weight|noise_refiner\.(1)\.attention\.out\.weight|noise_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(0|1)\.feed_forward\.w3\.weight" \
	-o "${1%.safetensors}-mxfp8.safetensors"
	```
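
Before running a long conversion, it can help to check which tensors the exclusion list would keep in BF16. The sketch below rebuilds the list as a regex alternation joined by unescaped `|` and tests it against a few tensor names. The names are hypothetical, modeled on the table above, and matching with `re.search` is an assumption about how `convert_to_quant` applies the pattern.

```python
import re

# The alternatives from --exclude-layers, joined into one alternation.
EXCLUDE = "|".join([
    r"layers\.(29)\.attention\.qkv\.weight",
    r"layers\.(22|23|24|25|26)\.adaLN_modulation\.0\.weight",
    r"layers\.(27|28|29)\.adaLN_modulation\.0\.weight",
    r"context_refiner\.(0|1)\.attention\.out\.weight",
    r"context_refiner\.(1)\.feed_forward\.w2\.weight",
    r"noise_refiner\.(1)\.attention\.qkv\.weight",
    r"noise_refiner\.(1)\.attention\.out\.weight",
    r"noise_refiner\.(1)\.feed_forward\.w2\.weight",
    r"noise_refiner\.(0|1)\.feed_forward\.w3\.weight",
])
pattern = re.compile(EXCLUDE)


def stays_bf16(name: str) -> bool:
    """True if the tensor name matches any exclusion alternative."""
    return pattern.search(name) is not None


# Hypothetical tensor names for illustration; check them against the
# key names in the actual checkpoint.
names = [
    "layers.29.attention.qkv.weight",
    "layers.28.attention.qkv.weight",
    "layers.21.adaLN_modulation.0.weight",
    "layers.22.adaLN_modulation.0.weight",
    "context_refiner.0.attention.out.weight",
    "context_refiner.0.feed_forward.w2.weight",
    "noise_refiner.0.feed_forward.w3.weight",
    "noise_refiner.0.attention.qkv.weight",
]
for name in names:
    print(f"{'BF16 ' if stays_bf16(name) else 'MXFP8'}  {name}")
```

Running the same check over the key names of the real checkpoint shows whether every alternative matches at least one tensor.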

	## Requirements

	- Inference: CUDA 13.0+, PyTorch 2.10+, [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200)
	- Generation: `convert_to_quant >= 1.2.6`, `comfy-kitchen`

	## Methodology

	Layer sensitivity was analyzed using [`quant_probe`](https://github.com/insecure-erasure/quant_probe), which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend `KEEP`, `FP8`, or `NVFP4`.
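
The three statistics named above can be sketched in a few lines of pure Python. The definitions below are assumptions made for illustration and may differ from what `quant_probe` actually computes: dynamic range is the log2 ratio of largest to smallest nonzero magnitude, aspect ratio is the longer side over the shorter, and the "model's own distribution" is represented by a z-score across tensors. The tensor names and data are synthetic.

```python
import math
import random


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (0 for a Gaussian)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0


def dynamic_range_log2(values):
    """log2 of the largest over the smallest nonzero magnitude."""
    mags = [abs(v) for v in values if v != 0.0]
    return math.log2(max(mags) / min(mags))


def aspect_ratio(shape):
    """Longer side of a 2-D weight divided by the shorter side."""
    rows, cols = shape
    return max(rows, cols) / min(rows, cols)


def zscore(x, population):
    """Position of x relative to the statistic's spread across tensors."""
    mean = sum(population) / len(population)
    var = sum((p - mean) ** 2 for p in population) / len(population)
    return (x - mean) / math.sqrt(var) if var > 0 else 0.0


# Synthetic stand-ins for flattened weight tensors (hypothetical names).
rng = random.Random(0)
tensors = {
    f"layers.{i}.hypothetical.weight": [rng.gauss(0.0, 1.0) for _ in range(4096)]
    for i in range(8)
}
outlier = [rng.gauss(0.0, 1.0) for _ in range(4096)]
for i in range(8):
    outlier[i * 500] = 20.0  # a few large outliers give a heavy tail
tensors["layers.8.hypothetical.weight"] = outlier

kurt = {name: excess_kurtosis(v) for name, v in tensors.items()}
scores = {name: zscore(k, list(kurt.values())) for name, k in kurt.items()}
most_sensitive = max(scores, key=scores.get)
print(f"highest kurtosis z-score: {most_sensitive} ({scores[most_sensitive]:.2f})")
```

A tensor whose score stands far above the rest is the kind of candidate a `KEEP` recommendation would target, with the threshold being a tool-specific choice.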

	Recommendations were cross-referenced against the DiT quantization literature:

	- PTQ4DiT (NeurIPS 2024) — salient channels in QKV + FFN, last blocks most affected
	- ViDiT-Q (ICLR 2025) — metric-decoupled sensitivity: self-attention dominates visual quality
	- HTG (2025) — channel-dependent outliers, severe in later blocks
	- SemanticDialect (2026) — block-wise mixed-format validated for video DiTs
	- SVDQuant (ICLR 2025) — low-rank branch absorbs 4-bit error, validated NVFP4

	## Credits

	- Quantization engine: [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant) by silveroxides
	- Z-Image Turbo model by [Tongyi-MAI](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)
	- ComfyUI integration via [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen)
	- Layer sensitivity analysis via [`quant_probe`](https://github.com/insecure-erasure/quant_probe)