---
license: apache-2.0
language:
- en
- zh
base_model:
- Tongyi-MAI/Z-Image-Turbo
base_model_relation: quantized
pipeline_tag: text-to-image
library_name: diffusers
tags:
- comfyui
- quantization
- mxfp8
- txt2img
---

# Z-Image Turbo MXFP8

Mixed 8-bit microscaling quantization of [Z-Image Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) (6B S3-DiT), generated with [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant).

* **Format**: MXFP8 (8-bit E4M3 + E8M0 block scales) with minimal BF16 exclusions.
* **Size**: 6.23 GB (−46% vs BF16).
* **Inference**: ComfyUI + [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200).

![ZiT-MXFP8-01.png](images/ZiT-MXFP8-01.png)
![ZiT-MXFP8-02.png](images/ZiT-MXFP8-02.png)

### Key design decisions

At 8-bit E4M3 with microscaling (E8M0, block=32), the quantization grid has 256 values, 16× finer than NVFP4's 4-bit grid. The DiT quantization literature (PTQ4DiT, ViDiT-Q, SemanticDialect) and our own `quant_probe` analysis converge on the same conclusion: the format itself is near-lossless. Learned rounding, LoRA error correction, and scale optimization, all critical at 4-bit, provide diminishing returns here. The recipe therefore keeps only a handful of architecturally critical layers in BF16; everything else goes to MXFP8.

- **`--simple`**: skips learned rounding. Bias correction (always active) handles systematic error. Rounding noise at 8-bit is below perceptibility.
- **No low-rank LoRA correction**: the residual quantization error at 8-bit is <0.1% MSE.
- **8 exclusion patterns**: only the layers that `quant_probe` and the literature flag as critical.

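
To make the format concrete, here is a minimal pure-Python sketch of MXFP8 block quantization: 32 values share one power-of-two (E8M0) scale, and each value is rounded to the E4M3 grid. The scale rule used here (align the block maximum with E4M3's top binade) is one common choice and an assumption on our part; `convert_to_quant` may pick scales differently, and all helper names below are illustrative rather than part of any library.

```python
import math

BLOCK = 32          # elements sharing one E8M0 scale
E4M3_MAX = 448.0    # largest finite E4M3 magnitude (1.75 * 2**8)
E4M3_EMAX = 8       # exponent of E4M3's top binade
E4M3_EMIN = -6      # smallest normal exponent; smaller values are subnormal
MANTISSA_BITS = 3


def floor_log2(x):
    """Exact floor(log2(x)) for a positive float."""
    _, e = math.frexp(x)  # x = m * 2**e with 0.5 <= m < 1
    return e - 1


def round_to_e4m3(v):
    """Round v to the nearest E4M3-representable value, saturating at +-448."""
    if v == 0.0:
        return 0.0
    mag = min(abs(v), E4M3_MAX)
    exp = max(floor_log2(mag), E4M3_EMIN)
    step = 2.0 ** (exp - MANTISSA_BITS)  # grid spacing inside this binade
    return math.copysign(min(round(mag / step) * step, E4M3_MAX), v)


def quantize_block(block):
    """Return (shared_exp, elements) for one block of floats (assumed rule)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Align the block maximum with E4M3's top binade, clamped to the E8M0 range.
    shared_exp = max(-127, min(127, floor_log2(amax) - E4M3_EMAX))
    scale = 2.0 ** shared_exp
    return shared_exp, [round_to_e4m3(x / scale) for x in block]


def dequantize_block(shared_exp, elements):
    """Reconstruct approximate floats from a shared exponent and E4M3 values."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


# Example: quantize one block and measure the worst-case round-trip error.
weights = [0.01 * (i - 16) for i in range(BLOCK)]
shared_exp, elems = quantize_block(weights)
restored = dequantize_block(shared_exp, elems)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(shared_exp, max_err)
```

Under this layout each weight costs 8 + 8/32 = 8.25 bits, roughly 48% less than BF16 before the BF16 exclusions are counted, which is in line with the −46% figure above.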

**BF16-excluded layers**

| Category | Layers | Reason |
|---|---|---|
| Last QKV | `layers.29.attention.qkv` | Feeds directly into `final_layer`; no downstream compensation |
| Late modulations | `layers.(22–29).adaLN_modulation.0` | Controls scale/shift of features near output |
| Refiner attention outputs | `context_refiner.(0\|1).attention.out` | Only 2 refiner blocks, so outputs have outsized impact |
| Selected refiner projections | `context_refiner.1.feed_forward.w2`, `noise_refiner.1.attention.{qkv,out}`, `noise_refiner.1.feed_forward.w2` | Critical single-block projections |
| Refiner up-projections | `noise_refiner.(0\|1).feed_forward.w3` | Noise refiner w3 expands features → direct output |

All other weight tensors (attention projections, feed-forward layers, early/mid-block modulations, and the remaining refiner projections) use MXFP8.

## Generation

```bash
#!/bin/bash
# MXFP8 8-bit microscaling - near-lossless, no learned rounding needed.
# Late adaLN (22-29), last QKV (layer 29), and refiner outputs in BF16.
# Usage: pass the source .safetensors path as the first argument.
convert_to_quant -i "$1" \
    --mxfp8 --zimage --comfy_quant --save-quant-metadata \
    --simple --low-memory \
    --calib-samples 8192 \
    --exclude-layers "layers\.(29)\.attention\.qkv\.weight|layers\.(22|23|24|25|26)\.adaLN_modulation\.0\.weight|layers\.(27|28|29)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.out\.weight|context_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(1)\.attention\.qkv\.weight|noise_refiner\.(1)\.attention\.out\.weight|noise_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(0|1)\.feed_forward\.w3\.weight" \
    -o "${1%%.safetensors}-mxfp8.safetensors"
```

## Requirements

- **Inference**: CUDA 13.0+, PyTorch 2.10+, [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200)
- **Generation**: `convert_to_quant >= 1.2.6`, `comfy-kitchen`

## Methodology

Layer sensitivity was analyzed using [`quant_probe`](https://github.com/insecure-erasure/quant_probe), which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution
to recommend `*KEEP*`, `FP8`, or `NVFP4`. Recommendations were cross-referenced against the DiT quantization literature:

- **PTQ4DiT** (NeurIPS 2024): salient channels in QKV + FFN, last blocks most affected
- **ViDiT-Q** (ICLR 2025): metric-decoupled sensitivity; self-attention dominates visual quality
- **HTG** (2025): channel-dependent outliers, severe in later blocks
- **SemanticDialect** (2026): block-wise mixed-format validated for video DiTs
- **SVDQuant** (ICLR 2025): low-rank branch absorbs 4-bit error, validated NVFP4

## Credits

- Quantization engine: [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant) by silveroxides
- Z-Image Turbo model by [Tongyi-MAI](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)
- ComfyUI integration via [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen)
- Layer sensitivity analysis via [`quant_probe`](https://github.com/insecure-erasure/quant_probe)