InsecureErasure/Z-Image-Turbo-MXFP8

@@ -16,36 +16,29 @@ tags:
 ---
-# Z-Image Turbo — MXFP8 Uniform
-Uniform 8-bit microscaling quantization of [Z-Image Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) (6B S3-DiT), generated with [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant).
-**Format**: MXFP8 (8-bit E4M3 + E8M0 block scales) with minimal BF16 exclusions.
-**Size**: 6.23 GB (−46% vs BF16).
-**Inference**: ComfyUI + [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200).
 ![ZiT-MXFP8-01.png](images/ZiT-MXFP8-01.png)
 ![ZiT-MXFP8-02.png](images/ZiT-MXFP8-02.png)
----
-## Why MXFP8?
 At 8-bit E4M3 with microscaling (E8M0, block=32), the quantization grid has 256 values — 16× finer than NVFP4's 4-bit grid. The DiT quantization literature (PTQ4DiT, ViDiT-Q, SemanticDialect) and our own `quant_probe` analysis converge on the same conclusion:
-**At 8-bit weight-only, per-layer format selection is overkill.** The format itself is near-lossless. Learned rounding, LoRA error correction, and scale optimization - all critical at 4-bit - provide diminishing returns here.
-What _does_ matter: keeping a handful of architecturally critical layers in BF16. Everything else goes to MXFP8.
-### Key design decisions
 - **`--simple`**: skips learned rounding. Bias correction (always active) handles systematic error. Rounding noise at 8-bit is below perceptibility.
-- **No LoRA**: the residual quantization error at 8-bit is <0.1% MSE.
 - **8 exclusion patterns**: only the layers that `quant_probe` and the literature flag as critical.
----
-## BF16-excluded layers (8 patterns)
 | Category | Layers | Reason |
 |---|---|---|
@@ -55,11 +48,7 @@ What _does_ matter: keeping a handful of architecturally critical layers in BF16
 | Selected refiner FF | `context_refiner.1.w2`, `noise_refiner.1.{qkv,out,w2}` | Critical single-block projections |
 | Refiner up-projections | `noise_refiner.(0\|1).w3` | Noise refiner w3 expands features → direct output |
-### Everything else: MXFP8
-All other weight tensors — attention projections, feed-forward layers, early/mid-block modulations, refiner block 0 — use MXFP8 uniformly.
----
 ## Generation
@@ -75,15 +64,11 @@ convert_to_quant -i $1 \
   -o "${1%%.safetensors}-mxfp8.safetensors"
 ```
----
 ## Requirements
 - **Inference**: CUDA 13.0+, PyTorch 2.10+, [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx)
 - **Generation**: `convert_to_quant >= 1.2.6`, `comfy-kitchen`
----
 ## Methodology
 Layer sensitivity was analyzed using [`quant_probe`](https://github.com/insecure-erasure/quant_probe), which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend `*KEEP*`, `FP8`, or `NVFP4`.
@@ -96,10 +81,6 @@ Recommendations were cross-referenced against the DiT quantization literature:
 - **SemanticDialect** (2026) — block-wise mixed-format validated for video DiTs
 - **SVDQuant** (ICLR 2025) — low-rank branch absorbs 4-bit error, validated NVFP4
-The conclusion: at 8-bit weight-only, the format itself is sufficient. Surgical precision matters at 4-bit, not at 8-bit.
----
 ## Credits
 - Quantization engine: [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant) by silveroxides

 ---
+# Z-Image Turbo MXFP8
+Mixed 8-bit microscaling quantization of [Z-Image Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) (6B S3-DiT), generated with [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant).
+* **Format**: MXFP8 (8-bit E4M3 + E8M0 block scales) with minimal BF16 exclusions.
+* **Size**: 6.23 GB (−46% vs BF16).
+* **Inference**: ComfyUI + [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200).
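The size figure can be sanity-checked with a little arithmetic. The sketch below uses only the numbers stated above (8-bit elements, one 8-bit E8M0 scale per block of 32, 6.23 GB, −46%). The variable names are illustrative, and the implied BF16 size is a back-calculation from the rounded percentage, not a measured value.

```python
# Storage cost of MXFP8: 8-bit elements plus one shared 8-bit scale per 32 elements.
ELEMENT_BITS = 8
SCALE_BITS = 8
BLOCK = 32
BF16_BITS = 16

bits_per_weight = ELEMENT_BITS + SCALE_BITS / BLOCK   # 8.25
ideal_reduction = 1 - bits_per_weight / BF16_BITS     # if every tensor were MXFP8

# Figures reported in the card.
reported_size_gb = 6.23
reported_reduction = 0.46

# BF16 checkpoint size implied by the reported figures.
implied_bf16_gb = reported_size_gb / (1 - reported_reduction)

print(f"bits per weight:   {bits_per_weight}")
print(f"ideal reduction:   {ideal_reduction:.1%}")
print(f"implied BF16 size: {implied_bf16_gb:.2f} GB")
```

The ideal all-MXFP8 reduction comes out slightly above the reported 46%, which is consistent with a small share of tensors staying in BF16.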
 ![ZiT-MXFP8-01.png](images/ZiT-MXFP8-01.png)
 ![ZiT-MXFP8-02.png](images/ZiT-MXFP8-02.png)
+### Key design decisions
 At 8-bit E4M3 with microscaling (E8M0, block=32), the quantization grid has 256 values — 16× finer than NVFP4's 4-bit grid. The DiT quantization literature (PTQ4DiT, ViDiT-Q, SemanticDialect) and our own `quant_probe` analysis converge on the same conclusion:
+The format itself is near-lossless. Learned rounding, LoRA error correction, and scale optimization, all critical at 4-bit, provide diminishing returns here. What does matter is keeping a handful of architecturally critical layers in BF16; everything else goes to MXFP8.
 - **`--simple`**: skips learned rounding. Bias correction (always active) handles systematic error. Rounding noise at 8-bit is below perceptibility.
+- **No LoRA error correction**: the residual quantization error at 8-bit is <0.1% MSE.
 - **8 exclusion patterns**: only the layers that `quant_probe` and the literature flag as critical.
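To make the format concrete, here is a minimal NumPy simulation of MXFP8 weight quantization: E4M3 elements with one power-of-two (E8M0) scale per block of 32. It is a sketch under stated assumptions, not the `convert_to_quant` implementation. The scale rule follows the OCP Microscaling convention (`floor(log2(amax))` minus the largest E4M3 exponent), there is no bias correction, values are kept as floats rather than packed bytes, and the function names are made up for illustration.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite E4M3 magnitude
E4M3_EMAX = 8      # exponent of the top E4M3 binade (2**8 = 256)
BLOCK = 32         # elements sharing one E8M0 scale


def round_to_e4m3(x):
    """Round to the nearest E4M3-representable value, saturating at +-448."""
    x = np.clip(np.asarray(x, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    mag = np.abs(x)
    # Binade exponent; subnormals share the minimum normal exponent (-6).
    exp = np.floor(np.log2(np.maximum(mag, 2.0 ** -9)))
    exp = np.maximum(exp, -6.0)
    step = 2.0 ** (exp - 3.0)  # 3 mantissa bits -> 8 steps per binade
    return np.sign(x) * np.round(mag / step) * step


def mxfp8_quantize(w):
    """Return (q, scales); assumes w.size is a multiple of BLOCK."""
    blocks = np.asarray(w, dtype=np.float64).reshape(-1, BLOCK)
    amax = np.max(np.abs(blocks), axis=1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)  # avoid log2(0) on all-zero blocks
    exp = np.clip(np.floor(np.log2(safe)) - E4M3_EMAX, -127, 127)
    scales = 2.0 ** exp                   # E8M0: pure power of two
    return round_to_e4m3(blocks / scales), scales


def mxfp8_dequantize(q, scales, shape):
    return (q * scales).reshape(shape)


# Synthetic Gaussian weights, purely for illustration.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 128)).astype(np.float32)
q, scales = mxfp8_quantize(w)
w_hat = mxfp8_dequantize(q, scales, w.shape)
rel_mse = float(np.sum((w - w_hat) ** 2) / np.sum(w.astype(np.float64) ** 2))
print(f"relative MSE: {rel_mse:.2e}")
```

With this scale rule the largest element of a block can land above 448 and saturate. A real converter may round the scale up instead, or correct the resulting bias afterwards.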
+**BF16-excluded layers**
 | Category | Layers | Reason |
 |---|---|---|
 | Selected refiner FF | `context_refiner.1.w2`, `noise_refiner.1.{qkv,out,w2}` | Critical single-block projections |
 | Refiner up-projections | `noise_refiner.(0\|1).w3` | Noise refiner w3 expands features → direct output |
+All other weight tensors (attention projections, feed-forward layers, early/mid-block modulations, refiner block 0) use MXFP8.
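The split between BF16 and MXFP8 tensors can be pictured as a name filter. The sketch below transcribes only the table rows visible in this diff into regular expressions; the full card lists 8 patterns. The helper `keeps_bf16` and the example names are hypothetical, the table appears to use shorthand, and real checkpoint keys may carry extra path segments. How `convert_to_quant` actually receives its exclusions is not shown here.

```python
import re

# Patterns transcribed from the visible table rows only (shorthand names).
# Each one ends at a component boundary so that "w2" does not match "w20".
BF16_PATTERNS = [
    r"context_refiner\.1\.w2(?:\.|$)",
    r"noise_refiner\.1\.(?:qkv|out|w2)(?:\.|$)",
    r"noise_refiner\.(?:0|1)\.w3(?:\.|$)",
]
_COMPILED = [re.compile(p) for p in BF16_PATTERNS]


def keeps_bf16(tensor_name: str) -> bool:
    """True if the tensor name matches any exclusion pattern (hypothetical helper)."""
    return any(p.search(tensor_name) for p in _COMPILED)


# Illustrative names only; they are not read from the actual checkpoint.
for name in ["noise_refiner.0.w3", "noise_refiner.0.w2", "context_refiner.1.w2.weight"]:
    target = "BF16" if keeps_bf16(name) else "MXFP8"
    print(f"{name} -> {target}")
```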
 ## Generation
   -o "${1%%.safetensors}-mxfp8.safetensors"
 ```
 ## Requirements
 - **Inference**: CUDA 13.0+, PyTorch 2.10+, [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx)
 - **Generation**: `convert_to_quant >= 1.2.6`, `comfy-kitchen`
 ## Methodology
 Layer sensitivity was analyzed using [`quant_probe`](https://github.com/insecure-erasure/quant_probe), which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend `*KEEP*`, `FP8`, or `NVFP4`.
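The three per-tensor statistics named above can be sketched in NumPy as follows. This is not `quant_probe`'s code. Excess kurtosis is the standard definition, but the dynamic-range and aspect-ratio formulas are assumptions chosen for illustration, and the model-relative scoring step is left out.

```python
import numpy as np


def tensor_stats(w):
    """Per-tensor statistics for a 2-D weight matrix (illustrative definitions)."""
    x = np.asarray(w, dtype=np.float64).ravel()
    centered = x - x.mean()
    var = np.mean(centered ** 2)
    # Excess kurtosis: 0 for a Gaussian, larger for heavy-tailed (outlier-prone) weights.
    excess_kurtosis = np.mean(centered ** 4) / var ** 2 - 3.0
    # Assumed definition: log2 ratio of largest to smallest nonzero magnitude.
    mag = np.abs(x)
    nonzero = mag[mag > 0]
    dynamic_range_bits = float(np.log2(nonzero.max() / nonzero.min()))
    # Assumed definition: longer side over shorter side.
    aspect_ratio = max(w.shape) / min(w.shape)
    return {
        "excess_kurtosis": float(excess_kurtosis),
        "dynamic_range_bits": dynamic_range_bits,
        "aspect_ratio": aspect_ratio,
    }


# Synthetic tensors: a Gaussian one and a heavy-tailed (Laplace) one.
rng = np.random.default_rng(0)
gauss_stats = tensor_stats(rng.normal(size=(400, 500)))
laplace_stats = tensor_stats(rng.laplace(size=(400, 500)))
print(gauss_stats)
print(laplace_stats)
```

A heavy-tailed tensor shows clearly higher excess kurtosis than a Gaussian one, which is the kind of signal that would push a layer towards a `*KEEP*` recommendation.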
 - **SemanticDialect** (2026) — block-wise mixed-format validated for video DiTs
 - **SVDQuant** (ICLR 2025) — low-rank branch absorbs 4-bit error, validated NVFP4
 ## Credits
 - Quantization engine: [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant) by silveroxides