---
license: apache-2.0
language:
- en
- zh
base_model:
- Tongyi-MAI/Z-Image-Turbo
base_model_relation: quantized
pipeline_tag: text-to-image
library_name: diffusers
tags:
- comfyui
- quantization
- mxfp8
- txt2img
---
# Z-Image Turbo MXFP8
Mixed 8-bit microscaling quantization of [Z-Image Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) (6B S3-DiT), generated with [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant).
* **Format**: MXFP8 (8-bit E4M3 + E8M0 block scales) with minimal BF16 exclusions.
* **Size**: 6.23 GB (46% smaller than BF16).
* **Inference**: ComfyUI + [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200).
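As a rough sanity check on the size figure: MXFP8 stores one 8-bit E8M0 scale per 32-element block, so a fully quantized tensor costs 8 + 8/32 bits per weight against 16 for BF16. The sketch below only does that arithmetic; it makes no assumption about the actual file layout or metadata overhead.

```python
# Back-of-the-envelope storage cost of MXFP8 relative to BF16.
BLOCK = 32        # elements sharing one block scale
ELEM_BITS = 8     # one E4M3 element
SCALE_BITS = 8    # one E8M0 scale per block
BF16_BITS = 16

mxfp8_bits = ELEM_BITS + SCALE_BITS / BLOCK    # bits per weight, scale included
ideal_saving = 1 - mxfp8_bits / BF16_BITS      # fraction saved if every tensor were MXFP8

print(f"{mxfp8_bits:.2f} bits/weight, ideal saving {ideal_saving:.1%}")
```

The reported 46% sits slightly below this ideal, which is what one would expect when some tensors stay in BF16; the exact gap also depends on non-quantized parameters and file metadata.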


## Key design decisions
At 8-bit E4M3 with microscaling (E8M0, block=32), the quantization grid has 256 code points per element, 16× as many as NVFP4's 16-point 4-bit grid. The DiT quantization literature (PTQ4DiT, ViDiT-Q, SemanticDialect) and our own `quant_probe` analysis converge on the same conclusion:
The format itself is near-lossless. Learned rounding, LoRA error correction, and scale optimization - all critical at 4-bit - provide diminishing returns here. The recipe therefore keeps a handful of architecturally critical layers in BF16 and quantizes everything else to MXFP8.
- **`--simple`**: skips learned rounding. Bias correction (always active) handles systematic error. Rounding noise at 8-bit is below perceptibility.
- **No low-rank LoRA correction**: the residual quantization error at 8-bit is <0.1% MSE, leaving little for a low-rank branch to absorb.
- **8 exclusion patterns**: only the layers that `quant_probe` and the literature flag as critical.
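To make the format concrete, here is a minimal pure-Python sketch of an MXFP8 round trip on 32-element blocks. It assumes the shared-exponent rule from the OCP Microscaling specification (`floor(log2(amax)) - emax_elem`) and saturating round-to-nearest-even into E4M3; whether `convert_to_quant` or `comfy-kitchen` use exactly this rule is an assumption, and the function names are illustrative only.

```python
import math
import random

BLOCK = 32          # elements per shared scale
E4M3_MAX = 448.0    # largest finite E4M3 value (1.75 * 2**8)
E4M3_EMAX = 8       # exponent of the largest E4M3 binade
E4M3_EMIN = -6      # smallest normal exponent; below it, spacing is fixed

def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3 value, saturating at +/-448."""
    x = max(-E4M3_MAX, min(E4M3_MAX, x))
    if x == 0.0:
        return 0.0
    exp = max(math.frexp(abs(x))[1] - 1, E4M3_EMIN)  # floor(log2|x|), clamped
    step = 2.0 ** (exp - 3)                          # 3 mantissa bits
    return round(x / step) * step                    # ties go to even

def mxfp8_roundtrip(block):
    """Quantize one 32-element block to MXFP8 and dequantize it again."""
    assert len(block) == BLOCK
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * BLOCK
    shared_exp = math.frexp(amax)[1] - 1 - E4M3_EMAX
    shared_exp = max(-127, min(127, shared_exp))     # E8M0 exponent range
    scale = 2.0 ** shared_exp
    return [round_to_e4m3(v / scale) * scale for v in block]

# Synthetic Gaussian "weights"; real layers have different statistics.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK * 256)]
restored = []
for i in range(0, len(weights), BLOCK):
    restored.extend(mxfp8_roundtrip(weights[i:i + BLOCK]))

err = sum((a - b) ** 2 for a, b in zip(weights, restored))
ref = sum(a * a for a in weights)
print(f"relative MSE: {err / ref:.2e}")
```

The printed figure is for synthetic data only and says nothing about the error measured on the actual model; it is meant to show where the rounding and saturation error comes from.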
**BF16-excluded layers**
| Category | Layers | Reason |
|---|---|---|
| Last QKV | `layers.29.attention.qkv` | Feeds directly into `final_layer`; no downstream compensation |
| Late modulations | `layers.(22-29).adaLN_modulation.0` | Control scale/shift of features near the output |
| Refiner attention outputs | `context_refiner.(0\|1).attention.out` | Only 2 refiner blocks, so their outputs have outsized impact |
| Selected refiner FF | `context_refiner.1.w2`, `noise_refiner.1.{qkv,out,w2}` | Critical single-block projections |
| Refiner up-projections | `noise_refiner.(0\|1).w3` | Noise refiner w3 expands features that feed directly into the output |
All other weight tensors (attention projections, feed-forward layers, early/mid-block modulations, and the remaining refiner weights) use MXFP8.
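The table can be checked against the `--exclude-layers` pattern from the Generation script with a few lines of Python. This sketch assumes the tool matches the pattern anywhere in the tensor name (`re.search` semantics), and the tensor names used here are hypothetical examples inferred from the pattern itself, not read from the checkpoint.

```python
import re

# Alternatives copied from the --exclude-layers argument in the Generation script.
EXCLUDE_PATTERNS = [
    r"layers\.(29)\.attention\.qkv\.weight",
    r"layers\.(22|23|24|25|26)\.adaLN_modulation\.0\.weight",
    r"layers\.(27|28|29)\.adaLN_modulation\.0\.weight",
    r"context_refiner\.(0|1)\.attention\.out\.weight",
    r"context_refiner\.(1)\.feed_forward\.w2\.weight",
    r"noise_refiner\.(1)\.attention\.qkv\.weight",
    r"noise_refiner\.(1)\.attention\.out\.weight",
    r"noise_refiner\.(1)\.feed_forward\.w2\.weight",
    r"noise_refiner\.(0|1)\.feed_forward\.w3\.weight",
]
EXCLUDE = re.compile("|".join(EXCLUDE_PATTERNS))

def is_excluded(tensor_name: str) -> bool:
    """True if the tensor would stay in BF16 (assuming search semantics)."""
    return EXCLUDE.search(tensor_name) is not None

# Hypothetical tensor names for illustration.
for name in [
    "layers.29.attention.qkv.weight",
    "layers.28.attention.qkv.weight",
    "layers.21.adaLN_modulation.0.weight",
    "layers.22.adaLN_modulation.0.weight",
    "noise_refiner.0.feed_forward.w3.weight",
    "noise_refiner.0.feed_forward.w2.weight",
]:
    print(f"{name:45s} {'BF16' if is_excluded(name) else 'MXFP8'}")
```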
## Generation
```bash
#!/bin/bash
# MXFP8 8-bit microscaling - near-lossless, no learned rounding needed.
# Late adaLN (22-29), last QKV (layer 29), and refiner outputs in BF16.
convert_to_quant -i "$1" \
--mxfp8 --zimage --comfy_quant --save-quant-metadata \
--simple --low-memory \
--calib-samples 8192 \
--exclude-layers "layers\.(29)\.attention\.qkv\.weight|layers\.(22|23|24|25|26)\.adaLN_modulation\.0\.weight|layers\.(27|28|29)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.out\.weight|context_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(1)\.attention\.qkv\.weight|noise_refiner\.(1)\.attention\.out\.weight|noise_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(0|1)\.feed_forward\.w3\.weight" \
-o "${1%%.safetensors}-mxfp8.safetensors"
```
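After conversion, one way to confirm which tensors stayed in BF16 is to read the safetensors header directly: the file starts with an 8-byte little-endian header length followed by a JSON header listing each tensor's `dtype`. The sketch below uses only the standard library. How `convert_to_quant` labels the MXFP8 payload and its block scales in that header is not stated here, so the summary simply reports whatever dtype strings it finds; the helper names are illustrative.

```python
import json
import struct
from collections import Counter

def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file, without the metadata entry."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)                    # free-form metadata, not a tensor
    return header

def dtype_summary(path: str) -> Counter:
    """Count tensors per dtype string as recorded in the header."""
    return Counter(info["dtype"] for info in read_safetensors_header(path).values())

def tensors_with_dtype(path: str, dtype: str) -> list:
    """List tensor names stored with the given dtype string, e.g. 'BF16'."""
    header = read_safetensors_header(path)
    return sorted(name for name, info in header.items() if info["dtype"] == dtype)
```

Calling `tensors_with_dtype(path, "BF16")` on the converted file should then list the excluded layers, along with any other tensors the tool leaves unquantized.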
## Requirements
- **Inference**: CUDA 13.0+, PyTorch 2.10+, [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx)
- **Generation**: `convert_to_quant >= 1.2.6`, `comfy-kitchen`
## Methodology
Layer sensitivity was analyzed using [`quant_probe`](https://github.com/insecure-erasure/quant_probe), which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend `*KEEP*`, `FP8`, or `NVFP4`.
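The three per-tensor statistics can be sketched as follows. This is not `quant_probe`'s code: the definitions of dynamic range (log2 of the largest over the smallest nonzero magnitude) and aspect ratio (longer over shorter dimension) are assumptions chosen for illustration, and the scoring step against the model-wide distribution is not reproduced.

```python
import math
import random

def tensor_stats(rows):
    """Statistics for a 2-D weight matrix given as a list of equal-length rows."""
    flat = [v for row in rows for v in row]
    n = len(flat)
    mean = sum(flat) / n
    m2 = sum((v - mean) ** 2 for v in flat) / n      # variance
    m4 = sum((v - mean) ** 4 for v in flat) / n      # fourth central moment
    magnitudes = [abs(v) for v in flat if v != 0.0]
    n_rows, n_cols = len(rows), len(rows[0])
    return {
        "excess_kurtosis": m4 / (m2 * m2) - 3.0,     # 0 for a Gaussian
        "dynamic_range_log2": math.log2(max(magnitudes) / min(magnitudes)),
        "aspect_ratio": max(n_rows, n_cols) / min(n_rows, n_cols),
    }

# Synthetic example: a Gaussian matrix, and a copy with one large outlier.
random.seed(0)
smooth = [[random.gauss(0.0, 0.02) for _ in range(128)] for _ in range(64)]
spiky = [row[:] for row in smooth]
spiky[0][0] = 1.0

for label, tensor in (("smooth", smooth), ("spiky", spiky)):
    print(label, tensor_stats(tensor))
```

A single outlier drives the excess kurtosis up sharply, which is the kind of heavy-tailed tensor that block scaling handles poorly and that a probe would flag for higher precision.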
Recommendations were cross-referenced against the DiT quantization literature:
- **PTQ4DiT** (NeurIPS 2024): salient channels in QKV + FFN, last blocks most affected
- **ViDiT-Q** (ICLR 2025): metric-decoupled sensitivity; self-attention dominates visual quality
- **HTG** (2025): channel-dependent outliers, severe in later blocks
- **SemanticDialect** (2026): block-wise mixed-format validated for video DiTs
- **SVDQuant** (ICLR 2025): low-rank branch absorbs 4-bit error, validated NVFP4
## Credits
- Quantization engine: [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant) by silveroxides
- Z-Image Turbo model by [Tongyi-MAI](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)
- ComfyUI integration via [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen)
- Layer sensitivity analysis via [`quant_probe`](https://github.com/insecure-erasure/quant_probe)