---
license: apache-2.0
language:
- en
- zh
base_model:
- Tongyi-MAI/Z-Image-Turbo
base_model_relation: quantized
pipeline_tag: text-to-image
library_name: diffusers
tags:
- comfyui
- quantization
- nvfp4
- txt2img
---
# Z-Image Turbo - NVFP4 Mixed-Precision
Surgical mixed-precision quantization of [Z-Image Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) (6B S3-DiT), generated with [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant).
**Formats**: NVFP4 (baseline) + MXFP8 (sensitive layers) + BF16 (critical layers).
**Size**: 4.84 GB (-58% vs BF16).
**Inference**: ComfyUI + [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200).
Also available: [MXFP8 uniform quantization](https://huggingface.co/InsecureErasure/Z-Image-Turbo-MXFP8) (6.23 GB, near-lossless).


* **Prompt:**
```
A bust portrait of a woman in her mid-twenties with messy dark hair tied in a loose bun, wearing a worn denim jacket over a gray hoodie.
She is leaning her elbows on a washing machine, her chin resting on her folded hands. Behind her, a row of industrial dryers against a tiled wall,
with one dryer door hanging open. Above the dryers, a handwritten sign taped to the wall says 'OUT OF ORDER' in black marker,
with a small smiley face drawn on it. To her left, a plastic basket overflows with unfolded clothes. To her right, a vending machine glows green,
displaying 'SOAP $1.50' on a small digital screen. The light is cool and buzzing, like fluorescent tubes overhead. She looks tired but amused
with a faint smirk.
```
* **Sampler/Scheduler:** Euler/Simple
* **Steps:** 9
* **CFG:** 1.0
* **Shift:** 3.0
* **Seed:** 920698660737993
* **Resolution:** 1024 x 1536
## Strategy
Layer assignments combine per-layer sensitivity analysis via [`quant_probe`](https://github.com/insecure-erasure/quant_probe) with findings from the DiT quantization literature (PTQ4DiT, ViDiT-Q, SemanticDialect, SVDQuant) to maximize quality per byte:
- **~190 tensors → NVFP4** (4-bit E2M1): baseline for most attention + FF weights
- **~100 tensors → MXFP8** (8-bit E4M3 elements + E8M0 block scales): attention outputs, gate projections (w1), mid-block adaLN
- **~20 tensors → BF16**: last QKV, late adaLN modulations, refiner outputs
- **~110 tensors → BF16**: norms, biases, embeddings (auto-excluded by `--zimage`)
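
As a rough guide to what each format costs, the storage per weight can be estimated as the element width plus the amortized block scale. The sketch below assumes the standard block layouts (NVFP4: 16-element blocks sharing an 8-bit FP8 scale; MXFP8: 32-element blocks sharing an 8-bit E8M0 scale) and ignores per-tensor scales and file metadata. `bits_per_weight` is a helper name chosen here for illustration, not part of any tool mentioned in this card.

```python
def bits_per_weight(element_bits, scale_bits, block_size):
    """Element width plus the block scale amortized over the block."""
    return element_bits + scale_bits / block_size


# Assumed block layouts (see the note above).
nvfp4_bits = bits_per_weight(element_bits=4, scale_bits=8, block_size=16)
mxfp8_bits = bits_per_weight(element_bits=8, scale_bits=8, block_size=32)
bf16_bits = 16.0

for name, bits in [("NVFP4", nvfp4_bits), ("MXFP8", mxfp8_bits), ("BF16", bf16_bits)]:
    print(f"{name}: {bits} bits/weight, {bits / bf16_bits:.0%} of BF16")
```

Promoting a tensor from NVFP4 to MXFP8 therefore costs a little under twice the storage for that tensor, which is why only the layers flagged as sensitive are promoted.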
### MXFP8-protected layers
| Category | Blocks | Layers |
|---|---|---|
| Early attention outputs | 0, 1 | `attention.out` |
| Selected QKV projections | 10, 16, 26, 27, 28 | `attention.qkv` |
| Attention outputs | 3, 6, 9, 11–14, 19, 20, 26–29 | `attention.out` |
| Gate projections (w1) | 3–29 | `feed_forward.w1` |
| Mid-block modulations | 16–21 | `adaLN_modulation.0` |
### BF16-protected layers
| Category | Layers | Reason |
|---|---|---|
| Last QKV | `layers.29.attention.qkv` | Feeds directly into `final_layer`, with no downstream compensation |
| Late modulations | `layers.(22–29).adaLN_modulation.0` | Controls scale/shift of features near output |
| Refiner attention outputs | `context_refiner.(0\|1).attention.out` | Only 2 refiner blocks, so outputs have outsized impact |
| Selected refiner projections | `context_refiner.1.w2`, `noise_refiner.1.{qkv,out,w2}` | Critical single-block projections |
| Refiner up-projections | `noise_refiner.(0\|1).w3` | Noise refiner w3 expands features that feed the output directly |
### Refiner sub-graphs
| Sub-graph | Block 0 | Block 1 |
|---|---|---|
| `context_refiner` | qkv + w1 + w2 + w3 MXFP8, out BF16 | qkv + w1 + w3 MXFP8, out + w2 BF16 |
| `noise_refiner` | qkv + out + w1 + w2 MXFP8, w3 BF16 | qkv + out + w2 + w3 BF16, w1 MXFP8 |
## Generation
```bash
#!/bin/bash
# NVFP4 baseline + MXFP8 for sensitive layers + BF16 at critical points.
# Refiners: block 0 mostly MXFP8 (context attention.out and noise w3 stay BF16); block 1 outputs kept in BF16.
# Last QKV (layer 29), late adaLN (22-29), and refiner outputs in BF16.
# Main-trunk w1 (gate) projections in layers 3-29 in MXFP8.
convert_to_quant -i "$1" \
--nvfp4 --zimage --comfy_quant --save-quant-metadata \
--custom-type mxfp8 \
--custom-layers "layers\.(10|16|26)\.attention\.qkv\.weight|layers\.(27|28)\.attention\.qkv\.weight|layers\.(0|1)\.attention\.out\.weight|layers\.(3|6|9|11|12|13|14|19|20|26)\.attention\.out\.weight|layers\.(27|28|29)\.attention\.out\.weight|layers\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26)\.feed_forward\.w1\.weight|layers\.(27|28|29)\.feed_forward\.w1\.weight|layers\.(16|17|18|19|20|21)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.qkv\.weight|context_refiner\.(0|1)\.feed_forward\.w1\.weight|context_refiner\.(0|1)\.feed_forward\.w2\.weight|context_refiner\.(0|1)\.feed_forward\.w3\.weight|noise_refiner\.(0)\.attention\.(qkv|out)\.weight|noise_refiner\.(0)\.feed_forward\.(w1|w2)\.weight|noise_refiner\.(1)\.feed_forward\.w1\.weight" \
--exclude-layers "layers\.(29)\.attention\.qkv\.weight|layers\.(22|23|24|25|26)\.adaLN_modulation\.0\.weight|layers\.(27|28|29)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.out\.weight|context_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(1)\.attention\.qkv\.weight|noise_refiner\.(1)\.attention\.out\.weight|noise_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(0|1)\.feed_forward\.w3\.weight" \
--num-iter 6000 --top-p 0.35 --calib-samples 8192 \
--scale-optimization iterative --scale-refinement-rounds 2 \
--extract-lora --lora-rank 32 \
-o "${1%%.safetensors}-nvfp4.safetensors"
```
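
To sanity-check which format a given weight ends up in, the two regex arguments above can be condensed into a small lookup. This is a sketch, not `convert_to_quant` code: it assumes patterns are applied with search semantics and that `--exclude-layers` wins when a name matches both lists (`context_refiner.1.feed_forward.w2.weight` does, and the tables above list it as BF16). It only covers the quantizable weight matrices, not the norms, biases, and embeddings that `--zimage` excludes automatically. `assigned_format` is a hypothetical helper name.

```python
import re

# Condensed from --exclude-layers: these stay in BF16.
BF16 = re.compile(
    r"layers\.29\.attention\.qkv\.weight"
    r"|layers\.2[2-9]\.adaLN_modulation\.0\.weight"
    r"|context_refiner\.[01]\.attention\.out\.weight"
    r"|context_refiner\.1\.feed_forward\.w2\.weight"
    r"|noise_refiner\.1\.attention\.(qkv|out)\.weight"
    r"|noise_refiner\.1\.feed_forward\.w2\.weight"
    r"|noise_refiner\.[01]\.feed_forward\.w3\.weight"
)

# Condensed from --custom-layers: these are promoted to MXFP8.
MXFP8 = re.compile(
    r"layers\.(10|16|26|27|28)\.attention\.qkv\.weight"
    r"|layers\.(0|1|3|6|9|1[1-4]|19|20|2[6-9])\.attention\.out\.weight"
    r"|layers\.([3-9]|[12]\d)\.feed_forward\.w1\.weight"
    r"|layers\.(1[6-9]|2[01])\.adaLN_modulation\.0\.weight"
    r"|context_refiner\.[01]\.attention\.qkv\.weight"
    r"|context_refiner\.[01]\.feed_forward\.w[123]\.weight"
    r"|noise_refiner\.0\.attention\.(qkv|out)\.weight"
    r"|noise_refiner\.0\.feed_forward\.w[12]\.weight"
    r"|noise_refiner\.1\.feed_forward\.w1\.weight"
)


def assigned_format(tensor_name):
    """Return the expected format for a quantizable weight tensor."""
    if BF16.search(tensor_name):  # assumption: exclusions take precedence
        return "BF16"
    if MXFP8.search(tensor_name):
        return "MXFP8"
    return "NVFP4"  # baseline for everything else


for name in (
    "layers.5.attention.qkv.weight",
    "layers.28.attention.qkv.weight",
    "layers.29.attention.qkv.weight",
    "context_refiner.1.feed_forward.w2.weight",
):
    print(f"{name}: {assigned_format(name)}")
```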
### Included files
| File | Description |
|---|---|
| `z_image_turbo_nvfp4.safetensors` | Quantized weights |
| `z_image_turbo_nvfp4_lora.safetensors` | Error-correction LoRA (rank 32) |
Load the LoRA in ComfyUI and adjust its strength to improve fidelity.
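
One way to read the strength setting, assuming the LoRA was extracted as a rank-32 approximation of the quantization residual (the idea behind the SVDQuant low-rank branch cited under Methodology; the exact procedure behind `--extract-lora` is not described here, and the symbols below are illustrative):

$$
W \approx Q(W) + s \, B A, \qquad B A \approx W - Q(W), \qquad \operatorname{rank}(B A) \le 32
$$

Here $W$ is the original BF16 weight, $Q(W)$ its quantized version, and $s$ the LoRA strength set in ComfyUI. Under this assumption, $s = 1$ applies the full correction and $s = 0$ disables it.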
## Requirements
- **Inference**: CUDA 13.0+, PyTorch 2.10+, [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200)
- **Generation**: `convert_to_quant >= 1.2.6`, `comfy-kitchen`
## Comparison
| | NVFP4 Mixed (this) | MXFP8 Uniform | Official NVFP4 |
| --- | --- | --- | --- |
| Size | 4.84 GB | 6.23 GB | 4.51 GB |
| Base format | NVFP4 (4-bit) | MXFP8 (8-bit) | NVFP4 (4-bit) |
| Custom layers | ~100 tensors → MXFP8 | None | None |
| BF16 exclusions | ~20 tensors | 8 patterns | Refiners fully BF16 |
| Learned rounding | ✅ 6000 iter | ❌ `--simple` | ❌ |
| LoRA | ✅ rank 32 | ❌ | ❌ |
| Refiner block 0 | MXFP8 | MXFP8 | BF16 |
| Late adaLN (22–29) | BF16 | BF16 | NVFP4 ⚠️ |
| Last QKV (layer 29) | BF16 | BF16 | NVFP4 ⚠️ |
| Quantization time¹ | ~60–90 min | ~5–10 min | N/A |

¹ Estimated on RTX 5060 (Blackwell) with `comfy-kitchen` CUDA kernels.
## Methodology
Layer sensitivity was analyzed using [`quant_probe`](https://github.com/insecure-erasure/quant_probe), which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend `*KEEP*`, `FP8`, or `NVFP4`.
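
The three statistics can be illustrated with a small pure-Python sketch. This is not `quant_probe` code: the definitions of dynamic range (log2 of largest over smallest nonzero magnitude) and aspect ratio (rows over columns) are assumptions made for illustration, and `tensor_stats` is a hypothetical helper name.

```python
import math


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (0 for a Gaussian)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0


def tensor_stats(weight):
    """weight: 2-D list of floats (rows x cols)."""
    flat = [v for row in weight for v in row]
    nonzero = [abs(v) for v in flat if v != 0.0]
    return {
        "excess_kurtosis": excess_kurtosis(flat),
        # Assumed definition: log2(max |w| / min nonzero |w|).
        "dynamic_range_log2": math.log2(max(nonzero) / min(nonzero)),
        # Assumed definition: rows / cols.
        "aspect_ratio": len(weight) / len(weight[0]),
    }


# A flat distribution versus one with a single large outlier.
well_behaved = [[-1.0, 1.0, -1.0, 1.0], [1.0, -1.0, 1.0, -1.0]]
with_outlier = [[0.1, -0.1, 0.1, -0.1], [0.1, -0.1, 0.1, 8.0]]

print(tensor_stats(well_behaved))
print(tensor_stats(with_outlier))
```

The outlier tensor scores higher on both kurtosis and dynamic range, which is the kind of signal that would argue for keeping a tensor at higher precision.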
Recommendations were cross-referenced against the DiT quantization literature:
- **PTQ4DiT** (NeurIPS 2024): salient channels in QKV + FFN, last blocks most affected
- **ViDiT-Q** (ICLR 2025): metric-decoupled sensitivity, self-attention dominates visual quality
- **HTG** (2025): channel-dependent outliers, severe in later blocks
- **SemanticDialect** (2026): block-wise mixed-format validated for video DiTs
- **SVDQuant** (ICLR 2025): low-rank branch absorbs 4-bit error, validated NVFP4
## Credits
- Quantization engine: [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant) by silveroxides
- Z-Image Turbo model by [Tongyi-MAI](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)
- ComfyUI integration via [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen)
- Layer sensitivity analysis via [`quant_probe`](https://github.com/insecure-erasure/quant_probe)