---
license: apache-2.0
language:
- en
- zh
base_model:
- Tongyi-MAI/Z-Image-Turbo
base_model_relation: quantized
pipeline_tag: text-to-image
library_name: diffusers
tags:
- comfyui
- quantization
- nvfp4
- txt2img
---
# Z-Image Turbo - NVFP4 Mixed-Precision

Surgical mixed-precision quantization of [Z-Image Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) (6B S3-DiT), generated with [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant).

**Formats**: NVFP4 (baseline) + MXFP8 (sensitive layers) + BF16 (critical layers).

**Size**: 4.84 GB (-58% vs BF16).

**Inference**: ComfyUI + [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200).

Also available: [MXFP8 uniform quantization](https://huggingface.co/InsecureErasure/Z-Image-Turbo-MXFP8) (6.23 GB, near-lossless).

* **Prompt:**
```
A bust portrait of a woman in her mid-twenties with messy dark hair tied in a loose bun, wearing a worn denim jacket over a gray hoodie.
She is leaning her elbows on a washing machine, her chin resting on her folded hands. Behind her, a row of industrial dryers against a tiled wall,
with one dryer door hanging open. Above the dryers, a handwritten sign taped to the wall says 'OUT OF ORDER' in black marker,
with a small smiley face drawn on it. To her left, a plastic basket overflows with unfolded clothes. To her right, a vending machine glows green,
displaying 'SOAP $1.50' on a small digital screen. The light is cool and buzzing, like fluorescent tubes overhead. She looks tired but amused
with a faint smirk.
```
* **Sampler/Scheduler:** Euler/Simple
* **Steps:** 9
* **CFG:** 1.0
* **Shift:** 3.0
* **Seed:** 920698660737993
* **Resolution:** 1024 x 1536
## Strategy

Uses per-layer sensitivity analysis via [`quant_probe`](https://github.com/insecure-erasure/quant_probe) and the DiT quantization literature (PTQ4DiT, ViDiT-Q, SemanticDialect, SVDQuant) to maximize quality-per-byte:

- **~190 tensors → NVFP4** (4-bit E2M1): baseline for most attention + FF weights
- **~100 tensors → MXFP8** (8-bit E4M3 + E8M0): attention outputs, gate projections (w1), mid-block adaLN
- **~20 tensors → BF16**: last QKV, late adaLN modulations, refiner outputs
- **~110 tensors → BF16**: norms, biases, embeddings (auto-excluded by `--zimage`)
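
To make the 4-bit E2M1 baseline concrete, here is a minimal sketch of block-scaled rounding onto the E2M1 value grid. It is an illustration under stated assumptions, not the `convert_to_quant` implementation: `quantize_block` and `dequantize_block` are hypothetical helper names, the block scale is kept as a plain Python float (NVFP4 stores it in FP8 per 16-element block), and it uses plain round-to-nearest rather than the learned rounding listed in the comparison table.

```python
# Magnitudes representable by a 4-bit E2M1 value (sign is stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block):
    """Round one block of floats to signed E2M1 values with a shared scale.

    Simplification: the scale is a Python float, not an FP8 value.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1]
    codes = []
    for x in block:
        target = abs(x) / scale
        # Round-to-nearest onto the grid (ties go to the smaller magnitude).
        mag = min(E2M1_GRID, key=lambda g: abs(target - g))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate floats from the shared scale and E2M1 values."""
    return [scale * c for c in codes]


scale, codes = quantize_block([6.0, -3.0, 0.6, 0.0])
restored = dequantize_block(scale, codes)
```

In this toy block the value `0.6` lands on the grid point `0.5`, which shows why tensors with heavy outliers lose small values at 4 bits and are candidates for MXFP8 or BF16 instead.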
### MXFP8-protected layers

| Category | Blocks | Layers |
|---|---|---|
| Early attention outputs | 0, 1 | `attention.out` |
| Selected QKV projections | 10, 16, 26, 27, 28 | `attention.qkv` |
| Attention outputs | 3, 6, 9, 11–14, 19, 20, 26–29 | `attention.out` |
| Gate projections (w1) | 3–29 | `feed_forward.w1` |
| Mid-block modulations | 16–21 | `adaLN_modulation.0` |
### BF16-protected layers

| Category | Layers | Reason |
|---|---|---|
| Last QKV | `layers.29.attention.qkv` | Feeds directly into `final_layer`, so there is no downstream compensation |
| Late modulations | `layers.(22–29).adaLN_modulation.0` | Controls scale/shift of features near output |
| Refiner attention outputs | `context_refiner.(0\|1).attention.out` | Only 2 refiner blocks, so outputs have outsized impact |
| Selected refiner FF | `context_refiner.1.w2`, `noise_refiner.1.{qkv,out,w2}` | Critical single-block projections |
| Refiner up-projections | `noise_refiner.(0\|1).w3` | Noise refiner w3 expands features → direct output |
### Refiner sub-graphs

| Sub-graph | Block 0 | Block 1 |
|---|---|---|
| `context_refiner` | qkv + w1 + w2 + w3 MXFP8, out BF16 | qkv + w1 + w3 MXFP8, out + w2 BF16 |
| `noise_refiner` | qkv + out + w1 + w2 MXFP8, w3 BF16 | qkv + out + w2 + w3 BF16, w1 MXFP8 |
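
The tables above can be checked mechanically. The following sketch maps a weight name to its format using condensed equivalents of the `--custom-layers` and `--exclude-layers` patterns from the generation script. Two things are assumptions rather than documented behavior of `convert_to_quant`: that patterns are applied with `re.search`, and that exclusions take precedence over custom-type matches (which is what the tables imply for `context_refiner.1` w2). `assign_format` is a hypothetical helper name, and tensors that `--zimage` auto-excludes (norms, biases, embeddings) are not modeled.

```python
import re

# Condensed equivalents of the --custom-layers patterns (MXFP8).
MXFP8_PATTERNS = [
    r"layers\.(10|16|26|27|28)\.attention\.qkv\.weight",
    r"layers\.(0|1|3|6|9|11|12|13|14|19|20|26|27|28|29)\.attention\.out\.weight",
    r"layers\.([3-9]|1[0-9]|2[0-9])\.feed_forward\.w1\.weight",
    r"layers\.(16|17|18|19|20|21)\.adaLN_modulation\.0\.weight",
    r"context_refiner\.(0|1)\.attention\.qkv\.weight",
    r"context_refiner\.(0|1)\.feed_forward\.(w1|w2|w3)\.weight",
    r"noise_refiner\.0\.attention\.(qkv|out)\.weight",
    r"noise_refiner\.0\.feed_forward\.(w1|w2)\.weight",
    r"noise_refiner\.1\.feed_forward\.w1\.weight",
]

# Condensed equivalents of the --exclude-layers patterns (kept in BF16).
BF16_PATTERNS = [
    r"layers\.29\.attention\.qkv\.weight",
    r"layers\.(22|23|24|25|26|27|28|29)\.adaLN_modulation\.0\.weight",
    r"context_refiner\.(0|1)\.attention\.out\.weight",
    r"context_refiner\.1\.feed_forward\.w2\.weight",
    r"noise_refiner\.1\.attention\.(qkv|out)\.weight",
    r"noise_refiner\.1\.feed_forward\.w2\.weight",
    r"noise_refiner\.(0|1)\.feed_forward\.w3\.weight",
]

MXFP8_RE = re.compile("|".join(MXFP8_PATTERNS))
BF16_RE = re.compile("|".join(BF16_PATTERNS))


def assign_format(weight_name):
    """Return 'BF16', 'MXFP8', or 'NVFP4' for a quantizable weight name.

    Assumption: exclusions are checked first, then custom-type matches,
    and everything else falls through to the NVFP4 baseline.
    """
    if BF16_RE.search(weight_name):
        return "BF16"
    if MXFP8_RE.search(weight_name):
        return "MXFP8"
    return "NVFP4"
```

For example, `assign_format("layers.29.attention.qkv.weight")` returns `"BF16"`, while `assign_format("layers.5.attention.qkv.weight")` falls through to `"NVFP4"`.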
## Generation

```bash
#!/bin/bash
# NVFP4 baseline + MXFP8 for sensitive layers + BF16 at critical points.
# Refiners: block 0 mostly MXFP8 (context out and noise w3 stay BF16), block 1 outputs kept in BF16.
# Last QKV (layer 29), late adaLN (22-29), and refiner outputs in BF16.
# All main-trunk w1 (gate) projections in MXFP8.
# Usage: pass the source .safetensors path as the first argument.
convert_to_quant -i "$1" \
  --nvfp4 --zimage --comfy_quant --save-quant-metadata \
  --custom-type mxfp8 \
  --custom-layers "layers\.(10|16|26)\.attention\.qkv\.weight|layers\.(27|28)\.attention\.qkv\.weight|layers\.(0|1)\.attention\.out\.weight|layers\.(3|6|9|11|12|13|14|19|20|26)\.attention\.out\.weight|layers\.(27|28|29)\.attention\.out\.weight|layers\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26)\.feed_forward\.w1\.weight|layers\.(27|28|29)\.feed_forward\.w1\.weight|layers\.(16|17|18|19|20|21)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.qkv\.weight|context_refiner\.(0|1)\.feed_forward\.w1\.weight|context_refiner\.(0|1)\.feed_forward\.w2\.weight|context_refiner\.(0|1)\.feed_forward\.w3\.weight|noise_refiner\.(0)\.attention\.(qkv|out)\.weight|noise_refiner\.(0)\.feed_forward\.(w1|w2)\.weight|noise_refiner\.(1)\.feed_forward\.w1\.weight" \
  --exclude-layers "layers\.(29)\.attention\.qkv\.weight|layers\.(22|23|24|25|26)\.adaLN_modulation\.0\.weight|layers\.(27|28|29)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.out\.weight|context_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(1)\.attention\.qkv\.weight|noise_refiner\.(1)\.attention\.out\.weight|noise_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(0|1)\.feed_forward\.w3\.weight" \
  --num-iter 6000 --top-p 0.35 --calib-samples 8192 \
  --scale-optimization iterative --scale-refinement-rounds 2 \
  --extract-lora --lora-rank 32 \
  -o "${1%%.safetensors}-nvfp4.safetensors"
```
### Included files

| File | Description |
|---|---|
| `z_image_turbo_nvfp4.safetensors` | Quantized weights |
| `z_image_turbo_nvfp4_lora.safetensors` | Error-correction LoRA (rank 32) |

Use the LoRA with variable strength in ComfyUI for improved fidelity.
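
One way to read "variable strength" is as a scale on a low-rank correction added to the quantized weights. The sketch below assumes the usual LoRA form `W ≈ W_q + strength * (B @ A)`. The matrices are toy values chosen so that the quantization residual is exactly rank 1, and `apply_lora` and `max_abs_error` are hypothetical helpers, not ComfyUI's loader or the extraction code in `convert_to_quant`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(w_q, lora_b, lora_a, strength):
    """Add a scaled low-rank correction B @ A to the quantized weights."""
    delta = matmul(lora_b, lora_a)
    return [
        [wq + strength * d for wq, d in zip(q_row, d_row)]
        for q_row, d_row in zip(w_q, delta)
    ]


def max_abs_error(w, w_hat):
    """Largest element-wise deviation between two matrices."""
    return max(abs(x - y) for r1, r2 in zip(w, w_hat) for x, y in zip(r1, r2))


# Toy example: original weights, a coarse "quantized" copy, and a rank-1
# LoRA whose product B @ A equals the residual W - W_q exactly.
w = [[1.25, -0.5], [1.0, 2.5]]
w_q = [[1.0, -1.0], [0.5, 1.5]]
lora_b = [[0.5], [1.0]]   # shape (2, 1)
lora_a = [[0.5, 1.0]]     # shape (1, 2)

error_without = max_abs_error(w, apply_lora(w_q, lora_b, lora_a, 0.0))
error_half = max_abs_error(w, apply_lora(w_q, lora_b, lora_a, 0.5))
error_full = max_abs_error(w, apply_lora(w_q, lora_b, lora_a, 1.0))
```

In this idealized case the error falls linearly from `1.0` at strength 0 to `0.0` at strength 1. A real rank-32 LoRA only captures part of the residual, so the best strength is an empirical choice.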
## Requirements

- **Inference**: CUDA 13.0+, PyTorch 2.10+, [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200)
- **Generation**: `convert_to_quant >= 1.2.6`, `comfy-kitchen`
## Comparison

| | NVFP4 Mixed (this) | MXFP8 Uniform | Official NVFP4 |
| --- | --- | --- | --- |
| Size | 4.84 GB | 6.23 GB | 4.51 GB |
| Base format | NVFP4 (4-bit) | MXFP8 (8-bit) | NVFP4 (4-bit) |
| Custom layers | ~100 tensors → MXFP8 | None | None |
| BF16 exclusions | ~20 tensors | 8 patterns | Refiners fully BF16 |
| Learned rounding | ✅ 6000 iter | ❌ `--simple` | ❌ |
| LoRA | ✅ rank 32 | ❌ | ❌ |
| Refiner block 0 | MXFP8 | MXFP8 | BF16 |
| Late adaLN (22–29) | BF16 | BF16 | NVFP4 ⚠️ |
| Last QKV (layer 29) | BF16 | BF16 | NVFP4 ⚠️ |
| Quantization time¹ | ~60–90 min | ~5–10 min | N/A |

¹ Estimated on RTX 5060 (Blackwell) with `comfy-kitchen` CUDA kernels.
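
The size gap between the formats follows from their per-weight storage cost. As a back-of-envelope sketch, the code below adds the amortized block-scale overhead to the element width. The block sizes (16 elements per FP8 scale for NVFP4, 32 elements per E8M0 scale for MXFP8) come from the format definitions rather than from this model card, and the estimate ignores per-tensor scales, BF16 tensors, and file metadata. `effective_bits` and `estimate_gb` are hypothetical helper names.

```python
def effective_bits(element_bits, scale_bits, block_size):
    """Bits per weight once the shared block scale is amortized."""
    return element_bits + scale_bits / block_size


def estimate_gb(num_params, bits_per_weight):
    """Approximate storage in decimal gigabytes for num_params weights."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumed block layouts (not stated in this model card):
NVFP4_BITS = effective_bits(4, 8, 16)   # 4-bit E2M1 + 8-bit scale per 16 values
MXFP8_BITS = effective_bits(8, 8, 32)   # 8-bit E4M3 + 8-bit E8M0 scale per 32 values

# Rough figure for a 6B-parameter model stored entirely in MXFP8.
mxfp8_uniform_gb = estimate_gb(6e9, MXFP8_BITS)
```

Under these assumptions NVFP4 costs 4.5 bits per weight and MXFP8 costs 8.25, and the uniform MXFP8 estimate of about 6.19 GB is close to the 6.23 GB listed above. The mixed build sits between the pure NVFP4 and pure MXFP8 figures because part of the model is promoted to MXFP8 or BF16.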
## Methodology

Layer sensitivity was analyzed using [`quant_probe`](https://github.com/insecure-erasure/quant_probe), which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend `*KEEP*`, `FP8`, or `NVFP4`.
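
For readers unfamiliar with these statistics, here is a minimal sketch of how they could be computed for a single 2-D weight matrix. This is not the `quant_probe` implementation: `tensor_stats` is a hypothetical helper, and the exact definitions used here (population excess kurtosis, dynamic range as the ratio of largest to smallest non-zero magnitude, aspect ratio as rows over columns) are assumptions.

```python
def tensor_stats(matrix):
    """Compute simple sensitivity statistics for a 2-D weight matrix.

    matrix is a non-empty list of equal-length rows of floats.
    """
    values = [x for row in matrix for x in row]
    n = len(values)
    mean = sum(values) / n
    # Second and fourth central moments (population form).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    # Excess kurtosis is 0 for a Gaussian; heavy tails push it above 0.
    excess_kurtosis = m4 / (m2 * m2) - 3.0 if m2 > 0 else 0.0
    # Assumed definition: largest over smallest non-zero magnitude.
    nonzero = [abs(x) for x in values if x != 0]
    dynamic_range = max(nonzero) / min(nonzero) if nonzero else 1.0
    # Assumed definition: rows divided by columns.
    aspect_ratio = len(matrix) / len(matrix[0])
    return {
        "excess_kurtosis": excess_kurtosis,
        "dynamic_range": dynamic_range,
        "aspect_ratio": aspect_ratio,
    }


stats = tensor_stats([[-1.0, 1.0], [-1.0, 1.0]])
```

A matrix whose values are all ±1 has no tails at all, so its excess kurtosis is `-2.0` and its dynamic range is `1.0`. Tensors with a few large outliers score much higher on both, which is the kind of signal that would argue for keeping them in a wider format.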

Recommendations were cross-referenced against the DiT quantization literature:

- **PTQ4DiT** (NeurIPS 2024): salient channels in QKV + FFN, last blocks most affected
- **ViDiT-Q** (ICLR 2025): metric-decoupled sensitivity, self-attention dominates visual quality
- **HTG** (2025): channel-dependent outliers, severe in later blocks
- **SemanticDialect** (2026): block-wise mixed-format validated for video DiTs
- **SVDQuant** (ICLR 2025): low-rank branch absorbs 4-bit error, validated NVFP4
## Credits

- Quantization engine: [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant) by silveroxides
- Z-Image Turbo model by [Tongyi-MAI](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)
- ComfyUI integration via [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen)
- Layer sensitivity analysis via [`quant_probe`](https://github.com/insecure-erasure/quant_probe)