Update README.md
README.md
CHANGED
|
@@ -16,36 +16,29 @@ tags:
|
|
| 16 |
---
|
| 17 |
|
| 18 |
|
| 19 |
-
# Z-Image Turbo
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
**Format**: MXFP8 (8-bit E4M3 + E8M0 block scales) with minimal BF16 exclusions.
|
| 24 |
-
**Size**: 6.23 GB (−46% vs BF16).
|
| 25 |
-
**Inference**: ComfyUI + [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200).
|
| 26 |
|
| 27 |

|
| 28 |

|
| 29 |
|
| 30 |
-
---
|
| 31 |
|
| 32 |
-
##
|
| 33 |
|
| 34 |
At 8-bit E4M3 with microscaling (E8M0, block=32), the quantization grid has 256 values, 16× finer than NVFP4's 4-bit grid. The DiT quantization literature (PTQ4DiT, ViDiT-Q, SemanticDialect) and our own `quant_probe` analysis converge on the same conclusion:
|
| 35 |
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
What _does_ matter: keeping a handful of architecturally critical layers in BF16. Everything else goes to MXFP8.
|
| 39 |
-
|
| 40 |
-
### Key design decisions
|
| 41 |
|
| 42 |
- **`--simple`**: skips learned rounding. Bias correction (always active) handles systematic error. Rounding noise at 8-bit is below perceptibility.
|
| 43 |
-
- **No LoRA**: the residual quantization error at 8-bit is <0.1% MSE.
|
| 44 |
- **8 exclusion patterns**: only the layers that `quant_probe` and the literature flag as critical.
|
| 45 |
|
| 46 |
-
-
|
| 47 |
-
|
| 48 |
-
## BF16-excluded layers (8 patterns)
|
| 49 |
|
| 50 |
| Category | Layers | Reason |
|
| 51 |
|---|---|---|
|
|
@@ -55,11 +48,7 @@ What _does_ matter: keeping a handful of architecturally critical layers in BF16
|
|
| 55 |
| Selected refiner FF | `context_refiner.1.w2`, `noise_refiner.1.{qkv,out,w2}` | Critical single-block projections |
|
| 56 |
| Refiner up-projections | `noise_refiner.(0\|1).w3` | Noise refiner w3 expands features (direct output) |
|
| 57 |
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
All other weight tensors (attention projections, feed-forward layers, early/mid-block modulations, refiner block 0) use MXFP8 uniformly.
|
| 61 |
-
|
| 62 |
-
---
|
| 63 |
|
| 64 |
## Generation
|
| 65 |
|
|
@@ -75,15 +64,11 @@ convert_to_quant -i $1 \
|
|
| 75 |
-o "${1%%.safetensors}-mxfp8.safetensors"
|
| 76 |
```
|
| 77 |
|
| 78 |
-
---
|
| 79 |
-
|
| 80 |
## Requirements
|
| 81 |
|
| 82 |
- **Inference**: CUDA 13.0+, PyTorch 2.10+, [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx)
|
| 83 |
- **Generation**: `convert_to_quant >= 1.2.6`, `comfy-kitchen`
|
| 84 |
|
| 85 |
-
---
|
| 86 |
-
|
| 87 |
## Methodology
|
| 88 |
|
| 89 |
Layer sensitivity was analyzed using [`quant_probe`](https://github.com/insecure-erasure/quant_probe), which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend `*KEEP*`, `FP8`, or `NVFP4`.
|
|
@@ -96,10 +81,6 @@ Recommendations were cross-referenced against the DiT quantization literature:
|
|
| 96 |
- **SemanticDialect** (2026): block-wise mixed-format validated for video DiTs
|
| 97 |
- **SVDQuant** (ICLR 2025): low-rank branch absorbs 4-bit error, validated NVFP4
|
| 98 |
|
| 99 |
-
The conclusion: at 8-bit weight-only, the format itself is sufficient. Surgical precision matters at 4-bit, not at 8-bit.
|
| 100 |
-
|
| 101 |
-
---
|
| 102 |
-
|
| 103 |
## Credits
|
| 104 |
|
| 105 |
- Quantization engine: [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant) by silveroxides
|
|
|
|
| 16 |
---
|
| 17 |
|
| 18 |
|
| 19 |
+
# Z-Image Turbo MXFP8
|
| 20 |
|
| 21 |
+
Mixed 8-bit microscaling quantization of [Z-Image Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) (6B S3-DiT), generated with [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant).
|
| 22 |
|
| 23 |
+
* **Format**: MXFP8 (8-bit E4M3 + E8M0 block scales) with minimal BF16 exclusions.
|
| 24 |
+
* **Size**: 6.23 GB (−46% vs BF16).
|
| 25 |
+
* **Inference**: ComfyUI + [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200).
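
As a rough check on the size figure, MXFP8 stores 8 bits per element plus one 8-bit E8M0 scale per 32-element block, or 8.25 bits per weight on average. The sketch below estimates average bits per weight for a mixed MXFP8/BF16 checkpoint. The function names and the `bf16_fraction` parameter are illustrative assumptions, not part of any tool named here, and the estimate ignores non-weight tensors and file metadata.

```python
ELEMENT_BITS = 8   # one E4M3 value per weight
SCALE_BITS = 8     # one E8M0 scale shared by each block
BLOCK_SIZE = 32    # elements per block
BF16_BITS = 16


def mxfp8_bits_per_weight():
    """Average storage cost of one MXFP8 weight, scale overhead included."""
    return ELEMENT_BITS + SCALE_BITS / BLOCK_SIZE


def mixed_bits_per_weight(bf16_fraction):
    """Average bits per weight when `bf16_fraction` of weights stay in BF16.

    `bf16_fraction` is a hypothetical input; the README does not state it.
    """
    return bf16_fraction * BF16_BITS + (1.0 - bf16_fraction) * mxfp8_bits_per_weight()


def reduction_vs_bf16(bf16_fraction):
    """Fractional size reduction relative to an all-BF16 checkpoint."""
    return 1.0 - mixed_bits_per_weight(bf16_fraction) / BF16_BITS


if __name__ == "__main__":
    print(mxfp8_bits_per_weight())   # bits per weight for pure MXFP8
    print(reduction_vs_bf16(0.0))    # upper bound on the saving
```

With no exclusions the saving tops out at about 48.4%, so a figure somewhat below that is what a small BF16-excluded share would be expected to produce.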
|
| 26 |
|
| 27 |

|
| 28 |

|
| 29 |
|
|
|
|
| 30 |
|
| 31 |
+
### Key design decisions
|
| 32 |
|
| 33 |
At 8-bit E4M3 with microscaling (E8M0, block=32), the quantization grid has 256 values, 16× finer than NVFP4's 4-bit grid. The DiT quantization literature (PTQ4DiT, ViDiT-Q, SemanticDialect) and our own `quant_probe` analysis converge on the same conclusion:
|
| 34 |
|
| 35 |
+
The format itself is near-lossless. Learned rounding, LoRA error correction, and scale optimization (all critical at 4-bit) provide diminishing returns here. What does matter is keeping a handful of architecturally critical layers in BF16; everything else goes to MXFP8.
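
To make the format concrete, here is a minimal pure-Python sketch of MXFP8-style block quantization: each block of 32 values shares one power-of-two scale (the E8M0 exponent), and each scaled value is rounded to the E4M3 grid. It follows the common microscaling convention of deriving the shared exponent from the block maximum. It is a simulation under those assumptions, not the code used by `convert_to_quant` or `comfy-kitchen`, and every function name in it is hypothetical.

```python
import math

E4M3_MAX = 448.0   # largest finite E4M3 value (1.75 * 2**8)
E4M3_EMAX = 8      # exponent of the largest E4M3 binade
E4M3_EMIN = -6     # exponent of the smallest normal E4M3 binade
MANTISSA_BITS = 3
BLOCK_SIZE = 32


def floor_log2(x):
    """floor(log2(x)) for x > 0, computed exactly via frexp."""
    _, exp = math.frexp(x)   # x = m * 2**exp with 0.5 <= m < 1
    return exp - 1


def round_to_e4m3(x):
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    # Below the smallest normal binade, subnormals share a fixed step.
    exp = max(floor_log2(abs(x)), E4M3_EMIN)
    step = 2.0 ** (exp - MANTISSA_BITS)
    rounded = round(x / step) * step   # Python rounds halves to even
    return max(-E4M3_MAX, min(E4M3_MAX, rounded))


def quantize_block(block):
    """Return (shared_exponent, e4m3_values) for one block of weights."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Align the block maximum with the top E4M3 binade; clamp to E8M0 range.
    shared_exp = max(-127, min(127, floor_log2(amax) - E4M3_EMAX))
    scale = 2.0 ** shared_exp
    return shared_exp, [round_to_e4m3(v / scale) for v in block]


def dequantize_block(shared_exp, elements):
    """Reconstruct approximate weights from the shared exponent and elements."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


if __name__ == "__main__":
    weights = [1.0, 0.5, 0.25, -0.75] + [0.0] * (BLOCK_SIZE - 4)
    shared_exp, elements = quantize_block(weights)
    print(shared_exp, dequantize_block(shared_exp, elements)[:4])
```

Because the scale is a power of two, applying it is exact; all rounding error comes from the 3-bit mantissa of the elements.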
|
|
|
|
|
|
|
|
|
|
|
|
|
| 36 |
|
| 37 |
- **`--simple`**: skips learned rounding. Bias correction (always active) handles systematic error. Rounding noise at 8-bit is below perceptibility.
|
| 38 |
+
- **No LoRA error correction**: the residual quantization error at 8-bit is <0.1% MSE.
|
| 39 |
- **8 exclusion patterns**: only the layers that `quant_probe` and the literature flag as critical.
|
| 40 |
|
| 41 |
+
**BF16-excluded layers**
|
|
|
|
|
|
|
| 42 |
|
| 43 |
| Category | Layers | Reason |
|
| 44 |
|---|---|---|
|
|
|
|
| 48 |
| Selected refiner FF | `context_refiner.1.w2`, `noise_refiner.1.{qkv,out,w2}` | Critical single-block projections |
|
| 49 |
| Refiner up-projections | `noise_refiner.(0\|1).w3` | Noise refiner w3 expands features (direct output) |
|
| 50 |
|
| 51 |
+
All other weight tensors (attention projections, feed-forward layers, early/mid-block modulations, refiner block 0) use MXFP8.
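
The exclusion rule can be pictured as a name-based filter: a tensor whose name matches any exclusion pattern stays in BF16, and the rest are quantized. The sketch below shows that idea with Python's `re` module. The regular expressions are my own rendering of the two table rows visible above, and the tensor names are hypothetical examples, so neither should be read as the exact patterns or key names used to build this checkpoint.

```python
import re

# Illustrative regex versions of two rows from the table (assumed, not exact).
EXCLUSION_PATTERNS = [
    r"context_refiner\.1\..*w2",
    r"noise_refiner\.1\..*(qkv|out|w2)",
    r"noise_refiner\.(0|1)\..*w3",
]

_COMPILED = [re.compile(p) for p in EXCLUSION_PATTERNS]


def target_format(tensor_name):
    """Return "BF16" if the name matches an exclusion pattern, else "MXFP8"."""
    if any(p.search(tensor_name) for p in _COMPILED):
        return "BF16"
    return "MXFP8"


if __name__ == "__main__":
    # Hypothetical tensor names, for illustration only.
    for name in [
        "noise_refiner.0.feed_forward.w3.weight",
        "noise_refiner.0.feed_forward.w1.weight",
        "layers.5.attention.qkv.weight",
    ]:
        print(name, target_format(name))
```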
|
|
|
|
|
|
|
|
|
|
|
|
|
| 52 |
|
| 53 |
## Generation
|
| 54 |
|
|
|
|
| 64 |
-o "${1%%.safetensors}-mxfp8.safetensors"
|
| 65 |
```
|
| 66 |
|
|
|
|
|
|
|
| 67 |
## Requirements
|
| 68 |
|
| 69 |
- **Inference**: CUDA 13.0+, PyTorch 2.10+, [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx)
|
| 70 |
- **Generation**: `convert_to_quant >= 1.2.6`, `comfy-kitchen`
|
| 71 |
|
|
|
|
|
|
|
| 72 |
## Methodology
|
| 73 |
|
| 74 |
Layer sensitivity was analyzed using [`quant_probe`](https://github.com/insecure-erasure/quant_probe), which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend `*KEEP*`, `FP8`, or `NVFP4`.
|
|
|
|
| 81 |
- **SemanticDialect** (2026): block-wise mixed-format validated for video DiTs
|
| 82 |
- **SVDQuant** (ICLR 2025): low-rank branch absorbs 4-bit error, validated NVFP4
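
The three per-tensor statistics named above for `quant_probe` (excess kurtosis, dynamic range, aspect ratio) can be sketched in a few lines. The definitions below are common textbook choices and are assumptions on my part: the population form of excess kurtosis, dynamic range as the log2 ratio of largest to smallest nonzero magnitude, and aspect ratio as long side over short side. `quant_probe` may define or normalize them differently, and its scoring step is not reproduced here.

```python
import math


def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0.0:
        return 0.0   # constant tensor: no tails to measure
    return m4 / (m2 * m2) - 3.0


def dynamic_range_log2(values):
    """log2 of the largest over the smallest nonzero magnitude (assumed definition)."""
    magnitudes = [abs(v) for v in values if v != 0.0]
    if not magnitudes:
        return 0.0
    return math.log2(max(magnitudes) / min(magnitudes))


def aspect_ratio(shape):
    """Long side over short side of a 2-D weight matrix."""
    rows, cols = shape
    return max(rows, cols) / min(rows, cols)


if __name__ == "__main__":
    sample = [0.5, -4.0, 0.0, 1.0]   # toy values, not real weights
    print(excess_kurtosis(sample), dynamic_range_log2(sample), aspect_ratio((4096, 1024)))
```

Heavy-tailed tensors (high excess kurtosis) and tensors with a wide dynamic range are the ones a shared block scale serves worst, which is why statistics of this kind are a plausible basis for a keep-in-BF16 recommendation.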
|
| 83 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 84 |
## Credits
|
| 85 |
|
| 86 |
- Quantization engine: [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant) by silveroxides
|