InsecureErasure committed
Commit 577bacb · verified · 1 Parent(s): 8d3c62c

Update README.md

Files changed (1)
  1. README.md +10 -29
README.md CHANGED
@@ -16,36 +16,29 @@ tags:
  ---
 
 
- # Z-Image Turbo - MXFP8 Uniform
 
- Uniform 8-bit microscaling quantization of [Z-Image Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) (6B S3-DiT), generated with [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant).
 
- **Format**: MXFP8 (8-bit E4M3 + E8M0 block scales) with minimal BF16 exclusions.
- **Size**: 6.23 GB (−46% vs BF16).
- **Inference**: ComfyUI + [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200).
 
  ![ZiT-MXFP8-01.png](images/ZiT-MXFP8-01.png)
  ![ZiT-MXFP8-02.png](images/ZiT-MXFP8-02.png)
 
- ---
 
- ## Why MXFP8?
 
  At 8-bit E4M3 with microscaling (E8M0, block=32), the quantization grid has 256 values - 16× finer than NVFP4's 4-bit grid. The DiT quantization literature (PTQ4DiT, ViDiT-Q, SemanticDialect) and our own `quant_probe` analysis converge on the same conclusion:
 
- **At 8-bit weight-only, per-layer format selection is overkill.** The format itself is near-lossless. Learned rounding, LoRA error correction, and scale optimization - all critical at 4-bit - provide diminishing returns here.
-
- What _does_ matter: keeping a handful of architecturally critical layers in BF16. Everything else goes to MXFP8.
-
- ### Key design decisions
 
  - **`--simple`**: skips learned rounding. Bias correction (always active) handles systematic error. Rounding noise at 8-bit is below perceptibility.
- - **No LoRA**: the residual quantization error at 8-bit is <0.1% MSE.
  - **8 exclusion patterns**: only the layers that `quant_probe` and the literature flag as critical.
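
As a rough illustration of the format described above (E4M3 elements sharing one E8M0 power-of-two scale per block of 32), the following pure-Python sketch quantizes and dequantizes weights block by block. The scale rule (align the block maximum with the top E4M3 binade), the synthetic Gaussian weights, and all function names are assumptions made for illustration; this is not the `convert_to_quant` or `comfy-kitchen` implementation.

```python
import math
import random

E4M3_MAX = 448.0   # largest finite E4M3 magnitude (1.75 * 2**8)
E4M3_EMAX = 8      # exponent of the top E4M3 binade
E4M3_EMIN = -6     # smallest normal exponent; smaller values are subnormal
MANTISSA_BITS = 3
BLOCK = 32         # elements sharing one E8M0 scale


def round_to_e4m3(x):
    """Round x to the nearest E4M3 value, saturating at +/-448."""
    mag = min(abs(x), E4M3_MAX)
    if mag == 0.0:
        return 0.0
    exp = max(math.floor(math.log2(mag)), E4M3_EMIN)
    step = 2.0 ** (exp - MANTISSA_BITS)  # grid spacing inside this binade
    return math.copysign(round(mag / step) * step, x)


def mxfp8_quantize(block):
    """Return (shared_exp, elements) for one block of floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Assumed scale rule: place the block maximum in the top E4M3 binade.
    shared_exp = math.floor(math.log2(amax)) - E4M3_EMAX
    shared_exp = max(-127, min(127, shared_exp))  # E8M0 exponent range
    scale = 2.0 ** shared_exp
    return shared_exp, [round_to_e4m3(v / scale) for v in block]


def mxfp8_dequantize(shared_exp, elements):
    """Multiply the E4M3 elements back by the shared power-of-two scale."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


# Synthetic stand-in for a weight tensor (not real model weights).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK * 64)]
recon = []
for i in range(0, len(weights), BLOCK):
    shared_exp, elements = mxfp8_quantize(weights[i:i + BLOCK])
    recon.extend(mxfp8_dequantize(shared_exp, elements))

rel_mse = sum((w - r) ** 2 for w, r in zip(weights, recon)) / sum(w * w for w in weights)
print(f"relative MSE on synthetic Gaussian weights: {rel_mse:.2e}")
```

Each block stores 32 one-byte elements plus one one-byte scale, which works out to 8 + 8/32 = 8.25 bits per weight.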
 
- ---
-
- ## BF16-excluded layers (8 patterns)
 
  | Category | Layers | Reason |
  |---|---|---|
@@ -55,11 +48,7 @@ What _does_ matter: keeping a handful of architecturally critical layers in BF16
  | Selected refiner FF | `context_refiner.1.w2`, `noise_refiner.1.{qkv,out,w2}` | Critical single-block projections |
  | Refiner up-projections | `noise_refiner.(0\|1).w3` | Noise refiner w3 expands features → direct output |
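
To make the exclusion idea concrete, here is a minimal sketch of how layer-name patterns like the two table rows visible in this hunk could route tensors to BF16 or MXFP8. The regular expressions, the full tensor names, and the `target_format` helper are hypothetical; the actual pattern syntax accepted by `convert_to_quant` and the remaining table rows are not shown here.

```python
import re

# Hypothetical patterns modeled on the two visible table rows only.
EXCLUDE_PATTERNS = [
    r"context_refiner\.1\..*w2",
    r"noise_refiner\.1\..*(qkv|out|w2)",
    r"noise_refiner\.(0|1)\..*w3",
]
_EXCLUDE_RE = [re.compile(p) for p in EXCLUDE_PATTERNS]


def target_format(tensor_name):
    """Return "BF16" for tensors matching an exclusion pattern, else "MXFP8"."""
    if any(rx.search(tensor_name) for rx in _EXCLUDE_RE):
        return "BF16"
    return "MXFP8"


# Hypothetical tensor names, used only to exercise the matcher.
for name in [
    "context_refiner.1.feed_forward.w2.weight",
    "noise_refiner.1.attention.qkv.weight",
    "noise_refiner.0.feed_forward.w3.weight",
    "noise_refiner.0.attention.qkv.weight",
    "layers.12.attention.qkv.weight",
]:
    print(f"{name}: {target_format(name)}")
```

Keeping the patterns in one list means the set of BF16 layers can be reviewed at a glance, and any tensor that matches nothing falls through to MXFP8 by default.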
 
- ### Everything else: MXFP8
-
- All other weight tensors - attention projections, feed-forward layers, early/mid-block modulations, refiner block 0 - use MXFP8 uniformly.
-
- ---
 
  ## Generation
 
@@ -75,15 +64,11 @@ convert_to_quant -i $1 \
  -o "${1%%.safetensors}-mxfp8.safetensors"
  ```
 
- ---
-
  ## Requirements
 
  - **Inference**: CUDA 13.0+, PyTorch 2.10+, [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx)
  - **Generation**: `convert_to_quant >= 1.2.6`, `comfy-kitchen`
 
- ---
-
  ## Methodology
 
  Layer sensitivity was analyzed using [`quant_probe`](https://github.com/insecure-erasure/quant_probe), which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend `*KEEP*`, `FP8`, or `NVFP4`.
@@ -96,10 +81,6 @@ Recommendations were cross-referenced against the DiT quantization literature:
  - **SemanticDialect** (2026) - block-wise mixed-format validated for video DiTs
  - **SVDQuant** (ICLR 2025) - low-rank branch absorbs 4-bit error, validated NVFP4
 
- The conclusion: at 8-bit weight-only, the format itself is sufficient. Surgical precision matters at 4-bit, not at 8-bit.
-
- ---
-
  ## Credits
 
  - Quantization engine: [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant) by silveroxides
 
  ---
 
 
+ # Z-Image Turbo MXFP8
 
+ Mixed 8-bit microscaling quantization of [Z-Image Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) (6B S3-DiT), generated with [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant).
 
+ * **Format**: MXFP8 (8-bit E4M3 + E8M0 block scales) with minimal BF16 exclusions.
+ * **Size**: 6.23 GB (−46% vs BF16).
+ * **Inference**: ComfyUI + [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx / B100 / B200).
 
  ![ZiT-MXFP8-01.png](images/ZiT-MXFP8-01.png)
  ![ZiT-MXFP8-02.png](images/ZiT-MXFP8-02.png)
 
 
+ ### Key design decisions
 
  At 8-bit E4M3 with microscaling (E8M0, block=32), the quantization grid has 256 values - 16× finer than NVFP4's 4-bit grid. The DiT quantization literature (PTQ4DiT, ViDiT-Q, SemanticDialect) and our own `quant_probe` analysis converge on the same conclusion:
 
+ The format itself is near-lossless. Learned rounding, LoRA error correction, and scale optimization - all critical at 4-bit - provide diminishing returns here. What matters is keeping a handful of architecturally critical layers in BF16; everything else goes to MXFP8.
 
  - **`--simple`**: skips learned rounding. Bias correction (always active) handles systematic error. Rounding noise at 8-bit is below perceptibility.
+ - **No rank LoRA**: the residual quantization error at 8-bit is <0.1% MSE.
  - **8 exclusion patterns**: only the layers that `quant_probe` and the literature flag as critical.
 
+ **BF16-excluded layers**
 
  | Category | Layers | Reason |
  |---|---|---|

  | Selected refiner FF | `context_refiner.1.w2`, `noise_refiner.1.{qkv,out,w2}` | Critical single-block projections |
  | Refiner up-projections | `noise_refiner.(0\|1).w3` | Noise refiner w3 expands features → direct output |
 
+ All other weight tensors (attention projections, feed-forward layers, early/mid-block modulations, refiner block 0) use MXFP8.
 
  ## Generation
 

  -o "${1%%.safetensors}-mxfp8.safetensors"
  ```
 
  ## Requirements
 
  - **Inference**: CUDA 13.0+, PyTorch 2.10+, [`comfy-kitchen`](https://github.com/Comfy-Org/comfy-kitchen), Blackwell GPU (RTX 50xx)
  - **Generation**: `convert_to_quant >= 1.2.6`, `comfy-kitchen`
 
  ## Methodology
 
  Layer sensitivity was analyzed using [`quant_probe`](https://github.com/insecure-erasure/quant_probe), which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend `*KEEP*`, `FP8`, or `NVFP4`.

  - **SemanticDialect** (2026) - block-wise mixed-format validated for video DiTs
  - **SVDQuant** (ICLR 2025) - low-rank branch absorbs 4-bit error, validated NVFP4
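
The per-tensor statistics named under Methodology (excess kurtosis, dynamic range, aspect ratio) can be sketched in a few lines of plain Python. The exact definitions used by `quant_probe` are not given here, so the ones below are assumptions: dynamic range is taken as log2 of the largest over the smallest nonzero magnitude, and aspect ratio as the longer dimension over the shorter. The scoring step against the model's own distribution is not covered, and the tensors are synthetic.

```python
import math
import random


def tensor_stats(values, shape):
    """Summary statistics for one weight tensor given as a flat list."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    nonzero = [abs(v) for v in values if v != 0.0]
    return {
        # Fourth standardized moment minus 3; near 0 for Gaussian data.
        "excess_kurtosis": m4 / var ** 2 - 3.0,
        # Assumed definition: log2(max |w| / min nonzero |w|).
        "dynamic_range_log2": math.log2(max(nonzero) / min(nonzero)),
        # Assumed definition: longer side over shorter side.
        "aspect_ratio": max(shape) / min(shape),
    }


# Synthetic tensors: one Gaussian, one with 1% of entries scaled up 10x.
random.seed(0)
shape = (64, 256)
gaussian = [random.gauss(0.0, 1.0) for _ in range(shape[0] * shape[1])]
heavy = [v * 10.0 if random.random() < 0.01 else v for v in gaussian]

stats_gaussian = tensor_stats(gaussian, shape)
stats_heavy = tensor_stats(heavy, shape)
print("gaussian:", stats_gaussian)
print("heavy-tailed:", stats_heavy)
```

A tensor with a few large outliers shows a much higher excess kurtosis than a Gaussian one, which is the kind of signal that would argue for keeping a layer in higher precision.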
83
 
 
 
 
 
  ## Credits
 
  - Quantization engine: [`convert_to_quant`](https://github.com/silveroxides/convert_to_quant) by silveroxides