---
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-4B
pipeline_tag: text-to-image
tags:
- text-to-image
- flux
- nvfp4
- svdquant
- quantization
- blackwell
- nunchaku
---
# FLUX.2 [klein] 4B: SVDQuant NVFP4 (W4A4, rank-128), deployable on Blackwell
A **4-bit (NVFP4 W4A4)** quantization of [`black-forest-labs/FLUX.2-klein-4B`](https://huggingface.co/black-forest-labs/FLUX.2-klein-4B)
(the 4-step, CFG-free distilled rectified-flow text-to-image model). It adds an **SVDQuant high-precision
low-rank branch** to stay closer to the bf16 teacher than plain NVFP4, and it **runs on the real Nunchaku FP4
kernel on NVIDIA Blackwell** (about 1.8× faster end to end, with 26 to 29% less peak VRAM than bf16).
> **What this is:** an *ablation-backed, deployable* NVFP4 build of klein-4B. SVDQuant (Li et al., ICLR 2025),
> NVFP4, and the Nunchaku kernel are prior art; the contributions here are (1) a clean, reproducible
> measurement that the low-rank branch improves NVFP4 W4A4 **fidelity to the teacher** on this model,
> (2) a **DeepCompressor-free** bf16→Nunchaku-NVFP4 exporter (no public equivalent for klein), and
> (3) a verified deployable checkpoint on Blackwell.
## Method (calibration-free)
Per target Linear: `W ≈ L₁L₂ (bf16 low-rank, rank 128) + Q_nvfp4(W − L₁L₂)` (4-bit residual), with activations
quantized to NVFP4 in groups of 16. The low-rank branch is a **plain SVD of the weights** followed by 3 refinement
iterations that absorb the 4-bit rounding error. **No activation calibration, smoothing, or whitening** is needed:
SmoothQuant is harmful at this bit-width on this model, and per-group NVFP4 activations need no migration.
The branch is free at inference (folded into the fused kernel). Exporter: `flux2distill/nunchaku_export.py`.
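
The decomposition above can be illustrated with a small NumPy sketch. It is not the code in `flux2distill/nunchaku_export.py`: the function names, the alternating refinement loop, and the toy matrix are assumptions made for illustration, and the quantizer keeps its per-group scales in full precision, whereas real NVFP4 stores them in a low-precision format. The toy example uses a 64×64 matrix with rank 8 instead of rank 128 so that it runs instantly.

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1; the sign is handled separately.
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4(x, group=16):
    """Round x to the FP4 grid with one scale per group of `group` values per row.

    Simplification (assumption): scales stay in full precision here.
    """
    rows, cols = x.shape
    assert cols % group == 0, "column count must be a multiple of the group size"
    g = x.reshape(rows, cols // group, group)
    amax = np.abs(g).max(axis=-1, keepdims=True)
    scale = np.where(amax > 0, amax / FP4_E2M1[-1], 1.0)
    mag = np.abs(g) / scale
    # Pick the nearest grid value for every element (ties go to the smaller value).
    idx = np.abs(mag[..., None] - FP4_E2M1).argmin(axis=-1)
    return (np.sign(g) * FP4_E2M1[idx] * scale).reshape(rows, cols)


def truncated_svd(m, rank):
    """Return factors (l1, l2) of the best rank-`rank` approximation of m."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]


def svdquant_decompose(w, rank, iters=3):
    """Split w into a low-rank part l1 @ l2 and a 4-bit residual q.

    The refinement loop (refit the low-rank part to w - q) is one plausible
    reading of "iterations of refinement", not the exporter's confirmed logic.
    """
    l1, l2 = truncated_svd(w, rank)
    for _ in range(iters):
        q = fake_quant_fp4(w - l1 @ l2)
        l1, l2 = truncated_svd(w - q, rank)
    q = fake_quant_fp4(w - l1 @ l2)
    return l1, l2, q


def rel_error(w, rank):
    l1, l2, q = svdquant_decompose(w, rank)
    return np.linalg.norm(w - (l1 @ l2 + q)) / np.linalg.norm(w)


# Toy weight: a strong low-rank component plus small dense noise.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))
w = w + 0.1 * rng.standard_normal((64, 64))

err_plain = rel_error(w, rank=0)    # plain 4-bit quantization, no branch
err_lowrank = rel_error(w, rank=8)  # low-rank branch plus 4-bit residual
print(f"rank 0: {err_plain:.4f}  rank 8: {err_lowrank:.4f}")
```

On this synthetic matrix the low-rank branch removes the large-magnitude structure before quantization, so the residual has a much smaller dynamic range per group. How much real transformer weights benefit is an empirical question, which is what the results below measure.
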
## Results (N=512, MJHQ-30k, 512px, vs the bf16 teacher)
Fidelity to the teacher is measured with LPIPS, PSNR, and FID against the teacher's own outputs. PickScore and CLIP
stay at teacher level for both 4-bit models, so quantization preserves semantics; what the low-rank branch buys is **fidelity**.
The table compares our **rank-128** model with **plain NVFP4 (rank 0)**, the same recipe without the low-rank branch, using paired seeds:

| metric (vs teacher) | plain NVFP4 r0 | **ours NVFP4 r128** | Δ |
|---|---|---|---|
| LPIPS↓ | 0.2076 | **0.1668** | **−19.7%** (closer) |
| PSNR↑ | 17.44 dB | **18.71 dB** | **+1.27 dB** |
| FID↓ (vs teacher) | 39.3 | **33.5** | **−14.7%** |
| PickScore↑ / CLIP↑ | 21.62 / 30.95 | 21.62 / 30.94 | ≈0 (no semantic loss; teacher = 21.64 / 30.95) |

The real Nunchaku FP4 kernel reproduces the fake-quant quality (LPIPS 0.167 vs 0.173), so the gain carries over to
the deployed model. For reference, BFL's official **FP8** klein-4B (8-bit, ~1.2× speedup) sits closer to the
teacher at LPIPS 0.080 / PSNR 23.0; ours is the 4-bit/2.5×-kernel point, with the low-rank branch closing about a third of the LPIPS gap (and about a quarter of the PSNR gap) between plain NVFP4 and FP8.
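
For readers who want to check the Δ column or compute a paired fidelity metric themselves, here is a minimal sketch. The helper names are hypothetical and this is not the project's evaluation code. PSNR is shown because it needs nothing beyond the standard library; LPIPS and FID depend on learned networks and are not reproduced here.

```python
import math


def psnr(reference, candidate, max_value=1.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def relative_change_pct(baseline, value):
    """Signed percentage change of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline


# Values copied from the table above (plain NVFP4 r0 vs NVFP4 r128).
lpips_delta = relative_change_pct(0.2076, 0.1668)
psnr_gain = 18.71 - 17.44
print(f"LPIPS: {lpips_delta:+.1f}%  PSNR: {psnr_gain:+.2f} dB")
```

With paired seeds, a per-image metric such as PSNR would be computed between each teacher image and the quantized model's image for the same prompt and seed, then averaged over the N=512 pairs.
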
## Speed / VRAM (RTX PRO 4500 Blackwell, batch 1, 4 steps, real Nunchaku FP4 kernel)

| metric | 512px | 1024px |
|---|---|---|
| end-to-end speedup vs bf16 | **1.76×** | **1.87×** |
| peak VRAM | **11.8 GB** (−29% vs bf16) | **13.7 GB** (−26% vs bf16) |
| per-step latency | 42 ms (vs 92 ms bf16) | 144 ms (vs 330 ms bf16) |

INT4 is **slower than bf16** on Blackwell (no INT4 tensor cores), so NVFP4 is the fast low-bit format here.
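
The per-step latencies imply a larger ratio than the end-to-end speedup. One possible explanation, stated here as an assumption rather than something the measurements confirm, is that each run also includes a fixed cost that quantization does not touch (for example text encoding and VAE decoding). The sketch below does that arithmetic; the function names and the fixed-overhead model are illustrative only.

```python
def per_step_speedup(bf16_step_ms, fp4_step_ms):
    """Ratio of bf16 to FP4 latency for a single denoising step."""
    return bf16_step_ms / fp4_step_ms


def implied_fixed_overhead_ms(bf16_step_ms, fp4_step_ms, end_to_end_speedup, steps=4):
    """Solve (steps*bf16 + t) / (steps*fp4 + t) = end_to_end_speedup for t.

    Assumes the same fixed overhead t (in ms) is paid by both pipelines.
    """
    numerator = steps * (bf16_step_ms - end_to_end_speedup * fp4_step_ms)
    return numerator / (end_to_end_speedup - 1.0)


# (resolution, bf16 ms/step, FP4 ms/step, reported end-to-end speedup)
rows = [("512px", 92, 42, 1.76), ("1024px", 330, 144, 1.87)]
for label, bf16_ms, fp4_ms, e2e in rows:
    step_ratio = per_step_speedup(bf16_ms, fp4_ms)
    overhead = implied_fixed_overhead_ms(bf16_ms, fp4_ms, e2e)
    print(f"{label}: per-step {step_ratio:.2f}x, implied fixed overhead {overhead:.0f} ms")
```

The overhead figures this prints are a consequence of the assumed model, not measurements; profiling the full pipeline would be needed to confirm where the remaining time goes.
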
## Usage
Requires a Blackwell GPU (sm_120), `torch≥2.9+cu130`, and the Nunchaku FP4 runtime with the FLUX.2 loader.
Batch size must be **1** (packed rotary), and generation must **not** be wrapped in `torch.autocast` (the FP4 kernel
manages its own dtypes). See `docs/CUDA_SETUP_RUNBOOK.md` and `scripts/35_gen_real.py` for the exact path.
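
The requirements above can be turned into a quick preflight check before loading the model. This is a sketch with hypothetical helper names, not part of the repository or of Nunchaku. It reads `torch≥2.9+cu130` as "PyTorch 2.9 or newer, built against CUDA 13.0 or newer" and treats sm_120 as compute capability `(12, 0)`; both readings are assumptions. The function takes plain values so it can run without a GPU.

```python
import re


def parse_torch_version(version):
    """Split a string such as '2.9.0+cu130' into ((2, 9), 130).

    The CUDA part is None when the build tag is missing or is not a cuNNN tag.
    """
    match = re.match(r"(\d+)\.(\d+)[^+]*(?:\+cu(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized torch version: {version!r}")
    major, minor, cuda = match.groups()
    return (int(major), int(minor)), (int(cuda) if cuda else None)


def preflight_problems(capability, torch_version, batch_size, autocast_enabled=False):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if tuple(capability) != (12, 0):
        problems.append(f"GPU compute capability {tuple(capability)} is not sm_120")
    release, cuda = parse_torch_version(torch_version)
    if release < (2, 9):
        problems.append(f"torch {torch_version} is older than 2.9")
    if cuda is None or cuda < 130:
        problems.append(f"torch {torch_version} is not a cu130 (or newer) build")
    if batch_size != 1:
        problems.append(f"batch size {batch_size} is not supported; use 1")
    if autocast_enabled:
        problems.append("generation is wrapped in torch.autocast; remove it")
    return problems


# With PyTorch installed, the inputs could come from:
#   torch.cuda.get_device_capability(), torch.__version__, torch.is_autocast_enabled()
print(preflight_problems((12, 0), "2.9.0+cu130", batch_size=1))
print(preflight_problems((8, 9), "2.8.0+cu128", batch_size=4, autocast_enabled=True))
```

Passing these checks only covers the constraints listed in this section; the Nunchaku runtime and FLUX.2 loader still have to be installed as described in `docs/CUDA_SETUP_RUNBOOK.md`.
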
## License
Apache-2.0, inherited from the base model `black-forest-labs/FLUX.2-klein-4B`.
## Limitations
- Blackwell-only for the fast path (NVFP4 tensor cores). Batch=1 in the current fused build.
- Fidelity claim is **vs the bf16 teacher**, measured at 512px on MJHQ-30k (N=512), klein-4B only.
- Not benchmarked head-to-head against BFL's official NVFP4 checkpoint (its swizzled layout requires BFL's
  TensorRT runtime to unpack); see `report/HEADTOHEAD_klein4b_nvfp4.md` §5.