Buckets:
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-4B
pipeline_tag: text-to-image
tags:
- text-to-image
- flux
- nvfp4
- svdquant
- quantization
- blackwell
- nunchaku
FLUX.2 [klein] 4B — SVDQuant NVFP4 (W4A4, rank-128) — deployable on Blackwell
A 4-bit (NVFP4 W4A4) quantization of black-forest-labs/FLUX.2-klein-4B
(the 4-step, CFG-free distilled rectified-flow text-to-image model) that adds a SVDQuant high-precision
low-rank branch to stay closer to the bf16 teacher than plain NVFP4, and runs on the real Nunchaku FP4
kernel on NVIDIA Blackwell (~1.8× faster, ~29% less VRAM).
What this is: an ablation-backed, deployable NVFP4 build of klein-4B. SVDQuant (Li et al., ICLR 2025), NVFP4, and the Nunchaku kernel are prior art; the contributions here are (1) a clean, reproducible measurement that the low-rank branch improves NVFP4 W4A4 fidelity to the teacher on this model, (2) a DeepCompressor-free bf16→Nunchaku-NVFP4 exporter (no public equivalent for klein), and (3) a verified deployable checkpoint on Blackwell.
Method (calibration-free)
Per target Linear: W ≈ L₁L₂ (bf16 low-rank, rank 128) + Q_nvfp4(W − L₁L₂) (4-bit residual), activations
NVFP4 per-group-16. The low-rank branch is a plain SVD of the weights with 3 iterations of refinement to
absorb the 4-bit rounding error — no activation calibration / no smoothing / no whitening is needed
(SmoothQuant is harmful at this bit-width on this model; per-group NVFP4 activations need no migration).
The branch is free at inference (folded into the fused kernel). Exporter: flux2distill/nunchaku_export.py.
Results (N=512, MJHQ-30k, 512px, vs the bf16 teacher)
Fidelity-to-teacher (LPIPS/PSNR/FID vs the teacher's own outputs); PickScore/CLIP stay at teacher level for both 4-bit models, so quantization preserves semantics — the low-rank branch buys fidelity.
Measured N=512, MJHQ-30k, paired seeds. Our rank-128 model vs plain NVFP4 rank-0 (same recipe, no low-rank branch):
| metric (vs teacher) | plain NVFP4 r0 | ours NVFP4 r128 | Δ |
|---|---|---|---|
| LPIPS↓ | 0.2076 | 0.1668 | −19.7% (closer) |
| PSNR↑ | 17.44 dB | 18.71 dB | +1.27 dB |
| FID↓ (vs teacher) | 39.3 | 33.5 | −14.7% |
| PickScore↑ / CLIP↑ | 21.62 / 30.95 | 21.62 / 30.94 | ≈0 (no semantic loss; teacher = 21.64 / 30.95) |
The real Nunchaku FP4 kernel reproduces the fake-quant quality (LPIPS 0.167 vs 0.173), so the gain is real on the deployed model. (For reference, BFL's official FP8 klein-4B — 8-bit, ~1.2× speedup — sits closer to the teacher at LPIPS 0.080 / PSNR 23.0; ours is the 4-bit/2.5×-kernel point with the low-rank branch closing most of the gap.)
Speed / VRAM (RTX PRO 4500 Blackwell, batch 1, 4 steps, real Nunchaku FP4 kernel)
| 512px | 1024px | |
|---|---|---|
| end-to-end speedup vs bf16 | 1.76× | 1.87× |
| peak VRAM | 11.8 GB (−29%) | 13.7 GB (−26%) |
| per-step | 42 ms (vs 92) | 144 ms (vs 330) |
INT4 is slower than bf16 on Blackwell (no INT4 tensor cores) — NVFP4 is the fast low-bit format here.
Usage
Requires a Blackwell GPU (sm_120), torch≥2.9+cu130, and the Nunchaku FP4 runtime with the FLUX.2 loader.
Batch size must be 1 (packed rotary), and do not wrap generation in torch.autocast (the FP4 kernel
manages its own dtypes). See docs/CUDA_SETUP_RUNBOOK.md and scripts/35_gen_real.py for the exact path.
License
Apache-2.0, inherited from the base model black-forest-labs/FLUX.2-klein-4B.
Limitations
- Blackwell-only for the fast path (NVFP4 tensor cores). Batch=1 in the current fused build.
- Fidelity claim is vs the bf16 teacher, measured at 512px on MJHQ-30k (N=512), klein-4B only.
- Not benchmarked head-to-head against BFL's official NVFP4 checkpoint (its swizzled layout requires BFL's
TensorRT runtime to unpack); see
report/HEADTOHEAD_klein4b_nvfp4.md§5.
Xet Storage Details
- Size:
- 4.06 kB
- Xet hash:
- 0c8aa440bc2690473df80bca2b4f77175a6e671579d7df01860e9f7010c45fd9
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.