---
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-4B
pipeline_tag: text-to-image
tags:
- text-to-image
- flux
- nvfp4
- svdquant
- quantization
- blackwell
- nunchaku
---
# FLUX.2 [klein] 4B — SVDQuant NVFP4 (W4A4, rank-128) — deployable on Blackwell
A **4-bit (NVFP4 W4A4)** quantization of [`black-forest-labs/FLUX.2-klein-4B`](https://huggingface.co/black-forest-labs/FLUX.2-klein-4B)
(the 4-step, CFG-free distilled rectified-flow text-to-image model) that adds an **SVDQuant high-precision
low-rank branch** to stay closer to the bf16 teacher than plain NVFP4, and **runs on the real Nunchaku FP4
kernel on NVIDIA Blackwell** (~1.8× faster, ~29% less VRAM).
> **What this is:** an *ablation-backed, deployable* NVFP4 build of klein-4B. SVDQuant (Li et al., ICLR 2025),
> NVFP4, and the Nunchaku kernel are prior art; the contributions here are (1) a clean, reproducible
> measurement that the low-rank branch improves NVFP4 W4A4 **fidelity to the teacher** on this model,
> (2) a **DeepCompressor-free** bf16→Nunchaku-NVFP4 exporter (no public equivalent for klein), and
> (3) a verified deployable checkpoint on Blackwell.
## Method (calibration-free)
For each target `Linear` layer: `W ≈ L₁L₂ + Q_nvfp4(W − L₁L₂)`, where `L₁L₂` is a rank-128 low-rank branch kept in bf16 and the residual `W − L₁L₂` is quantized to 4-bit NVFP4. Activations are quantized to NVFP4 in groups of 16.
The low-rank branch is a **plain SVD of the weights**, followed by 3 refinement iterations that absorb the 4-bit rounding error. **No activation calibration, smoothing, or whitening** is needed: SmoothQuant is harmful at this bit-width on this model, and per-group NVFP4 activations need no migration.
The branch is free at inference (folded into the fused kernel). Exporter: `flux2distill/nunchaku_export.py`.
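The decomposition above can be illustrated with a small NumPy sketch. It is not the code in `flux2distill/nunchaku_export.py`. It assumes the FP4 E2M1 value grid, keeps the per-group scales in full precision (real NVFP4 stores them in FP8), and reads "refinement" as repeatedly re-fitting the low-rank branch to `W − Q(residual)`. All function names are hypothetical.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1 (sign handled separately).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_nvfp4(x, group=16):
    """Round x to the E2M1 grid with one scale per group of `group` values per row.

    Simplification: scales stay in float64 instead of being stored in FP8.
    """
    rows, cols = x.shape
    g = x.reshape(rows, cols // group, group)
    scale = np.abs(g).max(axis=-1, keepdims=True) / E2M1[-1]
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero
    mag = np.abs(g) / scale
    idx = np.abs(mag[..., None] - E2M1).argmin(axis=-1)  # nearest grid point
    return (np.sign(g) * E2M1[idx] * scale).reshape(rows, cols)

def top_rank(m, rank):
    """Best rank-`rank` approximation of m via truncated SVD."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def svd_lowrank_quant(w, rank, iters=3):
    """Return (low_rank, quantized_residual) with w ~= low_rank + quantized_residual."""
    low = top_rank(w, rank)            # plain SVD of the weights
    q = fake_nvfp4(w - low)            # 4-bit residual
    best = (np.linalg.norm(w - low - q), low, q)
    for _ in range(iters):
        low = top_rank(w - q, rank)    # re-fit the branch to absorb rounding error
        q = fake_nvfp4(w - low)
        err = np.linalg.norm(w - low - q)
        if err < best[0]:              # keep the best iterate seen so far
            best = (err, low, q)
    return best[1], best[2]

# Toy weight: a strong rank-8 component plus small noise.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64)) + 0.05 * rng.normal(size=(64, 64))

low, q = svd_lowrank_quant(w, rank=8)
err_plain = np.linalg.norm(w - fake_nvfp4(w))    # rank-0 baseline
err_branch = np.linalg.norm(w - low - q)         # with low-rank branch
print(f"plain: {err_plain:.3f}  with branch: {err_branch:.3f}")
```

On this toy matrix the low-rank branch removes the large-magnitude component, so the residual falls into a much narrower range before it is rounded to 4 bits. The toy says nothing about how large the gain is on the real model; that is what the results table measures.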
## Results (N=512, MJHQ-30k, 512px, vs the bf16 teacher)
LPIPS, PSNR, and FID are computed against the bf16 teacher's own outputs, so they measure fidelity to the teacher. PickScore and CLIP score stay at teacher level for
both 4-bit models, so quantization preserves semantics; what the low-rank branch buys is **fidelity**.
All numbers use paired seeds. Our **rank-128** model vs **plain NVFP4 rank-0** (same recipe, no low-rank branch):
| metric (vs teacher) | plain NVFP4 r0 | **ours NVFP4 r128** | Δ |
|---|---|---|---|
| LPIPS↓ | 0.2076 | **0.1668** | **−19.7%** (closer) |
| PSNR↑ | 17.44 dB | **18.71 dB** | **+1.27 dB** |
| FID↓ (vs teacher) | 39.3 | **33.5** | **−14.7%** |
| PickScore↑ / CLIP↑ | 21.62 / 30.95 | 21.62 / 30.94 | ≈0 (no semantic loss; teacher = 21.64 / 30.95) |
The real Nunchaku FP4 kernel reproduces the fake-quant quality (LPIPS 0.167 vs 0.173), so the gain is real on
the deployed model. (For reference, BFL's official **FP8** klein-4B — 8-bit, ~1.2× speedup — sits closer to the
teacher at LPIPS 0.080 / PSNR 23.0; ours is the 4-bit/2.5×-kernel point with the low-rank branch closing most of the gap.)
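As an illustration of what a fidelity-to-teacher metric over paired seeds looks like, here is a minimal PSNR sketch. The card does not state the exact protocol, so the data range of `[0, 1]`, the per-pair averaging, and the function names are assumptions rather than the evaluation code behind the table.

```python
import numpy as np

def psnr(reference, test, data_range=1.0):
    """PSNR in dB of `test` against `reference` (same shape, same value range)."""
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(test, dtype=np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))

def mean_paired_psnr(teacher_images, student_images):
    """Average PSNR over (teacher, student) pairs generated from the same seed."""
    scores = [psnr(t, s) for t, s in zip(teacher_images, student_images)]
    return sum(scores) / len(scores)

# Toy stand-ins for generated images: every pixel differs by 0.1.
teacher = [np.zeros((8, 8, 3)), np.full((8, 8, 3), 0.5)]
student = [np.full((8, 8, 3), 0.1), np.full((8, 8, 3), 0.4)]
demo_psnr = mean_paired_psnr(teacher, student)
print(f"{demo_psnr:.2f} dB")
```

Because the reference is the teacher's output for the same prompt and seed, a higher value means the quantized model reproduces the teacher more closely. It is not a measure of image quality in isolation.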
## Speed / VRAM (RTX PRO 4500 Blackwell, batch 1, 4 steps, real Nunchaku FP4 kernel)
| | 512px | 1024px |
|---|---|---|
| end-to-end speedup vs bf16 | **1.76×** | **1.87×** |
| peak VRAM | **11.8 GB** (−29%) | **13.7 GB** (−26%) |
| per-step | 42 ms (vs 92) | 144 ms (vs 330) |
INT4 is **slower than bf16** on Blackwell (no INT4 tensor cores) — NVFP4 is the fast low-bit format here.
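The per-step timings in the table imply a larger speedup than the end-to-end figures. The short calculation below makes that explicit, using only the numbers from the table.

```python
# Per-step latencies in ms, copied from the table: (NVFP4, bf16).
per_step_ms = {"512px": (42, 92), "1024px": (144, 330)}

per_step_speedup = {
    res: round(bf16 / fp4, 2) for res, (fp4, bf16) in per_step_ms.items()
}
print(per_step_speedup)  # {'512px': 2.19, '1024px': 2.29}
```

The gap between these ratios and the 1.76× / 1.87× end-to-end numbers presumably comes from the parts of the pipeline outside the denoising steps, though the card does not break that down.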
## Usage
Requires a Blackwell GPU (sm_120), `torch≥2.9+cu130`, and the Nunchaku FP4 runtime with the FLUX.2 loader.
Batch size must be **1** (packed rotary), and do **not** wrap generation in `torch.autocast` (the FP4 kernel
manages its own dtypes). See `docs/CUDA_SETUP_RUNBOOK.md` and `scripts/35_gen_real.py` for the exact path.
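Before following the runbook, a quick preflight check of the stated requirements can save a failed launch. The sketch below is not part of this repository, and `preflight` is a hypothetical helper. It assumes that `sm_120` corresponds to compute capability `(12, 0)` and reads `cu130` as "a CUDA 13.x build of PyTorch". It does not check for the Nunchaku runtime or load the model.

```python
def preflight(capability, torch_version, cuda_version):
    """Return a list of problems; an empty list means the stated requirements look met."""
    problems = []
    if tuple(capability) != (12, 0):
        problems.append(f"need sm_120, found sm_{capability[0]}{capability[1]}")
    # "2.9.0+cu130" -> (2, 9)
    major, minor = (int(p) for p in torch_version.split("+")[0].split(".")[:2])
    if (major, minor) < (2, 9):
        problems.append(f"need torch >= 2.9, found {torch_version}")
    # Assumption: any CUDA 13.x build satisfies the "cu130" requirement.
    if cuda_version is None or int(cuda_version.split(".")[0]) < 13:
        problems.append(f"need a CUDA 13 build of torch, found {cuda_version}")
    return problems

try:
    import torch
except ImportError:  # lets the helper be imported on machines without PyTorch
    torch = None

if torch is not None and torch.cuda.is_available():
    issues = preflight(
        torch.cuda.get_device_capability(), torch.__version__, torch.version.cuda
    )
    print("environment looks OK" if not issues else "\n".join(issues))
```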
## License
Apache-2.0, inherited from the base model `black-forest-labs/FLUX.2-klein-4B`.
## Limitations
- Blackwell-only for the fast path (NVFP4 tensor cores). Batch=1 in the current fused build.
- Fidelity claim is **vs the bf16 teacher**, measured at 512px on MJHQ-30k (N=512), klein-4B only.
- Not benchmarked head-to-head against BFL's official NVFP4 checkpoint (its swizzled layout requires BFL's
TensorRT runtime to unpack); see `report/HEADTOHEAD_klein4b_nvfp4.md` §5.
