metadata
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-4B
pipeline_tag: text-to-image
tags:
  - text-to-image
  - flux
  - nvfp4
  - svdquant
  - quantization
  - blackwell
  - nunchaku

FLUX.2 [klein] 4B — SVDQuant NVFP4 (W4A4, rank-128) — deployable on Blackwell

A 4-bit (NVFP4 W4A4) quantization of black-forest-labs/FLUX.2-klein-4B, the 4-step, CFG-free distilled rectified-flow text-to-image model. It adds an SVDQuant high-precision low-rank branch to stay closer to the bf16 teacher than plain NVFP4, and it runs on the real Nunchaku FP4 kernel on NVIDIA Blackwell (~1.8× faster end to end, 26–29% less peak VRAM).

What this is: an ablation-backed, deployable NVFP4 build of klein-4B. SVDQuant (Li et al., ICLR 2025), NVFP4, and the Nunchaku kernel are prior art; the contributions here are (1) a clean, reproducible measurement that the low-rank branch improves NVFP4 W4A4 fidelity to the teacher on this model, (2) a DeepCompressor-free bf16→Nunchaku-NVFP4 exporter (no public equivalent for klein), and (3) a verified deployable checkpoint on Blackwell.

Method (calibration-free)

Per target Linear: W ≈ L₁L₂ (bf16 low-rank, rank 128) + Q_nvfp4(W − L₁L₂) (4-bit residual), with activations quantized to NVFP4 in groups of 16. The low-rank branch is a plain SVD of the weights, followed by 3 refinement iterations that absorb the 4-bit rounding error. No activation calibration, smoothing, or whitening is needed: SmoothQuant is harmful at this bit-width on this model, and per-group NVFP4 activations need no migration. The branch is free at inference because it is folded into the fused kernel. Exporter: flux2distill/nunchaku_export.py.
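
The decomposition above can be illustrated with a minimal NumPy sketch. It is not the exporter in flux2distill/nunchaku_export.py. The function names are hypothetical, the NVFP4 step is a fake-quant that keeps each per-group scale in full precision (real NVFP4 stores it in FP8), and the refinement loop is only one plausible reading of "3 iterations of refinement": it alternates between quantizing the residual and refitting the SVD.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (the sign is handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_nvfp4(x, group=16):
    """Round x to the FP4 grid using one scale per group of `group` values.

    Assumption: the scale stays in float64; the last dimension of x must be
    divisible by `group` so that groups never straddle rows.
    """
    g = x.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0.0, 1.0, scale)  # all-zero groups stay zero
    # Nearest grid point for every scaled magnitude.
    idx = np.abs((np.abs(g) / scale)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(g) * FP4_GRID[idx] * scale).reshape(x.shape)


def truncated_svd(m, rank):
    """Best rank-`rank` factors (l1, l2) of m, so that l1 @ l2 approximates m."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]


def svdquant_decompose(w, rank=128, iters=3):
    """Return (l1, l2, q) with w ≈ l1 @ l2 + q, where q is the 4-bit residual."""
    l1, l2 = truncated_svd(w, rank)              # plain SVD of the weights
    for _ in range(iters):                       # hypothetical refinement loop
        q = fake_nvfp4(w - l1 @ l2)              # quantize the current residual
        l1, l2 = truncated_svd(w - q, rank)      # refit to absorb rounding error
    q = fake_nvfp4(w - l1 @ l2)
    return l1, l2, q


# Toy weight: strong rank-8 structure plus small dense noise.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))
w += 0.1 * rng.standard_normal((64, 64))

l1, l2, q = svdquant_decompose(w, rank=8, iters=3)
err_plain = np.linalg.norm(w - fake_nvfp4(w))
err_lowrank = np.linalg.norm(w - (l1 @ l2 + q))
print(f"plain 4-bit error: {err_plain:.3f}, with low-rank branch: {err_lowrank:.3f}")
```

On this toy matrix the low-rank branch removes the dominant directions before quantization, so the 4-bit grid only has to cover a small residual. Real weight matrices are less cleanly low-rank, so the gain there is an empirical question, which is what the results below measure.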

Results (N=512, MJHQ-30k, 512px, vs the bf16 teacher)

Fidelity is measured against the teacher's own outputs (LPIPS, PSNR, and FID vs the teacher). PickScore and CLIP stay at teacher level for both 4-bit models, so quantization preserves semantics; what the low-rank branch buys is fidelity.

Measured at N=512 on MJHQ-30k with paired seeds. Our rank-128 model is compared with plain NVFP4 rank-0 (same recipe, no low-rank branch):

| Metric (vs teacher) | Plain NVFP4 (r0) | Ours, NVFP4 (r128) | Δ |
|---|---|---|---|
| LPIPS↓ | 0.2076 | 0.1668 | −19.7% (closer) |
| PSNR↑ | 17.44 dB | 18.71 dB | +1.27 dB |
| FID↓ (vs teacher) | 39.3 | 33.5 | −14.7% |
| PickScore↑ / CLIP↑ | 21.62 / 30.95 | 21.62 / 30.94 | ≈0 (no semantic loss; teacher = 21.64 / 30.95) |
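
For readers who want to reproduce a fidelity-to-teacher number, the PSNR row can be sketched as follows. This is a minimal illustration, not the evaluation script behind the table: the function names are hypothetical, images are assumed to be float arrays in [0, 1], and averaging per-image PSNR over paired seeds is an assumption about how the reported figure is aggregated.

```python
import numpy as np


def psnr(reference, test, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((reference - test) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


def mean_paired_psnr(teacher_images, student_images, max_value=1.0):
    """Average PSNR over (teacher, student) pairs generated from the same seed."""
    scores = [psnr(t, s, max_value) for t, s in zip(teacher_images, student_images)]
    return float(np.mean(scores))


# Toy pair: the "student" differs from the "teacher" by a constant 0.1 offset.
teacher = np.zeros((8, 8, 3))
student = teacher + 0.1
print(f"PSNR: {psnr(teacher, student):.2f} dB")
```

Because the reference is the teacher's output rather than a ground-truth photo, a higher value here means "closer to what bf16 would have produced", not "better image" in an absolute sense.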

The real Nunchaku FP4 kernel reproduces the fake-quant quality (LPIPS 0.167 vs 0.173), so the gain is real on the deployed model. (For reference, BFL's official FP8 klein-4B, an 8-bit build with ~1.2× speedup, sits closer to the teacher at LPIPS 0.080 / PSNR 23.0. Ours is the 4-bit/2.5×-kernel point, with the low-rank branch closing part of the gap: LPIPS 0.2076 → 0.1668.)

Speed / VRAM (RTX PRO 4500 Blackwell, batch 1, 4 steps, real Nunchaku FP4 kernel)

| | 512px | 1024px |
|---|---|---|
| End-to-end speedup vs bf16 | 1.76× | 1.87× |
| Peak VRAM | 11.8 GB (−29%) | 13.7 GB (−26%) |
| Per-step | 42 ms (vs 92 ms) | 144 ms (vs 330 ms) |

INT4 is slower than bf16 on Blackwell (no INT4 tensor cores) — NVFP4 is the fast low-bit format here.
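
The per-step timings imply a larger ratio than the end-to-end speedup, which the short calculation below makes explicit. It assumes the "vs" figures in the per-step row are the bf16 baseline; the helper name is hypothetical. A likely reason for the difference is that an end-to-end run also includes stages that are not quantized, but the card does not break this down.

```python
def per_step_speedup(bf16_ms, fp4_ms):
    """Ratio of bf16 step time to FP4 step time."""
    return bf16_ms / fp4_ms


# (bf16 ms, FP4 ms) per denoising step, and the reported end-to-end speedup.
timings = {"512px": (92, 42, 1.76), "1024px": (330, 144, 1.87)}

for resolution, (bf16_ms, fp4_ms, end_to_end) in timings.items():
    ratio = per_step_speedup(bf16_ms, fp4_ms)
    print(f"{resolution}: per-step {ratio:.2f}x, end-to-end {end_to_end:.2f}x")
```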

Usage

Requires a Blackwell GPU (sm_120), torch≥2.9+cu130, and the Nunchaku FP4 runtime with the FLUX.2 loader. Batch size must be 1 (packed rotary), and do not wrap generation in torch.autocast (the FP4 kernel manages its own dtypes). See docs/CUDA_SETUP_RUNBOOK.md and scripts/35_gen_real.py for the exact path.
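
Before following the runbook, the stated requirements can be checked with a small preflight script. This is a sketch under assumptions: the helper names are hypothetical, "sm_120" is taken to mean compute capability (12, 0), and "torch≥2.9+cu130" is read as PyTorch 2.9 or newer built against CUDA 13.0 or newer. It does not load the model; the Nunchaku loader itself is covered by scripts/35_gen_real.py.

```python
def parse_major_minor(version):
    """Extract (major, minor) from strings such as '2.9.0+cu130' or '13.0'."""
    core = version.split("+")[0]          # drop a local build tag like '+cu130'
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def check_environment(capability, torch_version, cuda_version):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if tuple(capability) != (12, 0):
        problems.append(f"GPU compute capability is {tuple(capability)}, expected (12, 0)")
    if parse_major_minor(torch_version) < (2, 9):
        problems.append(f"torch {torch_version} is older than 2.9")
    if cuda_version is None or parse_major_minor(cuda_version) < (13, 0):
        problems.append(f"torch CUDA build is {cuda_version}, expected 13.0 or newer")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        if not torch.cuda.is_available():
            print("No CUDA device is visible")
        else:
            found = check_environment(
                torch.cuda.get_device_capability(),  # (major, minor) of device 0
                torch.__version__,
                torch.version.cuda,                  # e.g. '13.0', or None on CPU builds
            )
            print("\n".join(found) or "Environment looks compatible")
```

Keeping the checks in a pure function means they can be exercised on a machine without a GPU; only the `__main__` branch touches PyTorch.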

License

Apache-2.0, inherited from the base model black-forest-labs/FLUX.2-klein-4B.

Limitations

  • Blackwell-only for the fast path (NVFP4 tensor cores). Batch=1 in the current fused build.
  • Fidelity claim is vs the bf16 teacher, measured at 512px on MJHQ-30k (N=512), klein-4B only.
  • Not benchmarked head-to-head against BFL's official NVFP4 checkpoint (its swizzled layout requires BFL's TensorRT runtime to unpack); see report/HEADTOHEAD_klein4b_nvfp4.md §5.
