Buckets:

Mercity
/

FluxDistill

Files

xet

Mercity/FluxDistill / model-card.md

Pranav2748

20 days ago

preview code

download

raw

4.06 kB

	---
	license: apache-2.0
	base_model: black-forest-labs/FLUX.2-klein-4B
	pipeline_tag: text-to-image
	tags:
	- text-to-image
	- flux
	- nvfp4
	- svdquant
	- quantization
	- blackwell
	- nunchaku
	---

	# FLUX.2 [klein] 4B — SVDQuant NVFP4 (W4A4, rank-128) — deployable on Blackwell

	A 4-bit (NVFP4 W4A4) quantization of [`black-forest-labs/FLUX.2-klein-4B`](https://huggingface.co/black-forest-labs/FLUX.2-klein-4B)
	(the 4-step, CFG-free distilled rectified-flow text-to-image model) that adds a **SVDQuant high-precision
	low-rank branch to stay closer to the bf16 teacher than plain NVFP4, and runs on the real Nunchaku FP4
	kernel on NVIDIA Blackwell** (~1.8× faster, ~29% less VRAM).

	> What this is: an ablation-backed, deployable NVFP4 build of klein-4B. SVDQuant (Li et al., ICLR 2025),
	> NVFP4, and the Nunchaku kernel are prior art; the contributions here are (1) a clean, reproducible
	> measurement that the low-rank branch improves NVFP4 W4A4 fidelity to the teacher on this model,
	> (2) a DeepCompressor-free bf16→Nunchaku-NVFP4 exporter (no public equivalent for klein), and
	> (3) a verified deployable checkpoint on Blackwell.

	## Method (calibration-free)

	Per target Linear: `W ≈ L₁L₂ (bf16 low-rank, rank 128) + Q_nvfp4(W − L₁L₂)` (4-bit residual), activations
	NVFP4 per-group-16. The low-rank branch is a plain SVD of the weights with 3 iterations of refinement to
	absorb the 4-bit rounding error — no activation calibration / no smoothing / no whitening is needed
	(SmoothQuant is harmful at this bit-width on this model; per-group NVFP4 activations need no migration).
	The branch is free at inference (folded into the fused kernel). Exporter: `flux2distill/nunchaku_export.py`.

	## Results (N=512, MJHQ-30k, 512px, vs the bf16 teacher)

	Fidelity-to-teacher (LPIPS/PSNR/FID vs the teacher's own outputs); PickScore/CLIP stay at teacher level for
	both 4-bit models, so quantization preserves semantics — the low-rank branch buys fidelity.

	Measured N=512, MJHQ-30k, paired seeds. Our rank-128 model vs plain NVFP4 rank-0 (same recipe, no low-rank branch):

	\| metric (vs teacher) \| plain NVFP4 r0 \| ours NVFP4 r128 \| Δ \|
	\|---\|---\|---\|---\|
	\| LPIPS↓ \| 0.2076 \| 0.1668 \| −19.7% (closer) \|
	\| PSNR↑ \| 17.44 dB \| 18.71 dB \| +1.27 dB \|
	\| FID↓ (vs teacher) \| 39.3 \| 33.5 \| −14.7% \|
	\| PickScore↑ / CLIP↑ \| 21.62 / 30.95 \| 21.62 / 30.94 \| ≈0 (no semantic loss; teacher = 21.64 / 30.95) \|

	The real Nunchaku FP4 kernel reproduces the fake-quant quality (LPIPS 0.167 vs 0.173), so the gain is real on
	the deployed model. (For reference, BFL's official FP8 klein-4B — 8-bit, ~1.2× speedup — sits closer to the
	teacher at LPIPS 0.080 / PSNR 23.0; ours is the 4-bit/2.5×-kernel point with the low-rank branch closing most of the gap.)

	## Speed / VRAM (RTX PRO 4500 Blackwell, batch 1, 4 steps, real Nunchaku FP4 kernel)

	\| \| 512px \| 1024px \|
	\|---\|---\|---\|
	\| end-to-end speedup vs bf16 \| 1.76× \| 1.87× \|
	\| peak VRAM \| 11.8 GB (−29%) \| 13.7 GB (−26%) \|
	\| per-step \| 42 ms (vs 92) \| 144 ms (vs 330) \|

	INT4 is slower than bf16 on Blackwell (no INT4 tensor cores) — NVFP4 is the fast low-bit format here.

	## Usage

	Requires a Blackwell GPU (sm_120), `torch≥2.9+cu130`, and the Nunchaku FP4 runtime with the FLUX.2 loader.
	Batch size must be 1 (packed rotary), and do not wrap generation in `torch.autocast` (the FP4 kernel
	manages its own dtypes). See `docs/CUDA_SETUP_RUNBOOK.md` and `scripts/35_gen_real.py` for the exact path.

	## License

	Apache-2.0, inherited from the base model `black-forest-labs/FLUX.2-klein-4B`.

	## Limitations

	- Blackwell-only for the fast path (NVFP4 tensor cores). Batch=1 in the current fused build.
	- Fidelity claim is vs the bf16 teacher, measured at 512px on MJHQ-30k (N=512), klein-4B only.
	- Not benchmarked head-to-head against BFL's official NVFP4 checkpoint (its swizzled layout requires BFL's
	TensorRT runtime to unpack); see `report/HEADTOHEAD_klein4b_nvfp4.md` §5.

Xet Storage Details

Size:: 4.06 kB
Xet hash:: 0c8aa440bc2690473df80bca2b4f77175a6e671579d7df01860e9f7010c45fd9

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.