Mercity/FluxDistill / RESULTS.md

Pranav2748

16 days ago


	# Results — FLUX.2 klein 4B -> compressed student

	Eval = held-out velocity-matching loss vs teacher (lower=closer; same fixed first-16 batch
	across all rows, so quant and surgery sit on ONE axis). wall=measured s/img @512/4-step
	batch-1 on A100; flop=estimated transformer-only ratio.
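A minimal sketch of how the two headline metrics could be computed, assuming `eval-loss` is a mean-squared error between student and teacher velocity predictions and `vel-relerr` is a relative L2 norm over the batch. The function names, the MSE form and the synthetic tensors are illustrative assumptions; the log itself only names the metrics.

```python
import numpy as np

def velocity_eval_loss(v_student, v_teacher):
    """Mean-squared error between student and teacher velocities (assumed form)."""
    diff = np.asarray(v_student, dtype=np.float64) - np.asarray(v_teacher, dtype=np.float64)
    return float(np.mean(diff ** 2))

def velocity_rel_err(v_student, v_teacher):
    """Relative L2 error ||v_s - v_t|| / ||v_t|| over the whole batch."""
    v_s = np.asarray(v_student, dtype=np.float64)
    v_t = np.asarray(v_teacher, dtype=np.float64)
    return float(np.linalg.norm(v_s - v_t) / np.linalg.norm(v_t))

# Synthetic stand-in for a fixed 16-sample batch of velocity predictions.
rng = np.random.default_rng(0)
v_teacher = rng.standard_normal((16, 64))
v_student = v_teacher + 0.1 * rng.standard_normal((16, 64))

loss = velocity_eval_loss(v_student, v_teacher)
rel = velocity_rel_err(v_student, v_teacher)
print(f"eval-loss={loss:.4f}  vel-relerr={rel:.4f}")
```

Because both numbers are averages over only 16 samples, any change to the inputs (prompt embeddings, noise draws, attention kernels) moves them, which is why the log insists on same-box comparisons.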

	## ★★ 2026-06-14 — NVFP4 HEAD-TO-HEAD (image-space metrics, N=512) — separate axis

	A matched, paired comparison on N=512 MJHQ-30k prompts, 512px, 4 steps, guidance 1.0, seed=idx,
	using image-space fidelity-to-teacher metrics (LPIPS/PSNR/FID vs the teacher's own outputs) + a
	PickScore/CLIP semantic check. This is NOT the velocity-loss axis above — do not compare the
	numbers across sections; compare only within this table. All models share the same Qwen3 TE + VAE
	(only the transformer quant varies). Full write-up: `report/HEADTOHEAD_klein4b_nvfp4.md`; raw numbers
	`outputs/eval/h2h/metrics.json`; speed `outputs/nvfp4/benchmark_headtohead.json`.

	| model | bits | PickScore↑ | CLIP↑ | LPIPS↓ | PSNR↑ (dB) | FID↓ (vs teacher) | FID (vs real) | real-kernel speed |
	|-------|------|-----------|-------|--------|-------|---------------|-----------|-------------------|
	| A teacher (bf16) | 16 | 21.64 | 30.95 | — | — | — | 89.6 | 1.0× (0.464 s/img @512) |
	| D plain NVFP4 r0 | W4A4 | 21.62 | 30.95 | 0.2076 | 17.44 | 39.31 | 88.5 | — (fake-q) |
	| ours r128 (fake-q) | W4A4 | 21.61 | 30.99 | 0.1732 | 18.50 | 33.37 | 89.6 | — (fake-q) |
	| C ours r128 (REAL Nunchaku kernel) | W4A4 | 21.62 | 30.94 | 0.1668 | 18.71 | 33.54 | 90.1 | 1.76×@512 / 1.87×@1024, 12.6 GB |
	| E BFL official FP8 | W8A8 | 21.65 | 30.94 | 0.0798 | 23.02 | 18.81 | 89.6 | — (needs TensorRT) |
	| Δ low-rank branch (r0→r128) | | ≈0 | ≈0 | −19.7% | +1.27 dB | −14.7% | ~flat | — |

	Findings. (1) The SVDQuant low-rank branch helps at NVFP4 W4A4 — r0→r128: LPIPS −19.7%,
	PSNR +1.27 dB, FID-vs-teacher −14.7% (fake-q-vs-fake-q ablation agrees: −16.6% / +1.06 dB / −15.1%);
	reproduces the prior N=256 result at N=512. (2) **The real Nunchaku FP4 kernel reproduces (slightly
	beats) the fake-quant** (LPIPS 0.167 vs 0.173) → the gain holds on the deployed model. (3) No
	semantic loss: PickScore/CLIP flat across all rows incl. teacher. (4) **BFL official FP8** is closest to
	teacher (8-bit; high-precision/low-speedup point) vs our 4-bit/2.5×-kernel point. (5) **FID-vs-real is
	~flat (~88–90 incl. teacher)** — tracks the klein-vs-MJHQ style gap, not quant; reported, not
	discriminating. (6) BFL official NVFP4 (model B) could not be run — cutlass tensor-core swizzled
	layout, needs BFL's TensorRT runtime (see `report/HEADTOHEAD_klein4b_nvfp4.md §5`); D (our plain r0) is
	the labeled controlled "plain NVFP4" stand-in, E is the real BFL baseline. No BFL number was fabricated.
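The Δ row and the fake-q-vs-fake-q ablation quoted in finding (1) can be recomputed directly from the table. The sketch below only re-derives those deltas from the rounded table values; the dictionary keys and helper names are illustrative.

```python
# Rows copied from the head-to-head table: LPIPS, PSNR (dB), FID vs teacher.
plain_r0 = {"lpips": 0.2076, "psnr": 17.44, "fid_teacher": 39.31}   # D, fake-quant, rank 0
ours_fake = {"lpips": 0.1732, "psnr": 18.50, "fid_teacher": 33.37}  # r128, fake-quant
ours_real = {"lpips": 0.1668, "psnr": 18.71, "fid_teacher": 33.54}  # C, real Nunchaku kernel

def pct_change(new, old):
    """Signed relative change in percent."""
    return 100.0 * (new - old) / old

def summarize(new, old):
    """LPIPS and FID as relative change, PSNR as an absolute dB difference."""
    return {
        "lpips_pct": round(pct_change(new["lpips"], old["lpips"]), 1),
        "psnr_db": round(new["psnr"] - old["psnr"], 2),
        "fid_pct": round(pct_change(new["fid_teacher"], old["fid_teacher"]), 1),
    }

real_vs_plain = summarize(ours_real, plain_r0)   # the table's Δ row
fake_vs_plain = summarize(ours_fake, plain_r0)   # the fake-q-vs-fake-q ablation
print(real_vs_plain)
print(fake_vs_plain)
```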

	## ★ 2026-06-13 — NVFP4 (Blackwell-native FP4) + first REAL kernel speed (Nunchaku)

	Two things landed this day: (1) NVFP4 added to our fake-quant SVDQuant (`flux2distill/svdquant.py`)
	and swept; (2) the first real low-bit kernel speed numbers, by calling Nunchaku's compiled
	NVFP4 W4A4 GEMM directly. NVFP4 beats INT4 on BOTH quality and speed — it's the format for this
	box (and any Blackwell / RTX 50 / B200). Same eval axis as the 2026-06-10 cells below.

	NVFP4 format: E2M1 elements (the 8 magnitudes {0,.5,1,1.5,2,3,4,6}·sign) + group-16 blocks +
	FP8(E4M3) block scales, applied to the 4-bit residual weights (and optionally activations). The
	low-rank branch stays bf16 — it is the high-precision error-/outlier-absorbing path. Knobs:
	`WFMT={int,nvfp4}`, `AFMT={int,nvfp4,fp8}` on `scripts/12` (driver `scripts/run_nvfp4_cell.sh`).
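The format described above can be simulated in a few lines. This is a simplified fake-quant sketch, not the code in `flux2distill/svdquant.py`: the function names are hypothetical, the E4M3 rounding ignores subnormals, and the per-tensor scale `alpha` is chosen here so that the largest block scale maps to 448 (an assumption).

```python
import numpy as np

# The 8 E2M1 magnitudes; the sign is handled separately.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_e4m3(x):
    """Round positive values to 4 significant bits and clamp to [2**-6, 448].
    Simplified stand-in for FP8 E4M3 (no subnormals)."""
    x = np.clip(x, 2.0 ** -6, 448.0)
    mant, exp = np.frexp(x)                  # x = mant * 2**exp, mant in [0.5, 1)
    mant = np.round(mant * 16.0) / 16.0      # 1 implicit + 3 explicit mantissa bits
    return np.clip(np.ldexp(mant, exp), 2.0 ** -6, 448.0)

def nvfp4_fake_quant(w, group=16):
    """Quantize-dequantize w with E2M1 elements and one E4M3 scale per group."""
    w = np.asarray(w, dtype=np.float64)
    rows, cols = w.shape
    assert cols % group == 0, "columns must be a multiple of the group size"
    g = w.reshape(rows, cols // group, group)
    # Per-tensor scale so block scales fit the E4M3 range (assumed convention).
    alpha = max(float(np.abs(w).max()) / (6.0 * 448.0), 1e-30)
    block_absmax = np.abs(g).max(axis=-1, keepdims=True)
    scale = round_to_e4m3(block_absmax / 6.0 / alpha) * alpha
    scaled = np.abs(g) / scale
    # Nearest grid magnitude; values above 6 snap to 6 (clipping).
    idx = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(g) * E2M1_GRID[idx] * scale
    return q.reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 128))
w_q = nvfp4_fake_quant(w)
rel_err = np.linalg.norm(w_q - w) / np.linalg.norm(w)
print(f"weight-recon rel-err on Gaussian weights: {rel_err:.3f}")
```

The grid is denser near zero than a uniform INT4 grid, which is one plausible reading of why the small group plus float grid helps on weights with a few large entries.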

	### Quality — NVFP4 sweep (klein-4B, plain+refine, no-smooth, same held-out velocity loss)

	| # | weights | acts | rank | wrecon mean | eval-loss | vs reference |
	|---|---------|------|------|-------------|---------------|--------------|
	| 1 | NVFP4 g16 | NVFP4 g16 | 32 | 0.0865 | 0.0390 | — |
	| 2 | NVFP4 g16 | NVFP4 g16 | 64 | 0.0817 | 0.0364 | INT4 W4A4 r64 = 0.0742 (2.0× worse) |
	| 3 | NVFP4 g16 | NVFP4 g16 | 128 | 0.0742 | 0.0303 | INT4 W4A4 r128 = 0.0610 (2.0×); ≈ INT4 W4A8 champ 0.0297 |
	| 4 | NVFP4 g16 | FP8 E4M3 | 64 | 0.0817 | 0.0204 | INT4 W4A8 r64 = 0.0297 (prior overall best) |
	| 5 | NVFP4 g16 | FP8 E4M3 | 128 | 0.0742 | 0.0169 ★ | −43% vs the 0.0297 INT4 W4A8 champion |

	NEW OVERALL QUANT CHAMPION: NVFP4-weights + FP8-acts, r128 = 0.0169. Dirs `outputs/nvfp4_*`.

	Findings: (1) NVFP4 weights ≫ INT4 weights — ~2× lower loss at matched rank; driver is the finer
	group-16 + E2M1 float grid (unit test: outlier column 0.064 vs INT4-g64 0.115). (2) **NVFP4 W4A4 r128
	(0.0303) matches the INT4 W4A8 champion (0.0297)** while keeping activations at 4-bit — full W4A4 at
	8-bit-act quality. (3) FP8 acts buy more quality (0.0169) but cost speed (below). Visually the
	champions are teacher-indistinguishable on the text + hand probes (the quant-sensitive ones).

	### Speed — REAL kernels, klein-4B layer shapes, T=1536, RTX PRO 4500 Blackwell (sm_120)

	`scripts/23_nvfp4_kernel_bench.py` calls Nunchaku's compiled `svdq_gemm_w4a4_cuda` (NVFP4 W4A4) at
	each of klein-4B's 5 distinct Linear shapes, summed over all 100 Linears. FP8 row = `torch._scaled_mm`
	(NOT Nunchaku — a proxy for the W4+FP8-act variant, which has no fused kernel on Blackwell).

	| path | ms/step | speedup vs bf16 | deploys (eval-loss) |
	|------|---------|---------------------|---------|
	| bf16 | 73.8 | 1.00× | baseline |
	| NVFP4 W4A4 r64 (real Nunchaku kernel) | 26.8 | 2.75× | cell 2 (0.0364) |
	| NVFP4 W4A4 r128 (real Nunchaku kernel) | 29.7 | 2.49× | cell 3 (0.0303) |
	| FP8 proxy r64 | 60.3 | 1.22× | cell 4 (0.0204) |
	| FP8 proxy r128 | 60.8 | 1.21× | cell 5 (0.0169) |

	Matches the 9B end-to-end Nunchaku number (FP4 254 ms/step = 2.69× vs bf16 684; full pipeline
	1.29 s/img @1024²/4-step, 24.95 GB). Rank tax (real W4A4 kernel): r64→r128 ≈ 11% (2.75×→2.49×)
	for the 0.0364→0.0303 quality gain — rank is a quality knob with a small, real speed cost.
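The speedup column and the rank tax follow from the ms/step column. The sketch below recomputes them from the rounded table values, so the last digit can differ slightly from the table (which was presumably computed from unrounded timings).

```python
# ms/step values copied from the speed table above.
ms_per_step = {
    "bf16": 73.8,
    "nvfp4_w4a4_r64": 26.8,
    "nvfp4_w4a4_r128": 29.7,
    "fp8_proxy_r64": 60.3,
    "fp8_proxy_r128": 60.8,
}

def speedup_vs_bf16(name):
    """Ratio of bf16 time to the named path's time (higher is faster)."""
    return ms_per_step["bf16"] / ms_per_step[name]

speedups = {name: speedup_vs_bf16(name) for name in ms_per_step}

# Rank tax: extra time per step when going from rank 64 to rank 128.
rank_tax_pct = 100.0 * (ms_per_step["nvfp4_w4a4_r128"] / ms_per_step["nvfp4_w4a4_r64"] - 1.0)

for name, value in speedups.items():
    print(f"{name:18s} {value:.2f}x")
print(f"rank tax r64 -> r128: {rank_tax_pct:.1f}% more time per step")
```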

	End-to-end (real Nunchaku kernel, klein-4B Linears swapped to NVFP4 W4A4, rest bf16): bf16 →
	NVFP4 is 1.24× @512px / 1.18× @1024px with ~28% less VRAM (16.9→12.0 GB @512). Far below the
	2.5× per-layer GEMM because attention is bf16 + O(N²) and dominates (more so at 1024px), VAE +
	text-encode are fixed bf16 overhead, and a Linears-only swap is unfused. The 9B Nunchaku FULL pipeline
	hit 2.69× precisely because it ALSO fuses attention/quant and is more GEMM-heavy — so the lever for a
	real 4B speedup is the fully-fused NunchakuFlux2 model (fused attention), not just quantized Linears.
	Rank tax is negligible end-to-end here (low-rank branch tiny vs the rest). `scripts/24_nunchaku_e2e_speed.py`.
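The gap between the 2.5× per-layer GEMM speedup and the ~1.2× end-to-end number is what Amdahl's law predicts when only part of the runtime is accelerated. As a back-of-envelope sketch (not a measurement), the code below inverts Amdahl's law to estimate what share of bf16 wall time the swapped Linears would have to occupy. It assumes the swap adds no overhead, which the unfused path likely violates, so the real share is probably somewhat higher.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the runtime is accelerated by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

def implied_fraction(overall_speedup, local_speedup):
    """Invert Amdahl's law: runtime share that must have been accelerated."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / local_speedup)

GEMM_SPEEDUP = 2.5                               # per-layer figure quoted above
share_512 = implied_fraction(1.24, GEMM_SPEEDUP)
share_1024 = implied_fraction(1.18, GEMM_SPEEDUP)

print(f"implied Linear share of runtime @512px:  {share_512:.0%}")
print(f"implied Linear share of runtime @1024px: {share_1024:.0%}")
```

Under these assumptions the accelerated share shrinks as resolution grows, consistent with attention taking a larger slice at 1024px.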

	★ FULLY-FUSED end-to-end (the real lever): converting klein-4B to the `NunchakuFlux2Transformer2DModel`
	fused path (`fused_qkv_norm_rottary` + `attention_fp16` + W4A4 GEMM, no bf16 SDPA) gives **1.76× @512px
	and 1.87× @1024px** (per-step 2.2×/2.29×; VRAM −26%: 18.6→13.7 GB @1024). This is the real number:
	fusing attention lifts the Linears-only 1.2× to ~1.9×, and unlike the Linears-only swap it improves
	with resolution. The transformer step (2.3×) ≈ the per-layer GEMM (2.5×); end-to-end is capped by the
	4B's VAE+text-encode (~30% of its small total). `scripts/25_fused_4b_speed.py` (dummy weights — speed is
	value-independent; built from source via `/workspace/build_nunchaku`). Deployable single artifact (fast + correct weights) — ✅ DONE.

	★ DEPLOYABLE NVFP4-fused klein-4B (correct weights + real kernel). Wrote our own NVFP4 weight
	exporter (`flux2distill/nunchaku_export.py`): per-Linear `SVD low-rank + iterative refine → NVFP4
	residual (E2M1 codes + per-group-16 FP8 wscales + per-tensor alpha, wcscales=1)`, packed into
	Nunchaku's MMA layout via their `pack_weight`/`pack_micro_scale`/`pack_lowrank_weight`. **Convention
	validated** (`scripts/26`): with pre-quantized acts the real kernel reconstructs our intended weight to
	2.99%. `scripts/27_convert_full_4b.py` converts all 120 fused Linears (handling the qkv fusion +
	the single-block `to_qkv_mlp_proj`/`to_out` splits), loads them into `NunchakuFlux2Transformer2DModel`,
	and generates correct images on the real FP4 kernel — teacher-indistinguishable on the text
	("THE OPEN PAGE" legible) and hand probes (montages `outputs/nvfp4/deploy/`). This is the full
	deployable model: NVFP4 quality (≈0.0303) + the fused ~1.9× speed + −26% VRAM in one artifact.
	(NB the fused packed-rotary path requires batch=1 generation.)

	The quality↔speed fork (both deployable on Blackwell, pick one):
	- Speed: NVFP4 W4A4 r128 → 0.0303 @ 2.49× (real Nunchaku kernel, loads today). r64 → 0.0364 @ 2.75×.
	- Quality: NVFP4-W + FP8-A r128 → 0.0169, but FP8 compute caps at ~2× bf16 (FP8 tensor cores
	are half FP4 throughput) and has no fused kernel (1.2× measurable today). The tradeoff is physical.

	Why INT4 is the wrong format on Blackwell (hardware, sourced): the 5th-gen tensor cores natively do
	FP4/FP6/FP8/INT8/BF16/… but dropped INT4 (Turing/Ampere/Ada had INT4 IMMA; sm_120 doesn't). Nunchaku
	ships INT4 for Turing/Ampere/Ada and NVFP4 for Blackwell; `get_precision()` returns `fp4` here.
	Forcing `svdq-int4` on this card → 1677 ms/step (slower than bf16). **So INT4 W4A4 stays the deployable
	format for the huge RTX 20/30/40 base; NVFP4 is for Blackwell/B200** — complementary, one per generation.
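The one-format-per-generation rule can be written as a small lookup on CUDA compute capability. This is an illustrative mapping based only on the paragraph above; `pick_w4a4_format` is a hypothetical helper, not Nunchaku's `get_precision()` logic, and the exact set of supported capabilities is an assumption.

```python
# Compute capabilities for the generations named above (assumed list).
INT4_CAPABILITIES = {
    (7, 5),  # Turing
    (8, 0),  # Ampere (A100)
    (8, 6),  # Ampere (RTX 30)
    (8, 9),  # Ada (RTX 40)
}
NVFP4_MAJORS = {10, 12}  # Blackwell: sm_100 (B200) and sm_120 (RTX 50 / RTX PRO)

def pick_w4a4_format(major, minor):
    """Return the 4-bit format with native tensor-core support, or 'unsupported'."""
    if (major, minor) in INT4_CAPABILITIES:
        return "int4"
    if major in NVFP4_MAJORS:
        return "nvfp4"
    return "unsupported"

for capability in [(7, 5), (8, 9), (12, 0)]:
    print(capability, "->", pick_w4a4_format(*capability))
```

On a live machine the capability tuple could come from `torch.cuda.get_device_capability()`, which returns `(major, minor)` for the current device.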

	## ⚠️ 2026-06-10 — box rebuilt AGAIN; eval axis SHIFTED; old numbers not comparable

	The box was rebuilt (A100-80GB → RTX PRO 4500 Blackwell 32 GB, system python, torch
	2.12+cu130, transformers 5.9→5.10.2, diffusers git Jun-1→Jun-10). Re-evaluating the UNCHANGED
	grid-best checkpoint (`r64 plain+refine`, zero missing/unexpected keys on load) gives
	0.0325 (vel-relerr 0.1661) vs the recorded 0.0446 (0.1896) — a −27% instrument shift,
	NOT a model change. Cause: the 16-sample eval ran through different SDPA kernels, different
	Qwen3 prompt embeddings (transformers bump), and possibly different seeded σ/noise draws; with
	N=16 that easily moves the mean. Rule: numbers are only comparable WITHIN one box era. The
	4×3 grid + mechanism-ablation tables below are the OLD axis (still internally consistent);
	every 2026-06-10+ experiment is compared against same-box re-evals. Re-anchored baseline:
	r64 plain+refine (α=0.5, smoothed) = 0.0325 (`outputs/recheck_r64_plain_refine`, montages
	in its `eval/`).

	Montage read of that baseline on the new box (teacher|quant, 8 probes): texture/scene/large-text
	probes (storefront, lake, neon, spiderweb, fisherman) ≈ indistinguishable; the quant **misses the
	counting probe (2 fried eggs vs the teacher's correct 3) and mangles the hand probe**
	(folded/merged fingers); chalkboard small-print is gibberish in BOTH (teacher limitation). Counting + hand
	are the sensitive probes to watch across the SMOOTH=0 ablation.

	### ✅ SMOOTH=0 ablation (2026-06-10, new box) — CONFIRMED: drop SmoothQuant at W4A8

	The #1 queued experiment, run at 3 ranks (plain+refine, 300-calib, all new-axis; each SMOOTH=0
	build compared against a same-box re-eval of its smoothed α=0.5 twin). Dirs:
	`outputs/abl_c300_r{16,32,64}_plain_refine_nosmooth` vs `outputs/recheck_r{16,32,64}_plain_refine`.

	| rank | smoothed α=0.5 (re-eval) | SMOOTH=0 | Δ eval-loss | wrecon mean (sm → ns) | ns wrecon max |
	|------|--------------------------|--------------|-------------|------------------------|---------------|
	| 16 | 0.0405 / rel 0.1855 | 0.0348 / 0.1719 | −14.1% | 0.1251 → 0.1041 | 0.1379 |
	| 32 | 0.0362 / rel 0.1753 | 0.0331 / 0.1675 | −8.6% | 0.1193 → 0.1003 | 0.1286 |
	| 64 | 0.0325 / rel 0.1661 | 0.0297 / 0.1588 | −8.6% | 0.1110 → 0.0944 | 0.1124 |

	Findings:
	1. SMOOTH=0 wins at every rank — new overall best: r64 plain+refine no-smooth = 0.0297.
	`SMOOTH=0` is now the DEFAULT for all W4A8 builds.
	2. The win grows as rank shrinks (−8.6% at r64/r32 → −14.1% at r16): the SVD branch was
	partly compensating smoothing damage, and with less capacity more damage shows through.
	Confirms the mechanism from the rank-0 ablation (smoothing widens the 4-bit weight spread).
	3. No-smooth buys ~one rank tier for free: ns-r16 (0.0348) beats smoothed-r32 (0.0362);
	ns-r32 (0.0331) ≈ smoothed-r64 (0.0325). I.e. same quality at ~2–5% more compression
	(`smaller` is 3.43× / 3.59× / 3.67× at r64 / r32 / r16 in the A100-era grid).
	4. Visual (8-probe montages): the hand probe — the most quant-sensitive — is visibly FIXED
	vs the smoothed baseline (plausible spread fingers at all 3 ranks vs mangled/merged at the
	smoothed r64 baseline). The counting probe (3 eggs) is still missed by EVERY quant (2 eggs),
	smoothed or not, all ranks — a rank/smooth-independent semantic drift. Large in-image text
	is always preserved.
	5. wrecon improves 15–17% at every rank and the worst layer drops to 0.11–0.14 (vs 0.15–0.26
	smoothed), exactly the predicted mechanism.

	### ✅ α sweep (2026-06-10) — the dial has NO good setting; low α is the WORST
	α ∈ {off, 0.1, 0.25, 0.5} × r{32,64}, plain+refine, 300-calib, new axis
	(dirs `outputs/abl_c300_r{32,64}_plain_refine_a{10,25}`; off/0.5 cells from above):

	| α | r64 eval-loss | r32 eval-loss | r64 wrecon | r32 wrecon |
	|---------|---------------|---------------|------------|------------|
	| off | 0.0297 | 0.0331 | 0.0944 | 0.1003 |
	| 0.1 | 0.0380 ⚠ worst | 0.0408 ⚠ worst | 0.0999 | 0.1057 |
	| 0.25 | 0.0317 | 0.0349 | 0.1007 | 0.1069 |
	| 0.5 | 0.0325 | 0.0362 | 0.1110 | 0.1193 |

	Identical ordering at both ranks (off < 0.25 < 0.5 ≪ 0.1): a replicated non-monotone curve in α (best with smoothing off, worst at the smallest nonzero dose):
	1. No α beats off — every dose of migration hurts at W4A8; `SMOOTH=0` is the permanent default.
	2. α=0.1 is the worst point, not the safest: at low α the factor ≈ `1/max|W|^(1-α)` (the
	weight-equalizing extreme). wrecon barely moves (0.0999 vs 0.1007 at r64) yet eval-loss jumps
	+20% — the rescale rotates residual quant error into output-relevant directions. The sharpest
	demonstration yet that weight-recon ≠ the eval objective; never tune α by wrecon.
	3. Montages agree: the hand probe (cleanly rendered at SMOOTH=0) regresses to merged fingers at
	α=0.1. Counting (3 eggs) fails everywhere, α-independent.
	4. Determinism note: a repeat eval of an unchanged cell reproduces its loss exactly (0.0331 →
	0.0331), and rebuilt cells reproduce wrecon bit-for-bit — cell deltas on this box are real,
	not run-to-run noise.
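For reference, the α dial swept above is the standard SmoothQuant migration factor, s_j = max|X_j|^α / max|W_j|^(1−α) per input channel. The sketch below uses synthetic data and hypothetical variable names; it shows that the rescale is exact in full precision, that α=0.5 balances activation and weight ranges, and that α→0 collapses to the weight-equalizing extreme mentioned in finding 2.

```python
import numpy as np

def smooth_scales(x_absmax, w_absmax, alpha):
    """SmoothQuant per-input-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    return (x_absmax ** alpha) / (w_absmax ** (1.0 - alpha))

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 8))      # (tokens, in_channels), synthetic activations
X[:, 3] *= 60.0                       # one outlier activation channel
W = rng.standard_normal((8, 4))       # (in_channels, out_channels), y = X @ W

x_absmax = np.abs(X).max(axis=0)
w_absmax = np.abs(W).max(axis=1)

s = smooth_scales(x_absmax, w_absmax, alpha=0.5)
X_s = X / s                           # activations get easier to quantize
W_s = W * s[:, None]                  # weights absorb the outlier scale

# The product is unchanged in full precision; only the quantization difficulty moves.
assert np.allclose(X_s @ W_s, X @ W)

print("weight row absmax before:", np.round(w_absmax, 2))
print("weight row absmax after :", np.round(np.abs(W_s).max(axis=1), 2))

# alpha -> 0: every weight row is rescaled to absmax 1, regardless of the activations.
s0 = smooth_scales(x_absmax, w_absmax, alpha=0.0)
W_s0 = W * s0[:, None]
```

The printout makes the weight-side cost visible: the row paired with the outlier activation channel grows by roughly an order of magnitude, which is the widening the 4-bit weight quantizer then has to cope with.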

	The old-axis grid below retains its per-knob story (refine reliable, whitening unstable at
	300-calib) but all its α=0.5 absolute numbers are superseded by no-smooth.

	### W4A4 ablation (2026-06-10, 3 cells) — the smoothing flip CONFIRMED; naive A4 not viable yet
	All plain+refine, 300-calib, new axis. ABITS=4 (per-token dynamic 4-bit acts — same scheme as
	A8, just 16 levels). Dirs `outputs/abl_c300_r{64,128}_w4a4_plain_refine_{nosmooth,a50}`.

	| cell | eval-loss | vel-relerr | wrecon mean | vs W4A8 twin |
	|----------------------------|-----------|------------|-------------|--------------|
	| r64 SMOOTH=0 (A8 recipe) | 0.5103 | 0.6582 | 0.0944 | 0.0297 → 17× worse |
	| r64 α=0.5 | 0.3885 | 0.5743 | 0.1111 | smoothing helps +24% |
	| r128 α=0.5 | 0.3060 | 0.5097 | 0.0992 | rank helps +21% |

	Findings:
	1. The smoothing flip is real and symmetric. At A8 smoothing hurt −9%; at A4 it helps +24%.
	Its value is purely a function of activation bit-width — at A4 the per-token 4-bit quant is
	destroyed by channel outliers (one outlier forces the whole token's scale → small channels
	round to 0), which is exactly what migration mitigates. Mechanism fully closed.
	2. Rank matters more at A4 than A8 (r64→r128: −21%): the low-rank branch runs on FULL-
	precision activations, so extra rank routes more computation around the 4-bit bottleneck —
	at A4 rank is an activation shield, not just a weight-outlier absorber.
	3. Naive W4A4 is NOT viable: best cell 0.3060 is ~10× the W4A8 best (0.0297). Visually:
	no-smooth A4 shatters in-image text ("THE OPEN PAGE" → "PINE OPEEN I AAGE"); α=0.5 restores
	readable text; r128 renders the storefront cleanly — but composition/anatomy stay broken
	(hand probe: two wrong-gesture hands). (No cross-axis comparison to the old surgery numbers.)
	4. Next lever (queued): per-group activation quant — give each group of ~64 channels its own
	dynamic scale (the weight side already does this; Atom/QServe-style). Attacks the outlier
	problem per-token/per-group with zero weight-side cost — expected to beat the whole α dial.

	### W4A4 α-up sweep (2026-06-10, 3 more cells) — A4 has an INTERIOR optimum at α≈0.75
	plain+refine, 300-calib, new axis. Dirs `outputs/abl_c300_r{64,128}_w4a4_plain_refine_a{75,100}`.

	| α | r64 eval-loss | r128 eval-loss | wrecon mean/max (r64 · r128) |
	|-------|---------------|----------------|-----------------------------------|
	| off | 0.5103 | — | 0.0944/0.1124 · — |
	| 0.5 | 0.3885 | 0.3060 | 0.1111/0.1486 · 0.0992/0.1344 |
	| 0.75 | 0.2819 | 0.2080 ★ W4A4 best | 0.1345/0.2585 · 0.1168/0.1960 |
	| 1.0 | — | 0.2397 | — · 0.1469/0.3423 |

	Findings:
	1. **Optimal α ≈ 0.75 at (per-token) A4.** The curve descends to 0.75 then turns up at 1.0: full
	flattening wrecks the weights (worst-layer 0.34) faster than it relieves the activations,
	even with r128+refine absorbing. Campaign symmetry: the optimal α tracks the bottleneck
	(A8 → off; per-token A4 → 0.75; nothing in between wins at either).
	2. W4A4 improved 0.5103 → 0.2080 (−59%) via smoothing+rank alone — still ~7× the W4A8 champion
	(0.0297). Visual at the best cell: in-image text nearly clean (one corrupted glyph), coherent
	compositions return, but counting/gesture still wrong. Not deployable yet.
	3. ⚠ Paper-spec correction (from re-reading SVDQuant): the paper's W4A4 uses **per-GROUP
	activations** (group 64, like its weights; NVFP4 = group 16) — our per-token activations
	reproduce their baselines (ViDiT-Q/MixDQ, which fail catastrophically exactly like our
	cells). So per-group acts isn't an enhancement, it's the missing piece of the actual
	SVDQuant W4A4 recipe → implemented below; step change confirmed.

	### ✅ W4A4 per-group activations (2026-06-10) — the fix; W4A4 becomes viable
	Implemented `a_group` (per-group dynamic act scales along channels; `AGROUP` env on `scripts/12`,
	recorded in `quant_config.json`). Unit test: one 60× outlier channel → per-token A4 rel-err 0.59,
	g64 0.11 (5×). 2×2 grid {r64,r128} × {SMOOTH=0, α=0.5}, AGROUP=64, plain+refine, 300-calib.
	Dirs `outputs/abl_c300_r{64,128}_w4a4g64_{nosmooth,a50}`.

	| W4A4 g64 acts | SMOOTH=0 | α=0.5 | (per-token, for scale) |
	|---------------|-----------|--------|-------------------------------|
	| r64 | 0.0742 | 0.0759 | 0.5103 (ns) / 0.3885 (a50) |
	| r128 | 0.0610 ★ W4A4 best | 0.0620 | — / 0.3060 (a50) |

	Findings:
	1. Per-group acts (g64) is THE W4A4 fix: −85% (0.5103 → 0.0742 at r64-ns) — far beyond
	everything the α/rank campaign bought combined. W4A4 best now 0.0610 (r128-ns), ~2× the
	W4A8 champion (0.0297) instead of 17×.
	2. With per-group acts, smoothing is dead weight again (a50 slightly worse at both ranks) —
	the outlier problem belongs to quantizer granularity, not weight-side migration, at every
	bit-width. The W4A4 recipe converges to the same clean form as W4A8:
	plain SVD + refine, NO smoothing, per-group W and A, rank to taste.
	3. This recipe is calibration-free (no smooth → absmax unused; no whiten → Gram unused;
	acts dynamic) — calib size/content is irrelevant to the current champions; it only matters
	if whitening returns or for future QAT. (NB the `jasperai/monet` calib set is NOT paintings —
	it's diverse photographic data, 260/400 captions mention text/signs — so content-narrowness
	was never a confound; "Monet" is just the dataset name.)
	4. Qualitative (24 montages reviewed): per-group fixes gross text destruction, anatomy collapse
	and composition smearing entirely; residual A4 damage = symbolic precision — single-glyph
	text errors ("PAGE"→"PAYE"/"PACE"), counting flicker (2 vs 3 eggs, seed-fragile, non-monotone
	in loss), slight identity drift. no-smooth cells are visually cleanest (hands track the metric).
	5. Act group-size ladder (unit test): per-token 0.60 → g128 0.17 → g64 0.13 → g32 0.10 → g16
	0.077 — g16 (NVFP4's native group) is the queued next knob (sim-only for INT4; deployable
	as NVFP4 on Blackwell).
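The per-token versus per-group effect can be reproduced with a toy fake-quantizer. This is a sketch on synthetic data, not the `a_group` implementation: the function names, the tensor shape and the single 60× outlier channel are assumptions, so the absolute numbers will not match the unit test quoted above, only the ordering.

```python
import numpy as np

def fake_quant_sym(x, bits, group=None):
    """Symmetric dynamic quantize-dequantize along channels.
    group=None uses one scale per token (row); otherwise one scale per `group` channels."""
    x = np.asarray(x, dtype=np.float64)
    tokens, channels = x.shape
    group = channels if group is None else group
    assert channels % group == 0, "channels must be a multiple of the group size"
    g = x.reshape(tokens, channels // group, group)
    qmax = 2 ** (bits - 1) - 1                     # 7 for 4-bit
    scale = np.abs(g).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)       # avoid division by zero on all-zero groups
    q = np.clip(np.round(g / scale), -qmax, qmax) * scale
    return q.reshape(tokens, channels)

def rel_err(approx, exact):
    return float(np.linalg.norm(approx - exact) / np.linalg.norm(exact))

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 256))                 # (tokens, channels), synthetic
x[:, 7] *= 60.0                                    # one 60x outlier channel

per_token_err = rel_err(fake_quant_sym(x, bits=4), x)
per_group_err = {g: rel_err(fake_quant_sym(x, bits=4, group=g), x) for g in (128, 64, 32, 16)}

print(f"per-token: {per_token_err:.3f}")
for g, err in per_group_err.items():
    print(f"g{g:<3d}     : {err:.3f}")
```

With one scale per token, the outlier sets the step size for all 256 channels and most ordinary channels round to zero. With groups, only the channels sharing the outlier's group pay that price, so the error falls as the group shrinks.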

	## ACTIVE TRACK — W4A8 SVDQuant (fake-quant quality study; A100-era grid below)

	Quantize all 100 block Linears: smooth → SVD rank-r low-rank (16-bit) → 4-bit residual +
	8-bit per-token activations. `smaller` = quantized-weight bytes vs bf16 (a real low-bit
	kernel realizes this; fake-quant here measures quality only — no wall/flop yet on A100).
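The `smaller` ratio can be estimated from storage cost per weight. The sketch assumes a 16-bit scale per group of 64 weights and a 16-bit low-rank branch; under that assumption rank 0 gives 16 / 4.25 ≈ 3.76×, which matches the rank-0 figure quoted in the mechanism-ablation section. The layer shape in the example is hypothetical, so the nonzero-rank ratios are illustrative rather than the grid's model-wide values.

```python
def bits_per_weight(out_features, in_features, rank,
                    w_bits=4, group=64, scale_bits=16, lowrank_bits=16):
    """Average storage bits per original weight for one Linear layer."""
    residual = w_bits + scale_bits / group                   # 4-bit codes + per-group scale
    lowrank_params = rank * (out_features + in_features)     # two factors: (out x r) and (r x in)
    lowrank = lowrank_bits * lowrank_params / (out_features * in_features)
    return residual + lowrank

def smaller_ratio(out_features, in_features, rank):
    """Compression vs bf16 (16 bits per weight)."""
    return 16.0 / bits_per_weight(out_features, in_features, rank)

# Hypothetical square layer; real klein-4B shapes differ.
for rank in (0, 16, 32, 64):
    print(f"rank {rank:3d}: {smaller_ratio(3072, 3072, rank):.2f}x smaller")
```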

	Three composable knobs (all W4A8, α=0.5, group=64, same fixed eval batch):
	- plain = SVD of smoothed weight (base SVDQuant paper's headline derivation)
	- whiten = activation-aware SVD minimizing OUTPUT error ‖X̂(Ŵ−L)‖ (ASVD/SVD-LLM idea; our add)
	- +refine = iterative low-rank refinement, re-fit L to absorb 4-bit quant error (SVDQuant §4.2)
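The `plain` and `+refine` knobs can be sketched as an alternating fit between the low-rank branch and the quantized residual. This is a simplified illustration on a synthetic weight, assuming per-group symmetric INT4 round-to-nearest for the residual and a keep-the-best rule across iterations; the function names are hypothetical and the actual recipe in `flux2distill/svdquant.py` may differ.

```python
import numpy as np

def quant_int4_group(w, group=64):
    """Per-group symmetric 4-bit round-to-nearest along the input axis (quantize-dequantize)."""
    rows, cols = w.shape
    g = w.reshape(rows, cols // group, group)
    scale = np.abs(g).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    return (np.clip(np.round(g / scale), -7, 7) * scale).reshape(rows, cols)

def truncated_svd(m, rank):
    """Best rank-`rank` approximation of m in Frobenius norm."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def rel_err(approx, exact):
    return float(np.linalg.norm(approx - exact) / np.linalg.norm(exact))

def svdquant_decompose(w, rank, refine_iters=0):
    """W ~ L + Q(W - L). refine_iters=0 is `plain`; >0 re-fits L to absorb quant error."""
    low = truncated_svd(w, rank)
    q = quant_int4_group(w - low)
    best = (rel_err(low + q, w), low, q)
    for _ in range(refine_iters):
        low = truncated_svd(w - q, rank)     # re-fit L to what the quantized residual misses
        q = quant_int4_group(w - low)        # re-quantize the new residual
        err = rel_err(low + q, w)
        if err < best[0]:
            best = (err, low, q)
    return best

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 256))
w[:, :4] *= 20.0                              # a few outlier input channels

plain_err, _, _ = svdquant_decompose(w, rank=16)
refine_err, low, q = svdquant_decompose(w, rank=16, refine_iters=10)
print(f"wrecon plain={plain_err:.4f}  plain+refine={refine_err:.4f}")
```

Note that this loop minimizes weight-reconstruction error only, which is the quantity the log later shows is not the same as the eval objective.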

	### FULL 4×3 GRID (2026-06-01) — every method × every rank, fixed 300-img calib
	Closes the old "L-shape, not a grid" gap: each cell built one-at-a-time on the SAME 300-image
	calib from `data/monet_cache` (latents), so all 12 are directly comparable. eval velocity-loss
	(lower=closer to teacher); `wrecon` = mean weight-recon rel-err. Dirs: `outputs/abl_c300_r{R}_{variant}`.

	| rank | smaller | plain | plain+refine | whiten | whiten+refine |
	|------|---------|-------|--------------|--------|-------------------|
	| 16 | 3.67× | 0.0620 | 0.0655 | 0.0656 | 0.0556 |
	| 32 | 3.59× | 0.0586 | 0.0574 | 0.0545 | 0.0476 |
	| 64 | 3.43× | 0.0487 | 0.0446 ← grid best | 0.0588 | 0.0451 |

	Best per rank: r16 whiten+refine 0.0556; r32 whiten+refine 0.0476; r64 plain+refine 0.0446
	(whiten+refine 0.0451 ≈ tie). Overall best = r64 plain+refine 0.0446 @ 3.43×.

	#### Full per-cell metrics (all 12, real measured)
	`eval-loss` = held-out velocity-matching loss · `vel-relerr` = velocity field rel-L2 vs teacher ·
	`wrecon` = mean weight-recon rel-err (100 layers) · `orecon` = mean output-recon rel-err (what
	whitening optimizes; only computed for the whitened cells).

	| rank | variant | smaller | eval-loss | vel-relerr | wrecon-relerr | orecon-relerr |
	|------|---------------|---------|-----------|------------|---------------|---------------|
	| 16 | plain | 3.67× | 0.0620 | 0.2235 | 0.1269 | — |
	| 16 | plain+refine | 3.67× | 0.0655 | 0.2297 | 0.1251 | — |
	| 16 | whiten | 3.67× | 0.0656 | 0.2299 | 0.1290 | 0.0733 |
	| 16 | whiten+refine | 3.67× | 0.0556 | 0.2117 | 0.1273 | 0.0680 |
	| 32 | plain | 3.59× | 0.0586 | 0.2174 | 0.1224 | — |
	| 32 | plain+refine | 3.59× | 0.0574 | 0.2151 | 0.1193 | — |
	| 32 | whiten | 3.59× | 0.0545 | 0.2095 | 0.1257 | 0.0719 |
	| 32 | whiten+refine | 3.59× | 0.0476 | 0.1959 | 0.1226 | 0.0646 |
	| 64 | plain | 3.43× | 0.0487 | 0.1980 | 0.1163 | — |
	| 64 | plain+refine | 3.43× | 0.0446 | 0.1896 | 0.1110 | — |
	| 64 | whiten | 3.43× | 0.0588 | 0.2177 | 0.1209 | 0.0695 |
	| 64 | whiten+refine | 3.43× | 0.0451 | 0.1907 | 0.1155 | 0.0595 |

	All three metrics track together (lower wrecon/orecon ↔ lower eval-loss) within a rank, with the
	notable exception that **plain+refine attains the lowest wrecon at each rank yet only the lowest
	eval-loss at r64** — confirming weight-recon ≠ the eval objective (refine minimizes weight error,
	which only aligns with the velocity loss once rank is high enough). Build/eval logs: `tmp/abl_r_.{build,eval}.log`.

	Findings (these OVERTURN the prior L-shape conclusion that "each upgrade compounds"):
	1. Refine is the reliable workhorse — it helps at every rank/metric EXCEPT r16-plain (0.0620→0.0655,
	where minimizing weight error overfits and drifts from the output-optimal point). Every rank's
	best variant uses refine.
	2. Whitening ALONE is unreliable at 300-calib — non-monotonic in rank: hurts r16 (0.0656>0.0620),
	helps r32 (0.0545<0.0586), hurts r64 (0.0588>0.0487). It's overfitting the noisy 300-image Gram;
	at high rank it fits more directions to bad stats and generalizes worse than plain.
	3. Strong whiten×refine interaction — refine runs IN the whitened metric, correcting whitening's
	overfitting. At r16 neither upgrade alone helps yet together −10% (0.0620→0.0556). At r32 they stack
	to the row best (0.0476). At r64 whitening adds nothing over plain+refine (0.0451 vs 0.0446).
	4. At high rank, skip whitening — r64 plain+refine (0.0446) beats/ties everything and needs no
	Gram (simpler + faster build). Whitening only earns its keep at moderate rank (r32) or paired w/ refine.
	5. Sweet spot is a choice: r64 plain+refine 0.0446 @ 3.43× (max quality, simplest) vs
	r32 whiten+refine 0.0476 @ 3.59× (a bit more compression). Both ~4–5× below the surgery frontier (0.231).

	These 300-calib numbers land close to the prior 100-calib report (r32 wr 0.0476 vs 0.0494; r64 wr 0.0451
	vs 0.0454) but the per-knob story is different — whitening's instability is the headline. Montages
	for all 12 cells (8 probe prompts each) under `outputs/abl_c300_*/eval/`.

	OPEN — whitening needs a higher-calib re-test. Its non-monotonic, often-harmful behavior at 300
	calib is consistent with Gram under-estimation. The deferred 2000-image calib re-sweep (plan.md §5)
	is a follow-up: does whitening become reliably beneficial with richer activation statistics?
	Full methodology + math (whitening, Cholesky→eigh, refinement) in `report/QUANT_REPORT.md`.
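As a rough illustration of the whitening idea, the sketch below computes the rank-r branch that minimizes output error instead of weight error, using an eigendecomposition of the activation Gram matrix. It is an assumption-laden sketch rather than the method in `report/QUANT_REPORT.md`: it uses the transposed convention (W − L)X with X of shape (in, samples), ignores the quantized residual, and all names and data are synthetic.

```python
import numpy as np

def truncated_svd(m, rank):
    """Best rank-`rank` approximation of m in Frobenius norm."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def whitened_lowrank(w, x, rank, eps=1e-6):
    """Rank-r L minimizing ||(W - L) X||_F, via the Gram matrix G = X X^T = S S^T."""
    gram = x @ x.T
    evals, evecs = np.linalg.eigh(gram)
    evals = np.clip(evals, eps * evals.max(), None)   # guard against a rank-deficient Gram
    s = evecs * np.sqrt(evals)                        # S = V diag(sqrt(lambda))
    s_inv = (evecs / np.sqrt(evals)).T                # S^-1 = diag(1/sqrt(lambda)) V^T
    # ||(W - L) X||_F = ||(W - L) S||_F, so truncate W S and map back through S^-1.
    return truncated_svd(w @ s, rank) @ s_inv

rng = np.random.default_rng(0)
w = rng.standard_normal((48, 32))                     # (out, in) synthetic weight
channel_scale = np.linspace(0.1, 5.0, 32)[:, None]    # strongly anisotropic activations
x = channel_scale * rng.standard_normal((32, 300))    # (in, samples)

rank = 8
plain = truncated_svd(w, rank)
white = whitened_lowrank(w, x, rank)

err_plain = np.linalg.norm((w - plain) @ x)
err_white = np.linalg.norm((w - white) @ x)
print(f"output error plain={err_plain:.2f}  whitened={err_white:.2f}")
```

On the calibration activations themselves the whitened branch can never do worse than plain SVD. The instability reported above concerns held-out data, where a Gram matrix estimated from few samples may not generalize.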

	### Mechanism ablation — what each piece buys (2026-06-01) · ⚠️ SmoothQuant HURTS at W4A8
	Stripping the pipeline down to isolate each mechanism (300-calib, same eval). `smaller` ≈ 3.76× for
	the rank-0 rows (no low-rank bytes).

	| config | eval-loss | vel-relerr | wrecon mean / max | note |
	|------------------------------------------|-----------|------------|-----------------------|------|
	| RTN W4A8 (no smooth, no SVD, s=1) | 0.0573 | 0.2149 | 0.1112 / 0.1504 | naive floor, yet beats the next two |
	| SmoothQuant W4A8 (rank-0, α=0.5, no SVD) | 0.0729 | 0.2424 | 0.1356 / 0.2633 | smoothing makes it WORSE |
	| + SVD rank-16 plain (α=0.5) | 0.0620 | 0.2235 | 0.1269 / 0.198 | grid `plain` r16 |
	| + SVD rank-64 plain | 0.0487 | 0.1980 | 0.1163 / 0.155 | grid `plain` r64 |
	| grid best: r64 plain+refine (α=0.5) | 0.0446 | 0.1896 | 0.1110 / — | best so far |

	Headline finding: SmoothQuant at α=0.5 is actively harmful for W4A8. Removing it (RTN, s=1) beats
	SmoothQuant rank-0 by −21% (0.0729→0.0573) AND beats the smoothed plain-SVD cells at r16/r32 (0.0620 / 0.0586). Mechanism:
	SmoothQuant migrates outliers out of activations into weights — a win only when activations are the
	hard part (low-bit A4). At W4A8 the 8-bit activations are already easy, so migration buys nothing
	there and widens the weight distribution, making the 4-bit weight quant harder (worst-layer wrecon
	0.15→0.26). **Implication: the entire α=0.5 grid is mis-tuned — the SVD branch was partly
	compensating for smoothing damage. Re-running with no-smooth / low-α should beat 0.0446.** Next: run the
	best config with `SMOOTH=0`, then an α sweep {0, 0.25, 0.5}. Knob: `SMOOTH=0` env on `scripts/12`
	(`s=1`). Montages: `outputs/abl_c300_r0_nosvd{,_nosmooth}/eval/`.
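The mechanism (migration widens the weight distribution, so 4-bit weight quantization gets harder) can be illustrated on synthetic data. The sketch assumes α=0.5, group-64 symmetric INT4 round-to-nearest along the input axis, and one outlier activation channel per group; all names and magnitudes are made up for the demonstration.

```python
import numpy as np

def quant_int4_group(w, group=64):
    """Per-group symmetric 4-bit round-to-nearest along the input axis (quantize-dequantize)."""
    rows, cols = w.shape
    g = w.reshape(rows, cols // group, group)
    scale = np.abs(g).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    return (np.clip(np.round(g / scale), -7, 7) * scale).reshape(rows, cols)

def rel_err(approx, exact):
    return float(np.linalg.norm(approx - exact) / np.linalg.norm(exact))

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 256))          # (out, in) synthetic weight

# Synthetic activation statistics: mostly mild channels, one 60x outlier per group of 64.
x_absmax = np.full(256, 3.0)
x_absmax[[5, 70, 130, 200]] *= 60.0

w_absmax = np.abs(w).max(axis=0)             # per input channel
s = np.sqrt(x_absmax / w_absmax)             # SmoothQuant factor at alpha = 0.5
w_smooth = w * s[None, :]                    # weights absorb the activation outliers

wrecon_plain = rel_err(quant_int4_group(w), w)
wrecon_smooth = rel_err(quant_int4_group(w_smooth), w_smooth)
print(f"wrecon RTN (s=1)      : {wrecon_plain:.3f}")
print(f"wrecon smoothed a=0.5 : {wrecon_smooth:.3f}")
```

Each boosted column raises its group's quantization step, so the ordinary weights sharing that group are rounded more coarsely.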

	## SHELVED TRACK — block surgery (depth-prune single blocks → surrogates → distill)

	| config | params | smaller (param cut) | wall | flop | eval-loss | status |
	|------------------------------------------|---------|---------|-------|-------|-----------|--------|
	| teacher 4B (baseline) | 3.876B | — | 1.00x | 1.00x | — | reference |
	| v1 per-token drop-12 (SVD-energy) | 2.441B | 37% | 1.45x | 1.64x | — | COLLAPSED (non-functional) |
	| per-token drop-6 (importance) | 3.158B | 19% | 1.19x | 1.24x | 0.308 | ok, soft |
	| linattn drop-6 (simple elu+1) | 3.177B | 18% | 1.15x | 1.23x | 0.253 | ok |
	| linattn drop-6 +RoPE+conv+warmstart | 3.177B | 18% | 1.15x | 1.23x | 0.231 | BEST QUALITY |
	| linattn drop-8 +focused+FFN(all) | 2.995B | 23% | 1.20x | 1.28x | 0.269 | best colors/local |
	| linattn drop-10 mixed (4 FFN+6 light) | 2.737B | 29% | 1.26x | 1.42x | 0.322 | KILLED ~step200 |
