Mercity/FluxDistill / RESULTS.md

Pranav2748

15 days ago


Results — FLUX.2 klein 4B -> compressed student

Eval = held-out velocity-matching loss vs teacher (lower = closer; the same fixed first-16 batch is used for every row, so quant and surgery sit on ONE axis). wall = speedup ratio derived from measured s/img @512/4-step batch-1 on A100; flop = estimated transformer-only ratio.
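For orientation, the two velocity metrics used throughout could be computed roughly as below. This is a minimal sketch that assumes eval-loss is a plain mean squared error and vel-relerr a relative L2 norm over flattened velocity predictions; the actual reduction in the eval script may differ (the tabulated eval-loss is not simply vel-relerr squared), and the function names and toy values are hypothetical.

```python
import math

def velocity_mse(student, teacher):
    """Mean squared error between student and teacher velocity predictions."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)

def velocity_rel_l2(student, teacher):
    """Relative L2 error ||student - teacher|| / ||teacher||."""
    num = math.sqrt(sum((s - t) ** 2 for s, t in zip(student, teacher)))
    den = math.sqrt(sum(t * t for t in teacher))
    return num / den

# Toy flattened velocity predictions (illustrative values, not from the eval batch).
teacher_v = [1.0, -2.0, 0.5, 3.0]
student_v = [1.1, -1.8, 0.5, 2.7]

print(f"eval-loss  = {velocity_mse(student_v, teacher_v):.4f}")
print(f"vel-relerr = {velocity_rel_l2(student_v, teacher_v):.4f}")
```

Both metrics are zero when the student reproduces the teacher exactly, which is why lower means closer on every table in this file.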

★★ 2026-06-14 — NVFP4 HEAD-TO-HEAD (image-space metrics, N=512) — separate axis

A matched, paired comparison on N=512 MJHQ-30k prompts, 512px, 4 steps, guidance 1.0, seed=idx, using image-space fidelity-to-teacher metrics (LPIPS/PSNR/FID vs the teacher's own outputs) + a PickScore/CLIP semantic check. This is NOT the velocity-loss axis above — do not compare the numbers across sections; compare only within this table. All models share the same Qwen3 TE + VAE (only the transformer quant varies). Full write-up: report/HEADTOHEAD_klein4b_nvfp4.md; raw numbers outputs/eval/h2h/metrics.json; speed outputs/nvfp4/benchmark_headtohead.json.

model	bits	PickScore↑	CLIP↑	LPIPS↓	PSNR↑	FID↓(teacher)	FID(real)	real-kernel speed
A teacher (bf16)	16	21.64	30.95	—	—	—	89.6	1.0× (0.464 s/img @512)
D plain NVFP4 r0	W4A4	21.62	30.95	0.2076	17.44	39.31	88.5	— (fake-q)
ours r128 (fake-q)	W4A4	21.61	30.99	0.1732	18.50	33.37	89.6	— (fake-q)
C ours r128 (REAL Nunchaku kernel)	W4A4	21.62	30.94	0.1668	18.71	33.54	90.1	1.76×@512 / 1.87×@1024, 12.6 GB
E BFL official FP8	W8A8	21.65	30.94	0.0798	23.02	18.81	89.6	— (needs TensorRT)
Δ low-rank branch (r0→r128)		≈0	≈0	−19.7%	+1.27 dB	−14.7%	~flat	—

Findings:

(1) The SVDQuant low-rank branch helps at NVFP4 W4A4. r0→r128 (D vs C): LPIPS −19.7%, PSNR +1.27 dB, FID-vs-teacher −14.7%; the fake-q-vs-fake-q ablation agrees (−16.6% / +1.06 dB / −15.1%). This reproduces the prior N=256 result at N=512.
(2) The real Nunchaku FP4 kernel reproduces, and slightly beats, the fake-quant (LPIPS 0.167 vs 0.173), so the gain holds on the deployed model.
(3) No semantic loss: PickScore/CLIP are flat across all models, including the teacher.
(4) BFL official FP8 is closest to the teacher (8-bit; the high-precision/low-speedup point) vs our 4-bit/2.5×-kernel point.
(5) FID-vs-real is ~flat (~88–90 incl. teacher). It tracks the klein-vs-MJHQ style gap, not quant; reported, not discriminating.
(6) BFL official NVFP4 (model B) could not be run: it uses a cutlass tensor-core swizzled layout and needs BFL's TensorRT runtime (see report/HEADTOHEAD_klein4b_nvfp4.md §5). D (our plain r0) is the labeled, controlled "plain NVFP4" stand-in; E is the real BFL baseline. No BFL number was fabricated.
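The Δ row and the fake-q ablation figures follow directly from the table rows. A small sketch of that arithmetic (the helper names are hypothetical; the numbers are copied from rows D, C and the fake-q r128 row):

```python
def pct_change(new, old):
    """Relative change of `new` against `old`, in percent."""
    return 100.0 * (new - old) / old

# Rows from the head-to-head table above.
plain_r0 = {"lpips": 0.2076, "psnr": 17.44, "fid_teacher": 39.31}   # D plain NVFP4 r0
real_r128 = {"lpips": 0.1668, "psnr": 18.71, "fid_teacher": 33.54}  # C real Nunchaku kernel
fake_r128 = {"lpips": 0.1732, "psnr": 18.50, "fid_teacher": 33.37}  # ours r128 (fake-q)

def deltas(branch, base):
    """LPIPS and FID as percent change, PSNR as an absolute dB difference."""
    return {
        "lpips_pct": round(pct_change(branch["lpips"], base["lpips"]), 1),
        "psnr_db": round(branch["psnr"] - base["psnr"], 2),
        "fid_pct": round(pct_change(branch["fid_teacher"], base["fid_teacher"]), 1),
    }

print("real kernel vs r0:", deltas(real_r128, plain_r0))
print("fake-q      vs r0:", deltas(fake_r128, plain_r0))
```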

★ 2026-06-13 — NVFP4 (Blackwell-native FP4) + first REAL kernel speed (Nunchaku)

Two things landed this day: (1) NVFP4 added to our fake-quant SVDQuant (flux2distill/svdquant.py) and swept; (2) the first real low-bit kernel speed numbers, by calling Nunchaku's compiled NVFP4 W4A4 GEMM directly. NVFP4 beats INT4 on BOTH quality and speed — it's the format for this box (and any Blackwell / RTX 50 / B200). Same eval axis as the 2026-06-10 cells below.

NVFP4 format: E2M1 elements (the 8 magnitudes {0,.5,1,1.5,2,3,4,6}·sign) + group-16 blocks + FP8(E4M3) block scales, applied to the 4-bit residual weights (and optionally activations). The low-rank branch stays bf16 — it is the high-precision error-/outlier-absorbing path. Knobs: WFMT={int,nvfp4}, AFMT={int,nvfp4,fp8} on scripts/12 (driver scripts/run_nvfp4_cell.sh).
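A minimal fake-quant sketch of the format just described, under simplifying assumptions: the per-block scale is kept as a Python float (real NVFP4 stores it as FP8 E4M3, which adds a second rounding step), ties snap to the smaller magnitude, and `fake_quant_nvfp4` is a hypothetical name, not the function in flux2distill/svdquant.py.

```python
import math

# The 8 E2M1 magnitudes; the sign is handled separately.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def quantize_e2m1(x):
    """Snap an already-scaled value to the nearest signed E2M1 magnitude."""
    mag = min(E2M1_GRID, key=lambda g: abs(abs(x) - g))
    return math.copysign(mag, x)

def fake_quant_nvfp4(values, group=16):
    """Quantize-dequantize a flat list with one absmax scale per block of `group`."""
    out = []
    for start in range(0, len(values), group):
        block = values[start:start + group]
        amax = max(abs(v) for v in block)
        # Map the block's largest magnitude onto 6.0, the top of the E2M1 grid.
        # Assumption: scale stays in full precision here, not FP8 E4M3.
        scale = amax / 6.0 if amax > 0 else 1.0
        out.extend(quantize_e2m1(v / scale) * scale for v in block)
    return out

# Two blocks of 16: the first has a large value, the second is uniformly small.
values = [6.0] + [0.1] * 15 + [0.6] + [0.1] * 15
dequant = fake_quant_nvfp4(values)
print(dequant[1], dequant[17])
```

In the first block the 0.1 entries collapse to zero because the scale is set by the 6.0 entry; in the second block the same 0.1 survives because that block has its own scale. This is the granularity effect the group-16 layout is meant to exploit.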

Quality — NVFP4 sweep (klein-4B, plain+refine, no-smooth, same held-out velocity loss)

#	weights	acts	rank	wrecon mean	eval-loss	vs reference
1	NVFP4 g16	NVFP4 g16	32	0.0865	0.0390	—
2	NVFP4 g16	NVFP4 g16	64	0.0817	0.0364	INT4 W4A4 r64 = 0.0742 (INT4 is 2.0× worse)
3	NVFP4 g16	NVFP4 g16	128	0.0742	0.0303	INT4 W4A4 r128 = 0.0610 (INT4 is 2.0× worse); ≈ INT4 W4A8 champ 0.0297
4	NVFP4 g16	FP8 E4M3	64	0.0817	0.0204	INT4 W4A8 r64 = 0.0297 (prior overall best)
5	NVFP4 g16	FP8 E4M3	128	0.0742	0.0169 ★	−43% vs the 0.0297 INT4 W4A8 champion

NEW OVERALL QUANT CHAMPION: NVFP4-weights + FP8-acts, r128 = 0.0169. Dirs outputs/nvfp4_*.

Findings: (1) NVFP4 weights ≫ INT4 weights — ~2× lower loss at matched rank; driver is the finer group-16 + E2M1 float grid (unit test: outlier column 0.064 vs INT4-g64 0.115). (2) NVFP4 W4A4 r128 (0.0303) matches the INT4 W4A8 champion (0.0297) while keeping activations at 4-bit — full W4A4 at 8-bit-act quality. (3) FP8 acts buy more quality (0.0169) but cost speed (below). Visually the champions are teacher-indistinguishable on the text + hand probes (the quant-sensitive ones).

Speed — REAL kernels, klein-4B layer shapes, T=1536, RTX PRO 4500 Blackwell (sm_120)

scripts/23_nvfp4_kernel_bench.py calls Nunchaku's compiled svdq_gemm_w4a4_cuda (NVFP4 W4A4) at each of klein-4B's 5 distinct Linear shapes, summed over all 100 Linears. FP8 row = torch._scaled_mm (NOT Nunchaku — a proxy for the W4+FP8-act variant, which has no fused kernel on Blackwell).

path	ms/step	speedup vs bf16	deploys
bf16	73.8	1.00×	baseline
NVFP4 W4A4 r64 (real Nunchaku kernel)	26.8	2.75×	cell 2 (0.0364)
NVFP4 W4A4 r128 (real Nunchaku kernel)	29.7	2.49×	cell 3 (0.0303)
FP8 proxy r64	60.3	1.22×	cell 4 (0.0204)
FP8 proxy r128	60.8	1.21×	cell 5 (0.0169)

Matches the 9B end-to-end Nunchaku number (FP4 254 ms/step = 2.69× vs bf16 684; full pipeline 1.29 s/img @1024²/4-step, 24.95 GB). Rank tax (real W4A4 kernel): r64→r128 ≈ 11% (2.75×→2.49×) for the 0.0364→0.0303 quality gain — rank is a quality knob with a small, real speed cost.

End-to-end (real Nunchaku kernel, klein-4B Linears swapped to NVFP4 W4A4, rest bf16): bf16 → NVFP4 is 1.24× @512px / 1.18× @1024px with ~28% less VRAM (16.9→12.0 GB @512). Far below the 2.5× per-layer GEMM because attention is bf16 + O(N²) and dominates (more so at 1024px), VAE + text-encode are fixed bf16 overhead, and a Linears-only swap is unfused. The 9B Nunchaku FULL pipeline hit 2.69× precisely because it ALSO fuses attention/quant and is more GEMM-heavy — so the lever for a real 4B speedup is the fully-fused NunchakuFlux2 model (fused attention), not just quantized Linears. Rank tax is negligible end-to-end here (low-rank branch tiny vs the rest). scripts/24_nunchaku_e2e_speed.py.
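The gap between the 2.5× per-layer GEMM and the 1.2× end-to-end figure is what Amdahl's law predicts when only part of the runtime is accelerated. The sketch below inverts that law to estimate which share of bf16 runtime the swapped Linears would have to represent. It is a back-of-the-envelope illustration, not a measurement: it assumes a uniform 2.5× on the accelerated share and ignores the unfused quantize/dequantize overhead mentioned above.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of runtime is accelerated by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

def implied_fraction(overall_speedup, local_speedup):
    """Invert Amdahl's law: the runtime share that must have been accelerated."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / local_speedup)

GEMM_SPEEDUP = 2.5  # per-layer NVFP4 W4A4 figure quoted above

frac_512 = implied_fraction(1.24, GEMM_SPEEDUP)    # end-to-end 1.24x @512px
frac_1024 = implied_fraction(1.18, GEMM_SPEEDUP)   # end-to-end 1.18x @1024px

print(f"implied Linear share @512:  {frac_512:.1%}")
print(f"implied Linear share @1024: {frac_1024:.1%}")
# Ceiling if the Linears cost nothing at all while everything else stays bf16.
print(f"Linears-only ceiling @512:  {1.0 / (1.0 - frac_512):.2f}x")
```

Under these assumptions the implied Linear share shrinks as resolution grows, consistent with attention dominating at 1024px.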

★ FULLY-FUSED end-to-end (the real lever): converting klein-4B to the NunchakuFlux2Transformer2DModel fused path (fused_qkv_norm_rottary + attention_fp16 + W4A4 GEMM, no bf16 SDPA) gives 1.76× @512px and 1.87× @1024px (per-step 2.2×/2.29×, VRAM −26%: 18.6→13.7 GB @1024). This is the real number: fusing attention lifts the Linears-only 1.2× to ~1.9×, and unlike the Linears-only swap it improves with resolution. The transformer step (2.3×) ≈ the per-layer GEMM (2.5×); end-to-end is capped by the 4B's VAE+text-encode (~30% of its small total). scripts/25_fused_4b_speed.py (dummy weights, since speed is value-independent; built from source via /workspace/build_nunchaku). Deployable single artifact (fast + correct weights): ✅ DONE.

★ DEPLOYABLE NVFP4-fused klein-4B (correct weights + real kernel). Wrote our own NVFP4 weight exporter (flux2distill/nunchaku_export.py): per-Linear SVD low-rank + iterative refine → NVFP4 residual (E2M1 codes + per-group-16 FP8 wscales + per-tensor alpha, wcscales=1), packed into Nunchaku's MMA layout via their pack_weight/pack_micro_scale/pack_lowrank_weight. Convention validated (scripts/26): with pre-quantized acts the real kernel reconstructs our intended weight to 2.99%. scripts/27_convert_full_4b.py converts all 120 fused Linears (handling the qkv fusion + the single-block to_qkv_mlp_proj/to_out splits), loads them into NunchakuFlux2Transformer2DModel, and generates correct images on the real FP4 kernel — teacher-indistinguishable on the text ("THE OPEN PAGE" legible) and hand probes (montages outputs/nvfp4/deploy/). This is the full deployable model: NVFP4 quality (≈0.0303) + the fused ~1.9× speed + −26% VRAM in one artifact. (NB the fused packed-rotary path requires batch=1 generation.)

The quality↔speed fork (both deployable on Blackwell, pick one):

Speed: NVFP4 W4A4 r128 → 0.0303 @ 2.49× (real Nunchaku kernel, loads today). r64 → 0.0364 @ 2.75×.
Quality: NVFP4-W + FP8-A r128 → 0.0169, but FP8 compute caps at ~2× bf16 (FP8 tensor cores are half FP4 throughput) and has no fused kernel (1.2× measurable today). The tradeoff is physical.

Why INT4 is the wrong format on Blackwell (hardware, sourced): the 5th-gen tensor cores natively do FP4/FP6/FP8/INT8/BF16/… but dropped INT4 (Turing/Ampere/Ada had INT4 IMMA; sm_120 doesn't). Nunchaku ships INT4 for Turing/Ampere/Ada and NVFP4 for Blackwell; get_precision() returns fp4 here. Forcing svdq-int4 on this card → 1677 ms/step (slower than bf16). So INT4 W4A4 stays the deployable format for the huge RTX 20/30/40 base; NVFP4 is for Blackwell/B200 — complementary, one per generation.

⚠️ 2026-06-10 — box rebuilt AGAIN; eval axis SHIFTED; old numbers not comparable

The box was rebuilt (A100-80GB → RTX PRO 4500 Blackwell 32 GB, system python, torch 2.12+cu130, transformers 5.9→5.10.2, diffusers git Jun-1→Jun-10). Re-evaluating the UNCHANGED grid-best checkpoint (r64 plain+refine, zero missing/unexpected keys on load) gives 0.0325 (vel-relerr 0.1661) vs the recorded 0.0446 (0.1896) — a −27% instrument shift, NOT a model change. Cause: the 16-sample eval ran through different SDPA kernels, different Qwen3 prompt embeddings (transformers bump), and possibly different seeded σ/noise draws; with N=16 that easily moves the mean. Rule: numbers are only comparable WITHIN one box era. The 4×3 grid + mechanism-ablation tables below are the OLD axis (still internally consistent); every 2026-06-10+ experiment is compared against same-box re-evals. Re-anchored baseline: r64 plain+refine (α=0.5, smoothed) = 0.0325 (outputs/recheck_r64_plain_refine, montages in its eval/).

Montage read of that baseline on the new box (teacher|quant, 8 probes): texture/scene/large-text probes (storefront, lake, neon, spiderweb, fisherman) ≈ indistinguishable; the quant misses the counting probe (2 fried eggs vs the teacher's correct 3) and mangles the hand probe (folded/merged fingers); chalkboard small-print is gibberish in BOTH (teacher limitation). Counting + hand are the sensitive probes to watch across the SMOOTH=0 ablation.

✅ SMOOTH=0 ablation (2026-06-10, new box) — CONFIRMED: drop SmoothQuant at W4A8

The #1 queued experiment, run at 3 ranks (plain+refine, 300-calib, all new-axis; each SMOOTH=0 build compared against a same-box re-eval of its smoothed α=0.5 twin). Dirs: outputs/abl_c300_r{16,32,64}_plain_refine_nosmooth vs outputs/recheck_r{16,32,64}_plain_refine.

rank	smoothed α=0.5 (re-eval)	SMOOTH=0	Δ eval-loss	wrecon mean (sm → ns)	ns wrecon max
16	0.0405 / rel 0.1855	0.0348 / 0.1719	−14.1%	0.1251 → 0.1041	0.1379
32	0.0362 / rel 0.1753	0.0331 / 0.1675	−8.6%	0.1193 → 0.1003	0.1286
64	0.0325 / rel 0.1661	0.0297 / 0.1588	−8.6%	0.1110 → 0.0944	0.1124

Findings:

SMOOTH=0 wins at every rank — new overall best: r64 plain+refine no-smooth = 0.0297. SMOOTH=0 is now the DEFAULT for all W4A8 builds.
The win grows as rank shrinks (−8.6% at r64/r32 → −14.1% at r16): the SVD branch was partly compensating smoothing damage, and with less capacity more damage shows through. Confirms the mechanism from the rank-0 ablation (smoothing widens the 4-bit weight spread).
No-smooth buys ~one rank tier for free: ns-r16 (0.0348) beats smoothed-r32 (0.0362); ns-r32 (0.0331) ≈ smoothed-r64 (0.0325). I.e. the same quality at roughly 2–5% more compression (3.59×→3.67× and 3.43×→3.59×, by the `smaller` column of the 4×3 grid).
Visual (8-probe montages): the hand probe — the most quant-sensitive — is visibly FIXED vs the smoothed baseline (plausible spread fingers at all 3 ranks vs mangled/merged at the smoothed r64 baseline). The counting probe (3 eggs) is still missed by EVERY quant (2 eggs), smoothed or not, all ranks — a rank/smooth-independent semantic drift. Large in-image text is always preserved.
wrecon improves 15–17% at every rank (0.1251→0.1041, 0.1193→0.1003, 0.1110→0.0944) and the worst layer drops to 0.11–0.14 (vs 0.15–0.26 smoothed), exactly the predicted mechanism.

✅ α sweep (2026-06-10) — the dial has NO good setting; low α is the WORST

α ∈ {off, 0.1, 0.25, 0.5} × r{32,64}, plain+refine, 300-calib, new axis (dirs outputs/abl_c300_r{32,64}_plain_refine_a{10,25}; off/0.5 cells from above):

α	r64 eval-loss	r32 eval-loss	r64 wrecon	r32 wrecon
off	0.0297	0.0331	0.0944	0.1003
0.1	0.0380 ⚠ worst	0.0408 ⚠ worst	0.0999	0.1057
0.25	0.0317	0.0349	0.1007	0.1069
0.5	0.0325	0.0362	0.1110	0.1193

Identical ordering at both ranks (off < 0.25 < 0.5 ≪ 0.1) — a replicated U-shape:

No α beats off — every dose of migration hurts at W4A8; SMOOTH=0 is the permanent default.
α=0.1 is the worst point, not the safest: at low α the factor ≈ 1/max|W|^(1-α) (the weight-equalizing extreme). wrecon barely moves (0.0999 at α=0.1 vs 0.1007 at α=0.25, r64) yet eval-loss jumps +20% (0.0317 → 0.0380): the rescale rotates residual quant error into output-relevant directions. The sharpest demonstration yet that weight-recon ≠ the eval objective; never tune α by wrecon.
Montages agree: the hand probe (cleanly rendered at SMOOTH=0) regresses to merged fingers at α=0.1. Counting (3 eggs) fails everywhere, α-independent.
Determinism note: a repeat eval of an unchanged cell reproduces its loss exactly (0.0331 → 0.0331), and rebuilt cells reproduce wrecon bit-for-bit — cell deltas on this box are real, not run-to-run noise.
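To make the α dial concrete, the sketch below uses the standard SmoothQuant per-channel factor s_j = max|X_j|^α / max|W_j|^(1−α), with activations divided by s and weights multiplied by s. Whether scripts/12 implements exactly this form is an assumption; the per-channel statistics are toy values, and "spread" (max/min of per-channel absmax) is only a crude proxy for quantization difficulty.

```python
def smooth_scales(act_absmax, weight_absmax, alpha):
    """Per-input-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    return [a ** alpha / w ** (1.0 - alpha) for a, w in zip(act_absmax, weight_absmax)]

def spread(values):
    """Ratio of largest to smallest per-channel magnitude (1.0 = fully equalized)."""
    return max(values) / min(values)

# Toy per-channel statistics (hypothetical): channel 0 is an activation outlier.
act_absmax = [60.0, 1.0, 0.5, 2.0]
weight_absmax = [0.2, 0.4, 0.1, 0.3]

def smoothed_stats(alpha):
    """Return (activation spread, weight spread) after smoothing with `alpha`."""
    s = smooth_scales(act_absmax, weight_absmax, alpha)
    acts = [a / sj for a, sj in zip(act_absmax, s)]         # X' = X / s
    weights = [w * sj for w, sj in zip(weight_absmax, s)]   # W' = W * s
    return spread(acts), spread(weights)

print(f"unsmoothed: act spread {spread(act_absmax):.1f}, weight spread {spread(weight_absmax):.1f}")
for alpha in (0.0, 0.5, 1.0):
    act_spread, weight_spread = smoothed_stats(alpha)
    print(f"alpha={alpha}: act spread {act_spread:.1f}, weight spread {weight_spread:.1f}")
```

In this toy case α=0.5 narrows the activation spread but widens the weight spread relative to the unsmoothed weights, which is the trade the W4A8 results say is not worth making. At α=0 the weights are fully equalized while the activations get worse, matching the "weight-equalizing extreme" description.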

The old-axis grid below retains its per-knob story (refine reliable, whitening unstable at 300-calib) but all its α=0.5 absolute numbers are superseded by no-smooth.

W4A4 ablation (2026-06-10, 3 cells) — the smoothing flip CONFIRMED; naive A4 not viable yet

All plain+refine, 300-calib, new axis. ABITS=4 (per-token dynamic 4-bit acts — same scheme as A8, just 16 levels). Dirs outputs/abl_c300_r{64,128}_w4a4_plain_refine_{nosmooth,a50}.

cell	eval-loss	vel-relerr	wrecon mean	vs W4A8 twin
r64 SMOOTH=0 (A8 recipe)	0.5103	0.6582	0.0944	0.0297 → 17× worse
r64 α=0.5	0.3885	0.5743	0.1111	smoothing helps +24%
r128 α=0.5	0.3060	0.5097	0.0992	rank helps +21%

Findings:

The smoothing flip is real and symmetric. At A8 smoothing hurt −9%; at A4 it helps +24%. Its value is purely a function of activation bit-width — at A4 the per-token 4-bit quant is destroyed by channel outliers (one outlier forces the whole token's scale → small channels round to 0), which is exactly what migration mitigates. Mechanism fully closed.
Rank matters more at A4 than A8 (r64→r128: −21%): the low-rank branch runs on FULL-precision activations, so extra rank routes more computation around the 4-bit bottleneck. At A4, rank is an activation shield, not just a weight-outlier absorber.
Naive W4A4 is NOT viable: best cell 0.3060 is ~10× the W4A8 best (0.0297). Visually: no-smooth A4 shatters in-image text ("THE OPEN PAGE" → "PINE OPEEN I AAGE"); α=0.5 restores readable text; r128 renders the storefront cleanly — but composition/anatomy stay broken (hand probe: two wrong-gesture hands). (No cross-axis comparison to the old surgery numbers.)
Next lever (queued): per-group activation quant — give each group of ~64 channels its own dynamic scale (the weight side already does this; Atom/QServe-style). Attacks the outlier problem per-token/per-group with zero weight-side cost — expected to beat the whole α dial.

W4A4 α-up sweep (2026-06-10, 3 more cells) — A4 has an INTERIOR optimum at α≈0.75

plain+refine, 300-calib, new axis. Dirs outputs/abl_c300_r{64,128}_w4a4_plain_refine_a{75,100}.

α	r64 eval-loss	r128 eval-loss	wrecon mean/max (r64 · r128)
off	0.5103	—	0.0944/0.1124 · —
0.5	0.3885	0.3060	0.1111/0.1486 · 0.0992/0.1344
0.75	0.2819	0.2080 ★ W4A4 best	0.1345/0.2585 · 0.1168/0.1960
1.0	—	0.2397	— · 0.1469/0.3423

Findings:

Optimal α ≈ 0.75 at (per-token) A4: the curve descends to 0.75 then turns at 1.0, because full flattening wrecks the weights (worst-layer 0.34) faster than it relieves the activations, even with r128+refine absorbing. Campaign symmetry: the optimal α tracks the bottleneck (A8 → off; per-token A4 → 0.75; nothing in between wins at either).
W4A4 improved 0.5103 → 0.2080 (−59%) via smoothing+rank alone — still ~7× the W4A8 champion (0.0297). Visual at the best cell: in-image text nearly clean (one corrupted glyph), coherent compositions return, but counting/gesture still wrong. Not deployable yet.
⚠ Paper-spec correction (from re-reading SVDQuant): the paper's W4A4 uses per-GROUP activations (group 64, like its weights; NVFP4 = group 16). Our per-token activations reproduce its baselines (ViDiT-Q/MixDQ, which fail catastrophically exactly like our cells). So per-group acts are not an enhancement; they are the missing piece of the actual SVDQuant W4A4 recipe → implemented below; step change confirmed.

✅ W4A4 per-group activations (2026-06-10) — the fix; W4A4 becomes viable

Implemented a_group (per-group dynamic act scales along channels; AGROUP env on scripts/12, recorded in quant_config.json). Unit test: one 60× outlier channel → per-token A4 rel-err 0.59, g64 0.11 (5×). 2×2 grid {r64,r128} × {SMOOTH=0, α=0.5}, AGROUP=64, plain+refine, 300-calib. Dirs outputs/abl_c300_r{64,128}_w4a4g64_{nosmooth,a50}.

W4A4 g64 acts	SMOOTH=0	α=0.5	(per-token, for scale)
r64	0.0742	0.0759	0.5103 (ns) / 0.3885 (a50)
r128	0.0610 ★ W4A4 best	0.0620	— / 0.3060 (a50)

Findings:

Per-group acts (g64) is THE W4A4 fix: −85% (0.5103 → 0.0742 at r64-ns) — far beyond everything the α/rank campaign bought combined. W4A4 best now 0.0610 (r128-ns), ~2× the W4A8 champion (0.0297) instead of 17×.
With per-group acts, smoothing is dead weight again (a50 slightly worse at both ranks) — the outlier problem belongs to quantizer granularity, not weight-side migration, at every bit-width. The W4A4 recipe converges to the same clean form as W4A8: plain SVD + refine, NO smoothing, per-group W and A, rank to taste.
This recipe is calibration-free (no smooth → absmax unused; no whiten → Gram unused; acts dynamic) — calib size/content is irrelevant to the current champions; it only matters if whitening returns or for future QAT. (NB the jasperai/monet calib set is NOT paintings — it's diverse photographic data, 260/400 captions mention text/signs — so content-narrowness was never a confound; "Monet" is just the dataset name.)
Qualitative (24 montages reviewed): per-group fixes gross text destruction, anatomy collapse and composition smearing entirely; residual A4 damage = symbolic precision — single-glyph text errors ("PAGE"→"PAYE"/"PACE"), counting flicker (2 vs 3 eggs, seed-fragile, non-monotone in loss), slight identity drift. no-smooth cells are visually cleanest (hands track the metric).
Act group-size ladder (unit test): per-token 0.60 → g128 0.17 → g64 0.13 → g32 0.10 → g16 0.077 — g16 (NVFP4's native group) is the queued next knob (sim-only for INT4; deployable as NVFP4 on Blackwell).
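The per-token vs per-group gap can be illustrated with a toy token that has a single 60× outlier channel. This is a sketch in the spirit of the unit tests cited above, not a reproduction of them: the token values, the symmetric ±7 integer grid and the function names are assumptions, and because the relative error here includes the outlier in its norm, the absolute numbers will not match the 0.59 / 0.11 figures.

```python
import math

def fake_quant_int4(values, group):
    """Symmetric dynamic 4-bit quant-dequant, one absmax scale per `group` channels."""
    out = []
    for start in range(0, len(values), group):
        block = values[start:start + group]
        amax = max(abs(v) for v in block)
        scale = amax / 7.0 if amax > 0 else 1.0
        out.extend(max(-7, min(7, round(v / scale))) * scale for v in block)
    return out

def rel_err(x, x_hat):
    """Relative L2 error ||x - x_hat|| / ||x||."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
    return num / math.sqrt(sum(a * a for a in x))

# One token, 128 channels in [-1, 1], plus one outlier channel at 60.
token = [((i * 37) % 101) / 50.0 - 1.0 for i in range(128)]
token[5] = 60.0

# group=128 covers the whole token, i.e. per-token quantization.
errors = {g: rel_err(token, fake_quant_int4(token, g)) for g in (128, 64, 16)}
for g, e in errors.items():
    label = "per-token" if g == 128 else f"g{g}"
    print(f"{label:>9}: rel-err {e:.3f}")
```

With one scale per token, the outlier sets the step size and every ordinary channel rounds to zero; smaller groups confine that damage to the outlier's own group.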

ACTIVE TRACK — W4A8 SVDQuant (fake-quant quality study; A100-era grid below)

Quantize all 100 block Linears: smooth → SVD rank-r low-rank (16-bit) → 4-bit residual + 8-bit per-token activations. smaller = quantized-weight bytes vs bf16 (a real low-bit kernel realizes this; fake-quant here measures quality only — no wall/flop yet on A100).

Three composable knobs (all W4A8, α=0.5, group=64, same fixed eval batch):

plain = SVD of smoothed weight (base SVDQuant paper's headline derivation)
whiten = activation-aware SVD minimizing OUTPUT error ‖X̂(Ŵ−L)‖ (ASVD/SVD-LLM idea; our add)
+refine = iterative low-rank refinement, re-fit L to absorb 4-bit quant error (SVDQuant §4.2)
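The plain and +refine knobs can be sketched on a tiny matrix. This is a minimal illustration under stated assumptions, not the code in flux2distill/svdquant.py: rank 1 via power iteration stands in for the rank-r SVD, each row is one quantization group, smoothing and whitening are omitted, and the best iterate is kept because alternating with a re-derived scale is not guaranteed to be monotone.

```python
import math

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def fro(A):
    return math.sqrt(sum(x * x for row in A for x in row))

def rank1(M, iters=50):
    """Approximate best rank-1 fit of M by power iteration (stand-in for rank-r SVD)."""
    Mt = [list(col) for col in zip(*M)]
    v = [1.0] * len(M[0])
    for _ in range(iters):
        v = matvec(Mt, matvec(M, v))
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    u = matvec(M, v)  # sigma * left singular vector
    return [[ui * vj for vj in v] for ui in u]

def quant4(M):
    """Symmetric 4-bit quant-dequant with one absmax scale per row."""
    out = []
    for row in M:
        amax = max(abs(x) for x in row)
        scale = amax / 7.0 if amax > 0 else 1.0
        out.append([max(-7, min(7, round(x / scale))) * scale for x in row])
    return out

def svd_quant(W, refine_steps=0):
    """W ~ L + Q(W - L); each refine step refits L to W - Q and keeps the best error."""
    L, best = rank1(W), None
    for _ in range(refine_steps + 1):
        Q = quant4(sub(W, L))
        err = fro(sub(sub(W, L), Q)) / fro(W)
        if best is None or err < best[0]:
            best = (err, L, Q)
        L = rank1(sub(W, Q))  # refit the low-rank branch to absorb quant error
    return best

# Toy weight with one outlier column (hypothetical values).
W = [[8.0, 0.3, -0.2, 0.1], [7.5, -0.4, 0.25, 0.05],
     [-8.2, 0.15, 0.35, -0.3], [7.9, 0.2, -0.1, 0.45]]
rtn_err = fro(sub(W, quant4(W))) / fro(W)
plain_err = svd_quant(W, refine_steps=0)[0]
refined_err = svd_quant(W, refine_steps=5)[0]
print(f"RTN {rtn_err:.4f} | plain {plain_err:.4f} | plain+refine {refined_err:.4f}")
```

The low-rank branch absorbs the outlier column, so the 4-bit residual no longer has its scale dictated by it; that is the weight-side role of the branch that the rank ablations probe.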

FULL 4×3 GRID (2026-06-01) — every method × every rank, fixed 300-img calib

Closes the old "L-shape, not a grid" gap: each cell built one-at-a-time on the SAME 300-image calib from data/monet_cache (latents), so all 12 are directly comparable. eval velocity-loss (lower=closer to teacher); wrecon = mean weight-recon rel-err. Dirs: outputs/abl_c300_r{R}_{variant}.

rank	smaller	plain	plain+refine	whiten	whiten+refine
16	3.67×	0.0620	0.0655	0.0656	0.0556
32	3.59×	0.0586	0.0574	0.0545	0.0476
64	3.43×	0.0487	0.0446 ← grid best	0.0588	0.0451

Best per rank: r16 whiten+refine 0.0556; r32 whiten+refine 0.0476; r64 plain+refine 0.0446 (whiten+refine 0.0451 ≈ tie). Overall best = r64 plain+refine 0.0446 @ 3.43×.

Full per-cell metrics (all 12, real measured)

eval-loss = held-out velocity-matching loss · vel-relerr = velocity field rel-L2 vs teacher · wrecon = mean weight-recon rel-err (100 layers) · orecon = mean output-recon rel-err (what whitening optimizes; only computed for the whitened cells).

rank	variant	smaller	eval-loss	vel-relerr	wrecon-relerr	orecon-relerr
16	plain	3.67×	0.0620	0.2235	0.1269	—
16	plain+refine	3.67×	0.0655	0.2297	0.1251	—
16	whiten	3.67×	0.0656	0.2299	0.1290	0.0733
16	whiten+refine	3.67×	0.0556	0.2117	0.1273	0.0680
32	plain	3.59×	0.0586	0.2174	0.1224	—
32	plain+refine	3.59×	0.0574	0.2151	0.1193	—
32	whiten	3.59×	0.0545	0.2095	0.1257	0.0719
32	whiten+refine	3.59×	0.0476	0.1959	0.1226	0.0646
64	plain	3.43×	0.0487	0.1980	0.1163	—
64	plain+refine	3.43×	0.0446	0.1896	0.1110	—
64	whiten	3.43×	0.0588	0.2177	0.1209	0.0695
64	whiten+refine	3.43×	0.0451	0.1907	0.1155	0.0595

All three metrics track together (lower wrecon/orecon ↔ lower eval-loss) within a rank, with the notable exception that plain+refine attains the lowest wrecon at each rank yet only the lowest eval-loss at r64 — confirming weight-recon ≠ the eval objective (refine minimizes weight error, which only aligns with the velocity loss once rank is high enough). Build/eval logs: tmp/abl_r*_*.{build,eval}.log.
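Why weight-recon and output-recon can disagree is easy to show on a 2×2 toy. The sketch assumes wrecon = ‖W−Ŵ‖/‖W‖ and orecon = ‖X(W−Ŵ)‖/‖XW‖ in Frobenius norm, which is my reading of the definitions above; the matrices are hypothetical.

```python
import math

def fro(M):
    return math.sqrt(sum(x * x for row in M for x in row))

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def wrecon(W, W_hat):
    """Weight-reconstruction relative error."""
    return fro(sub(W, W_hat)) / fro(W)

def orecon(X, W, W_hat):
    """Output-reconstruction relative error on calibration activations X."""
    return fro(matmul(X, sub(W, W_hat))) / fro(matmul(X, W))

W = [[1.0, 0.0], [0.0, 1.0]]
X = [[10.0, 0.1], [-10.0, 0.1]]      # input channel 0 carries almost all the energy

W_hat_a = [[1.1, 0.0], [0.0, 1.0]]   # same-size error, on the high-energy channel
W_hat_b = [[1.0, 0.0], [0.0, 1.1]]   # same-size error, on the low-energy channel

for name, W_hat in (("a", W_hat_a), ("b", W_hat_b)):
    print(f"{name}: wrecon {wrecon(W, W_hat):.4f}, orecon {orecon(X, W, W_hat):.4f}")
```

Both perturbations have identical wrecon, yet their output errors differ by about two orders of magnitude, because the output error weights each input channel by its activation energy.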

Findings (these OVERTURN the prior L-shape conclusion that "each upgrade compounds"):

Refine is the reliable workhorse — it helps at every rank/metric EXCEPT r16-plain (0.0620→0.0655, where minimizing weight error overfits and drifts from the output-optimal point). Every rank's best variant uses refine.
Whitening ALONE is unreliable at 300-calib — non-monotonic in rank: hurts r16 (0.0656>0.0620), helps r32 (0.0545<0.0586), hurts r64 (0.0588>0.0487). It's overfitting the noisy 300-image Gram; at high rank it fits more directions to bad stats and generalizes worse than plain.
Strong whiten×refine interaction — refine runs IN the whitened metric, correcting whitening's overfitting. At r16 neither upgrade alone helps yet together −10% (0.0620→0.0556). At r32 they stack to the row best (0.0476). At r64 whitening adds nothing over plain+refine (0.0451 vs 0.0446).
At high rank, skip whitening — r64 plain+refine (0.0446) beats/ties everything and needs no Gram (simpler + faster build). Whitening only earns its keep at moderate rank (r32) or paired w/ refine.
Sweet spot is a choice: r64 plain+refine 0.0446 @ 3.43× (max quality, simplest) vs r32 whiten+refine 0.0476 @ 3.59× (a bit more compression). Both are ~5× below the surgery frontier (0.231).

These 300-calib numbers land close to the prior 100-calib report (r32 wr 0.0476 vs 0.0494; r64 wr 0.0451 vs 0.0454) but the per-knob story is different — whitening's instability is the headline. Montages for all 12 cells (8 probe prompts each) under outputs/abl_c300_*/eval/.

OPEN — whitening needs a higher-calib re-test. Its non-monotonic, often-harmful behavior at 300 calib is consistent with Gram under-estimation. The deferred 2000-image calib re-sweep (plan.md §5) is a follow-up: does whitening become reliably beneficial with richer activation statistics? Full methodology + math (whitening, Cholesky→eigh, refinement) in report/QUANT_REPORT.md.

Mechanism ablation — what each piece buys (2026-06-01) · ⚠️ SmoothQuant HURTS at W4A8

Stripping the pipeline down to isolate each mechanism (300-calib, same eval). smaller ≈ 3.76× for the rank-0 rows (no low-rank bytes).

config	eval-loss	vel-relerr	wrecon mean / max	note
RTN W4A8 (no smooth, no SVD, s=1)	0.0573	0.2149	0.1112 / 0.1504	naive floor — yet beats the next two
SmoothQuant W4A8 (rank-0, α=0.5, no SVD)	0.0729	0.2424	0.1356 / 0.2633	smoothing makes it WORSE
+ SVD rank-16 plain (α=0.5)	0.0620	0.2235	0.1269 / 0.198	grid `plain` r16
+ SVD rank-64 plain	0.0487	0.1980	0.1163 / 0.155	grid `plain` r64
grid best: r64 plain+refine (α=0.5)	0.0446	0.1896	0.1110 / —	best so far
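The ≈3.76× figure for the rank-0 rows is consistent with 4-bit weights plus one 16-bit scale per group of 64, i.e. 16 / (4 + 16/64). The sketch below encodes that byte accounting. The 16-bit scale width is an assumption inferred from the ratio, and the 3072×3072 layer shape is hypothetical, so the non-zero-rank ratios it prints are not expected to match the table.

```python
def compression_vs_bf16(out_features, in_features, rank=0,
                        weight_bits=4, group=64, scale_bits=16, lowrank_bits=16):
    """bf16 bytes divided by (4-bit residual + group scales + 16-bit low-rank factors)."""
    n = out_features * in_features
    dense_bits = 16 * n
    residual_bits = (weight_bits + scale_bits / group) * n
    # Two factors: (out_features x rank) and (rank x in_features).
    lowrank = lowrank_bits * rank * (out_features + in_features)
    return dense_bits / (residual_bits + lowrank)

# Hypothetical square layer; only the rank-0 ratio is shape-independent.
for r in (0, 16, 32, 64):
    print(f"rank {r:>2}: {compression_vs_bf16(3072, 3072, rank=r):.2f}x smaller")
```

The ratio falls as rank grows because the low-rank factors are stored at 16 bits, which is the compression cost traded against quality in the grid.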

Headline finding: SmoothQuant at α=0.5 is actively harmful for W4A8. Removing it (RTN, s=1) beats SmoothQuant rank-0 by −21% (0.0729→0.0573) AND beats the smoothed plain-SVD cells at r16/r32 (0.0620, 0.0586). Mechanism: SmoothQuant migrates outliers out of activations into weights, which is a win only when activations are the hard part (low-bit A4). At W4A8 the 8-bit activations are already easy, so migration buys nothing there and widens the weight distribution, making the 4-bit weight quant harder (worst-layer wrecon 0.15→0.26). Implication: the entire α=0.5 grid is mis-tuned, because the SVD branch was partly compensating for smoothing damage. Re-running with no-smooth / low-α should beat 0.0446. Next: run the best config with SMOOTH=0, then an α sweep {0, 0.25, 0.5}. Knob: SMOOTH=0 env on scripts/12 (s=1). Montages: outputs/abl_c300_r0_nosvd{,_nosmooth}/eval/.

SHELVED TRACK — block surgery (depth-prune single blocks → surrogates → distill)

config	params	smaller	wall	flop	eval-loss	status
teacher 4B (baseline)	3.876B	—	1.00x	1.00x	—	reference
v1 per-token drop-12 (SVD-energy)	2.441B	37%	1.45x	1.64x	—	COLLAPSED (non-functional)
per-token drop-6 (importance)	3.158B	19%	1.19x	1.24x	0.308	ok, soft
linattn drop-6 (simple elu+1)	3.177B	18%	1.15x	1.23x	0.253	ok
linattn drop-6 +RoPE+conv+warmstart	3.177B	18%	1.15x	1.23x	0.231	BEST QUALITY
linattn drop-8 +focused+FFN(all)	2.995B	23%	1.20x	1.28x	0.269	best colors/local
linattn drop-10 mixed (4 FFN+6 light)	2.737B	29%	1.26x	1.42x	0.322	KILLED ~step200
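For readers unfamiliar with the "linattn (simple elu+1)" rows: the label presumably refers to kernelized linear attention with the feature map φ(x) = elu(x) + 1. The sketch below is the generic non-causal, single-head formulation under that assumption. It is not the surrogate block used in the surgery track, whose details (RoPE, conv, warm start) are not described here.

```python
import math

def elu_plus_one(x):
    """phi(x) = elu(x) + 1, which is strictly positive."""
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(Q, K, V):
    """O(N) form: accumulate S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j) once."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(K, V):
        fk = [elu_plus_one(x) for x in k]
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = [elu_plus_one(x) for x in q]
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom for b in range(dv)])
    return out

def quadratic_reference(Q, K, V):
    """O(N^2) form with explicit pairwise weights phi(q_i) . phi(k_j), for checking."""
    out = []
    for q in Q:
        fq = [elu_plus_one(x) for x in q]
        w = [sum(a * elu_plus_one(x) for a, x in zip(fq, k)) for k in K]
        total = sum(w)
        out.append([sum(wi * v[b] for wi, v in zip(w, V)) / total
                    for b in range(len(V[0]))])
    return out

# Toy sequence of 3 tokens, head dim 2 (hypothetical values).
Q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
K = [[0.1, 0.4], [-0.7, 1.2], [0.9, -0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(linear_attention(Q, K, V))
```

Both forms give the same output; the linear form avoids the N×N weight matrix, which is the source of the flop savings when a softmax attention block is replaced.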
