Mercity/FluxDistill / RESULTS.md

Results — FLUX.2 klein 4B -> compressed student

Eval = held-out velocity-matching loss vs teacher (lower=closer; same fixed first-16 batch across all rows, so quant and surgery sit on ONE axis). wall=measured s/img @512/4-step batch-1 on A100; flop=estimated transformer-only ratio.

★★ 2026-06-14 — NVFP4 HEAD-TO-HEAD (image-space metrics, N=512) — separate axis

A matched, paired comparison on N=512 MJHQ-30k prompts, 512px, 4 steps, guidance 1.0, seed=idx, using image-space fidelity-to-teacher metrics (LPIPS/PSNR/FID vs the teacher's own outputs) + a PickScore/CLIP semantic check. This is NOT the velocity-loss axis above — do not compare the numbers across sections; compare only within this table. All models share the same Qwen3 TE + VAE (only the transformer quant varies). Full write-up: report/HEADTOHEAD_klein4b_nvfp4.md; raw numbers outputs/eval/h2h/metrics.json; speed outputs/nvfp4/benchmark_headtohead.json.

| model | bits | PickScore↑ | CLIP↑ | LPIPS↓ | PSNR↑ | FID↓ (teacher) | FID (real) | real-kernel speed |
|---|---|---|---|---|---|---|---|---|
| A teacher (bf16) | 16 | 21.64 | 30.95 | | | | 89.6 | 1.0× (0.464 s/img @512) |
| D plain NVFP4 r0 | W4A4 | 21.62 | 30.95 | 0.2076 | 17.44 | 39.31 | 88.5 | n/a (fake-q) |
| ours r128 (fake-q) | W4A4 | 21.61 | 30.99 | 0.1732 | 18.50 | 33.37 | 89.6 | n/a (fake-q) |
| C ours r128 (REAL Nunchaku kernel) | W4A4 | 21.62 | 30.94 | 0.1668 | 18.71 | 33.54 | 90.1 | 1.76×@512 / 1.87×@1024, 12.6 GB |
| E BFL official FP8 | W8A8 | 21.65 | 30.94 | 0.0798 | 23.02 | 18.81 | 89.6 | n/a (needs TensorRT) |
| Δ low-rank branch (r0→r128) | | ≈0 | ≈0 | −19.7% | +1.27 dB | −14.7% | ~flat | |

Findings. (1) The SVDQuant low-rank branch helps at NVFP4 W4A4 — r0→r128: LPIPS −19.7%, PSNR +1.27 dB, FID-vs-teacher −14.7% (fake-q-vs-fake-q ablation agrees: −16.6% / +1.06 dB / −15.1%); reproduces the prior N=256 result at N=512. (2) The real Nunchaku FP4 kernel reproduces (slightly beats) the fake-quant (LPIPS 0.167 vs 0.173) → the gain holds on the deployed model. (3) No semantic loss — PickScore/CLIP flat across all incl. teacher. (4) BFL official FP8 is closest to teacher (8-bit; high-precision/low-speedup point) vs our 4-bit/2.5×-kernel point. (5) FID-vs-real is flat (88–90 incl. teacher) — tracks the klein-vs-MJHQ style gap, not quant; reported, not discriminating. (6) BFL official NVFP4 (model B) could not be run — cutlass tensor-core swizzled layout, needs BFL's TensorRT runtime (see report/HEADTOHEAD_klein4b_nvfp4.md §5); D (our plain r0) is the labeled controlled "plain NVFP4" stand-in, E is the real BFL baseline. No BFL number was fabricated.
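
The Δ row and the fake-q ablation quoted in finding (1) can be recomputed from the table's own numbers. A minimal sketch: the helper names `pct_change` and `low_rank_delta` are ours (not from the repo), and the tuples are copied from rows D, C and the fake-q r128 row. It also makes explicit that the Δ row compares D against C (the real kernel), not against the fake-q row.

```python
def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old

# (LPIPS, PSNR in dB, FID-vs-teacher), copied from the head-to-head table
plain_r0 = (0.2076, 17.44, 39.31)    # D: plain NVFP4 r0
real_r128 = (0.1668, 18.71, 33.54)   # C: r128 on the real Nunchaku kernel
fakeq_r128 = (0.1732, 18.50, 33.37)  # r128, fake-quant

def low_rank_delta(base, variant):
    """Return (LPIPS %, PSNR dB, FID %) deltas of variant relative to base."""
    return (
        round(pct_change(base[0], variant[0]), 1),
        round(variant[1] - base[1], 2),
        round(pct_change(base[2], variant[2]), 1),
    )

print("D vs C (Δ row):      ", low_rank_delta(plain_r0, real_r128))
print("D vs fake-q ablation:", low_rank_delta(plain_r0, fakeq_r128))
```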

★ 2026-06-13 — NVFP4 (Blackwell-native FP4) + first REAL kernel speed (Nunchaku)

Two things landed this day: (1) NVFP4 added to our fake-quant SVDQuant (flux2distill/svdquant.py) and swept; (2) the first real low-bit kernel speed numbers, by calling Nunchaku's compiled NVFP4 W4A4 GEMM directly. NVFP4 beats INT4 on BOTH quality and speed — it's the format for this box (and any Blackwell / RTX 50 / B200). Same eval axis as the 2026-06-10 cells below.

NVFP4 format: E2M1 elements (the 8 magnitudes {0,.5,1,1.5,2,3,4,6}·sign) + group-16 blocks + FP8(E4M3) block scales, applied to the 4-bit residual weights (and optionally activations). The low-rank branch stays bf16 — it is the high-precision error-/outlier-absorbing path. Knobs: WFMT={int,nvfp4}, AFMT={int,nvfp4,fp8} on scripts/12 (driver scripts/run_nvfp4_cell.sh).
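
To make the format concrete, here is a minimal pure-Python sketch of E2M1 fake-quantization over group-16 blocks. It is not the code in flux2distill/svdquant.py; the function names are hypothetical, and as a simplification the block scale is kept in full precision, whereas real NVFP4 stores it as FP8 (E4M3).

```python
import math

# The 8 E2M1 magnitudes; the sign is carried separately.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def quantize_block_e2m1(block):
    """Fake-quantize one block: map absmax to 6, round to the nearest grid value."""
    absmax = max(abs(v) for v in block)
    if absmax == 0.0:
        return [0.0] * len(block)
    # Simplification: full-precision scale (real NVFP4 rounds it to FP8 E4M3).
    scale = absmax / 6.0
    out = []
    for v in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(mag * scale, v))
    return out

def fake_quant_nvfp4(values, group=16):
    """Apply E2M1 fake-quant independently to each block of `group` values."""
    out = []
    for start in range(0, len(values), group):
        out.extend(quantize_block_e2m1(values[start:start + group]))
    return out

# Values already on the grid survive unchanged; 4.4 snaps to 4.0 (scale = 2).
print(fake_quant_nvfp4([6.0, 3.0, 0.5, -1.5] + [0.0] * 12))
print(fake_quant_nvfp4([12.0, 4.4]))
```

The grid is denser near zero than a uniform INT4 grid, and each block of 16 gets its own scale, which is the mechanism the findings credit for the gain over INT4 g64.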

Quality — NVFP4 sweep (klein-4B, plain+refine, no-smooth, same held-out velocity loss)

| # | weights | acts | rank | wrecon mean | eval-loss | vs reference |
|---|---|---|---|---|---|---|
| 1 | NVFP4 g16 | NVFP4 g16 | 32 | 0.0865 | 0.0390 | |
| 2 | NVFP4 g16 | NVFP4 g16 | 64 | 0.0817 | 0.0364 | INT4 W4A4 r64 = 0.0742 (2.0× worse) |
| 3 | NVFP4 g16 | NVFP4 g16 | 128 | 0.0742 | 0.0303 | INT4 W4A4 r128 = 0.0610 (2.0×); ≈ INT4 W4A8 champ 0.0297 |
| 4 | NVFP4 g16 | FP8 E4M3 | 64 | 0.0817 | 0.0204 | INT4 W4A8 r64 = 0.0297 (prior overall best) |
| 5 | NVFP4 g16 | FP8 E4M3 | 128 | 0.0742 | 0.0169 | −43% vs the 0.0297 INT4 W4A8 champion |

NEW OVERALL QUANT CHAMPION: NVFP4-weights + FP8-acts, r128 = 0.0169. Dirs outputs/nvfp4_*.

Findings: (1) NVFP4 weights ≫ INT4 weights — ~2× lower loss at matched rank; driver is the finer group-16 + E2M1 float grid (unit test: outlier column 0.064 vs INT4-g64 0.115). (2) NVFP4 W4A4 r128 (0.0303) matches the INT4 W4A8 champion (0.0297) while keeping activations at 4-bit — full W4A4 at 8-bit-act quality. (3) FP8 acts buy more quality (0.0169) but cost speed (below). Visually the champions are teacher-indistinguishable on the text + hand probes (the quant-sensitive ones).

Speed — REAL kernels, klein-4B layer shapes, T=1536, RTX PRO 4500 Blackwell (sm_120)

scripts/23_nvfp4_kernel_bench.py calls Nunchaku's compiled svdq_gemm_w4a4_cuda (NVFP4 W4A4) at each of klein-4B's 5 distinct Linear shapes, summed over all 100 Linears. FP8 row = torch._scaled_mm (NOT Nunchaku — a proxy for the W4+FP8-act variant, which has no fused kernel on Blackwell).

| path | ms/step | speedup vs bf16 | deploys |
|---|---|---|---|
| bf16 | 73.8 | 1.00× | baseline |
| NVFP4 W4A4 r64 (real Nunchaku kernel) | 26.8 | 2.75× | cell 2 (0.0364) |
| NVFP4 W4A4 r128 (real Nunchaku kernel) | 29.7 | 2.49× | cell 3 (0.0303) |
| FP8 proxy r64 | 60.3 | 1.22× | cell 4 (0.0204) |
| FP8 proxy r128 | 60.8 | 1.21× | cell 5 (0.0169) |

Matches the 9B end-to-end Nunchaku number (FP4 254 ms/step = 2.69× vs bf16 684; full pipeline 1.29 s/img @1024²/4-step, 24.95 GB). Rank tax (real W4A4 kernel): r64→r128 ≈ 11% (2.75×→2.49×) for the 0.0364→0.0303 quality gain — rank is a quality knob with a small, real speed cost.
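
The speedup column and the rank tax follow directly from the ms/step column. A small sketch with the table values hard-coded (the dict and function names are ours). Note that the rounded 29.7 ms gives 2.48× for r128 rather than the reported 2.49×; we assume the table was computed from unrounded timings.

```python
# ms/step copied from the kernel-bench table above
MS_PER_STEP = {
    "bf16": 73.8,
    "NVFP4 W4A4 r64": 26.8,
    "NVFP4 W4A4 r128": 29.7,
    "FP8 proxy r64": 60.3,
    "FP8 proxy r128": 60.8,
}

def speedup(path, baseline="bf16"):
    """Per-step speedup of `path` relative to the baseline path."""
    return MS_PER_STEP[baseline] / MS_PER_STEP[path]

def rank_tax(low="NVFP4 W4A4 r64", high="NVFP4 W4A4 r128"):
    """Extra per-step time of the higher-rank cell, as a fraction of the lower."""
    return MS_PER_STEP[high] / MS_PER_STEP[low] - 1.0

for name in MS_PER_STEP:
    print(f"{name:18s} {speedup(name):.2f}x")
print(f"rank tax r64->r128: {rank_tax():.1%}")
```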

End-to-end (real Nunchaku kernel, klein-4B Linears swapped to NVFP4 W4A4, rest bf16): bf16 → NVFP4 is 1.24× @512px / 1.18× @1024px with ~29% less VRAM (16.9→12.0 GB @512). This is far below the 2.5× per-layer GEMM speedup because attention is bf16 + O(N²) and dominates (more so at 1024px), VAE + text-encode are fixed bf16 overhead, and a Linears-only swap is unfused. The 9B Nunchaku FULL pipeline hit 2.69× precisely because it ALSO fuses attention/quant and is more GEMM-heavy, so the lever for a real 4B speedup is the fully-fused NunchakuFlux2 model (fused attention), not just quantized Linears. Rank tax is negligible end-to-end here (low-rank branch tiny vs the rest). scripts/24_nunchaku_e2e_speed.py.
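
The gap between the per-layer and end-to-end numbers is what Amdahl's law predicts when only part of the runtime is accelerated. The sketch below is a back-of-envelope model, not a measurement: it assumes the Linear GEMMs are the only accelerated part and that they speed up by exactly the per-layer 2.49×. It ignores the extra unfused quantize/dequantize work, so the true GEMM share is probably higher than the value it implies.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the time is sped up by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

def implied_fraction(overall_speedup, local_speedup):
    """Invert Amdahl's law: share of baseline time that must have been accelerated."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / local_speedup)

# Measured end-to-end speedups vs the per-layer GEMM speedup (r128)
f512 = implied_fraction(1.24, 2.49)
f1024 = implied_fraction(1.18, 2.49)
print(f"implied Linear share of bf16 time @512:  {f512:.0%}")
print(f"implied Linear share of bf16 time @1024: {f1024:.0%}")
```

Under these assumptions the Linears account for roughly a third of the bf16 time at 512px and about a quarter at 1024px, which is consistent with attention taking a larger share at higher resolution.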

★ FULLY-FUSED end-to-end (the real lever): converting klein-4B to the NunchakuFlux2Transformer2DModel fused path (fused_qkv_norm_rottary + attention_fp16 + W4A4 GEMM, no bf16 SDPA) gives 1.76× @512px and 1.87× @1024px (per-step 2.2×/2.29×, VRAM −26%: 18.6→13.7 GB @1024). This is the real number — fusing attention lifts the Linears-only 1.2× to 1.9×, and unlike the Linears-only swap it improves with resolution. The transformer step (2.3×) ≈ the per-layer GEMM (2.5×); end-to-end is capped by the 4B's VAE+text-encode (30% of its small total). scripts/25_fused_4b_speed.py (dummy weights — speed is value-independent; built from source via /workspace/build_nunchaku). Deployable single artifact (fast + correct weights) — ✅ DONE.

★ DEPLOYABLE NVFP4-fused klein-4B (correct weights + real kernel). Wrote our own NVFP4 weight exporter (flux2distill/nunchaku_export.py): per-Linear SVD low-rank + iterative refine → NVFP4 residual (E2M1 codes + per-group-16 FP8 wscales + per-tensor alpha, wcscales=1), packed into Nunchaku's MMA layout via their pack_weight/pack_micro_scale/pack_lowrank_weight. Convention validated (scripts/26): with pre-quantized acts the real kernel reconstructs our intended weight to 2.99%. scripts/27_convert_full_4b.py converts all 120 fused Linears (handling the qkv fusion + the single-block to_qkv_mlp_proj/to_out splits), loads them into NunchakuFlux2Transformer2DModel, and generates correct images on the real FP4 kernel — teacher-indistinguishable on the text ("THE OPEN PAGE" legible) and hand probes (montages outputs/nvfp4/deploy/). This is the full deployable model: NVFP4 quality (≈0.0303) + the fused ~1.9× speed + −26% VRAM in one artifact. (NB the fused packed-rotary path requires batch=1 generation.)

The quality↔speed fork (both deployable on Blackwell, pick one):

  • Speed: NVFP4 W4A4 r128 → 0.0303 @ 2.49× (real Nunchaku kernel, loads today). r64 → 0.0364 @ 2.75×.
  • Quality: NVFP4-W + FP8-A r128 → 0.0169, but FP8 compute caps at ~2× bf16 (FP8 tensor cores are half FP4 throughput) and has no fused kernel (1.2× measurable today). The tradeoff is physical.

Why INT4 is the wrong format on Blackwell (hardware, sourced): the 5th-gen tensor cores natively do FP4/FP6/FP8/INT8/BF16/… but dropped INT4 (Turing/Ampere/Ada had INT4 IMMA; sm_120 doesn't). Nunchaku ships INT4 for Turing/Ampere/Ada and NVFP4 for Blackwell; get_precision() returns fp4 here. Forcing svdq-int4 on this card → 1677 ms/step (slower than bf16). So INT4 W4A4 stays the deployable format for the huge RTX 20/30/40 base; NVFP4 is for Blackwell/B200 — complementary, one per generation.
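
The per-generation rule can be written down as a tiny lookup. This is a hypothetical helper for illustration, not Nunchaku's get_precision(); the mapping of compute capabilities (sm_75 to sm_89 for Turing/Ampere/Ada, sm_100 for B200, sm_120 for RTX 50 / RTX PRO Blackwell) is our assumption, and anything else is rejected rather than guessed.

```python
def pick_weight_format(compute_capability):
    """Map a CUDA compute capability (major, minor) to a 4-bit weight format.

    Assumption: sm_75..sm_89 (Turing/Ampere/Ada) have INT4 tensor-core paths,
    while sm_100 / sm_120 (Blackwell) take NVFP4 instead.
    """
    major, minor = compute_capability
    if major in (10, 12):
        return "nvfp4"
    if (7, 5) <= (major, minor) <= (8, 9):
        return "int4"
    raise ValueError(f"no 4-bit format assumed for sm_{major}{minor}")

print(pick_weight_format((12, 0)))  # RTX PRO 4500 Blackwell (sm_120)
print(pick_weight_format((8, 9)))   # Ada
```

On a live machine the tuple would come from `torch.cuda.get_device_capability()`.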

⚠️ 2026-06-10 — box rebuilt AGAIN; eval axis SHIFTED; old numbers not comparable

The box was rebuilt (A100-80GB → RTX PRO 4500 Blackwell 32 GB, system python, torch 2.12+cu130, transformers 5.9→5.10.2, diffusers git Jun-1→Jun-10). Re-evaluating the UNCHANGED grid-best checkpoint (r64 plain+refine, zero missing/unexpected keys on load) gives 0.0325 (vel-relerr 0.1661) vs the recorded 0.0446 (0.1896) — a −27% instrument shift, NOT a model change. Cause: the 16-sample eval ran through different SDPA kernels, different Qwen3 prompt embeddings (transformers bump), and possibly different seeded σ/noise draws; with N=16 that easily moves the mean. Rule: numbers are only comparable WITHIN one box era. The 4×3 grid + mechanism-ablation tables below are the OLD axis (still internally consistent); every 2026-06-10+ experiment is compared against same-box re-evals. Re-anchored baseline: r64 plain+refine (α=0.5, smoothed) = 0.0325 (outputs/recheck_r64_plain_refine, montages in its eval/).

Montage read of that baseline on the new box (teacher|quant, 8 probes): texture/scene/large-text probes (storefront, lake, neon, spiderweb, fisherman) ≈ indistinguishable; the quant misses the counting probe (2 fried eggs vs the teacher's correct 3) and mangles the hand probe (folded/merged fingers); chalkboard small-print is gibberish in BOTH (teacher limitation). Counting + hand are the sensitive probes to watch across the SMOOTH=0 ablation.

✅ SMOOTH=0 ablation (2026-06-10, new box) — CONFIRMED: drop SmoothQuant at W4A8

The #1 queued experiment, run at 3 ranks (plain+refine, 300-calib, all new-axis; each SMOOTH=0 build compared against a same-box re-eval of its smoothed α=0.5 twin). Dirs: outputs/abl_c300_r{16,32,64}_plain_refine_nosmooth vs outputs/recheck_r{16,32,64}_plain_refine.

| rank | smoothed α=0.5 (re-eval) | SMOOTH=0 | Δ eval-loss | wrecon mean (sm → ns) | ns wrecon max |
|---|---|---|---|---|---|
| 16 | 0.0405 / rel 0.1855 | 0.0348 / 0.1719 | −14.1% | 0.1251 → 0.1041 | 0.1379 |
| 32 | 0.0362 / rel 0.1753 | 0.0331 / 0.1675 | −8.6% | 0.1193 → 0.1003 | 0.1286 |
| 64 | 0.0325 / rel 0.1661 | 0.0297 / 0.1588 | −8.6% | 0.1110 → 0.0944 | 0.1124 |

Findings:

  1. SMOOTH=0 wins at every rank — new overall best: r64 plain+refine no-smooth = 0.0297. SMOOTH=0 is now the DEFAULT for all W4A8 builds.
  2. The win grows as rank shrinks (−8.6% at r64/r32 → −14.1% at r16): the SVD branch was partly compensating smoothing damage, and with less capacity more damage shows through. Confirms the mechanism from the rank-0 ablation (smoothing widens the 4-bit weight spread).
  3. No-smooth buys ~one rank tier for free: ns-r16 (0.0348) beats smoothed-r32 (0.0362); ns-r32 (0.0331) ≈ smoothed-r64 (0.0325). I.e. same quality at ~2–5% more compression (3.59×→3.67× for r32→r16, 3.43×→3.59× for r64→r32, per the size column of the 4×3 grid).
  4. Visual (8-probe montages): the hand probe — the most quant-sensitive — is visibly FIXED vs the smoothed baseline (plausible spread fingers at all 3 ranks vs mangled/merged at the smoothed r64 baseline). The counting probe (3 eggs) is still missed by EVERY quant (2 eggs), smoothed or not, all ranks — a rank/smooth-independent semantic drift. Large in-image text is always preserved.
  5. wrecon improves ~15% at every rank and the worst layer drops to 0.11–0.14 (vs 0.15–0.26 smoothed) — exactly the predicted mechanism.

✅ α sweep (2026-06-10) — the dial has NO good setting; low α is the WORST

α ∈ {off, 0.1, 0.25, 0.5} × r{32,64}, plain+refine, 300-calib, new axis (dirs outputs/abl_c300_r{32,64}_plain_refine_a{10,25}; off/0.5 cells from above):

| α | r64 eval-loss | r32 eval-loss | r64 wrecon | r32 wrecon |
|---|---|---|---|---|
| off | 0.0297 | 0.0331 | 0.0944 | 0.1003 |
| 0.1 | 0.0380 ⚠ worst | 0.0408 ⚠ worst | 0.0999 | 0.1057 |
| 0.25 | 0.0317 | 0.0349 | 0.1007 | 0.1069 |
| 0.5 | 0.0325 | 0.0362 | 0.1110 | 0.1193 |

Identical ordering at both ranks (off < 0.25 < 0.5 ≪ 0.1) — a replicated U-shape:

  1. No α beats off — every dose of migration hurts at W4A8; SMOOTH=0 is the permanent default.
  2. α=0.1 is the worst point, not the safest: at low α the factor ≈ 1/max|W|^(1-α) (the weight-equalizing extreme). wrecon barely moves (0.0999 vs 0.1007 at r64) yet eval-loss jumps +20% — the rescale rotates residual quant error into output-relevant directions. The sharpest demonstration yet that weight-recon ≠ the eval objective; never tune α by wrecon.
  3. Montages agree: the hand probe (cleanly rendered at SMOOTH=0) regresses to merged fingers at α=0.1. Counting (3 eggs) fails everywhere, α-independent.
  4. Determinism note: a repeat eval of an unchanged cell reproduces its loss exactly (0.0331 → 0.0331), and rebuilt cells reproduce wrecon bit-for-bit — cell deltas on this box are real, not run-to-run noise.
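
For reference, the α dial in the sweep above is the exponent in the standard SmoothQuant per-channel factor, s_j = max|X_j|^α / max|W_j|^(1−α), with activations divided by s_j and the matching weight input-channel multiplied by it. A minimal sketch of that formula; we assume the repo's smoothing follows the standard definition, and `smooth_scales` is a hypothetical name.

```python
def smooth_scales(act_absmax, weight_absmax, alpha, eps=1e-5):
    """Per-input-channel factors s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [
        max(a, eps) ** alpha / max(w, eps) ** (1.0 - alpha)
        for a, w in zip(act_absmax, weight_absmax)
    ]

act_absmax = [9.0, 1.0, 0.25]   # toy per-channel activation ranges
weight_absmax = [4.0, 1.0, 2.0]  # toy per-channel weight ranges

for alpha in (0.0, 0.1, 0.5, 1.0):
    scales = smooth_scales(act_absmax, weight_absmax, alpha)
    # After smoothing, channel j of the weight has absmax w_j * s_j.
    smoothed_w = [w * s for w, s in zip(weight_absmax, scales)]
    print(alpha, [round(s, 3) for s in scales], [round(w, 3) for w in smoothed_w])
```

At α=0 every weight channel is rescaled to absmax 1 regardless of the activations, which is the weight-equalizing extreme named in finding 2; at α=1 the factor depends on the activation range alone.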

The old-axis grid below retains its per-knob story (refine reliable, whitening unstable at 300-calib) but all its α=0.5 absolute numbers are superseded by no-smooth.

W4A4 ablation (2026-06-10, 3 cells) — the smoothing flip CONFIRMED; naive A4 not viable yet

All plain+refine, 300-calib, new axis. ABITS=4 (per-token dynamic 4-bit acts — same scheme as A8, just 16 levels). Dirs outputs/abl_c300_r{64,128}_w4a4_plain_refine_{nosmooth,a50}.

| cell | eval-loss | vel-relerr | wrecon mean | vs W4A8 twin |
|---|---|---|---|---|
| r64 SMOOTH=0 (A8 recipe) | 0.5103 | 0.6582 | 0.0944 | 0.0297 → 17× worse |
| r64 α=0.5 | 0.3885 | 0.5743 | 0.1111 | smoothing helps +24% |
| r128 α=0.5 | 0.3060 | 0.5097 | 0.0992 | rank helps +21% |

Findings:

  1. The smoothing flip is real and symmetric. At A8 smoothing hurt −9%; at A4 it helps +24%. Its value is purely a function of activation bit-width — at A4 the per-token 4-bit quant is destroyed by channel outliers (one outlier forces the whole token's scale → small channels round to 0), which is exactly what migration mitigates. Mechanism fully closed.
  2. Rank matters more at A4 than A8 (r64→r128: −21%): the low-rank branch runs on FULL-precision activations, so extra rank routes more computation around the 4-bit bottleneck; at A4 rank is an activation shield, not just a weight-outlier absorber.
  3. Naive W4A4 is NOT viable: best cell 0.3060 is ~10× the W4A8 best (0.0297). Visually: no-smooth A4 shatters in-image text ("THE OPEN PAGE" → "PINE OPEEN I AAGE"); α=0.5 restores readable text; r128 renders the storefront cleanly — but composition/anatomy stay broken (hand probe: two wrong-gesture hands). (No cross-axis comparison to the old surgery numbers.)
  4. Next lever (queued): per-group activation quant — give each group of ~64 channels its own dynamic scale (the weight side already does this; Atom/QServe-style). Attacks the outlier problem per-token/per-group with zero weight-side cost — expected to beat the whole α dial.

W4A4 α-up sweep (2026-06-10, 3 more cells) — A4 has an INTERIOR optimum at α≈0.75

plain+refine, 300-calib, new axis. Dirs outputs/abl_c300_r{64,128}_w4a4_plain_refine_a{75,100}.

| α | r64 eval-loss | r128 eval-loss | wrecon mean/max (r64 · r128) |
|---|---|---|---|
| off | 0.5103 | | 0.0944/0.1124 · n/a |
| 0.5 | 0.3885 | 0.3060 | 0.1111/0.1486 · 0.0992/0.1344 |
| 0.75 | 0.2819 | 0.2080 ★ W4A4 best | 0.1345/0.2585 · 0.1168/0.1960 |
| 1.0 | | 0.2397 | n/a · 0.1469/0.3423 |

Findings:

  1. Optimal α ≈ 0.75 at (per-token) A4: at r128 the loss descends through 0.75 (0.2080) and turns back up at 1.0 (0.2397). Full flattening wrecks the weights (worst-layer wrecon 0.34) faster than it relieves the activations, even with r128+refine absorbing. Campaign symmetry: the optimal α tracks the bottleneck (A8 → off; per-token A4 → 0.75; nothing in between wins at either).
  2. W4A4 improved 0.5103 → 0.2080 (−59%) via smoothing+rank alone — still ~7× the W4A8 champion (0.0297). Visual at the best cell: in-image text nearly clean (one corrupted glyph), coherent compositions return, but counting/gesture still wrong. Not deployable yet.
  3. ⚠ Paper-spec correction (from re-reading SVDQuant): the paper's W4A4 uses per-GROUP activations (group 64, like its weights; NVFP4 = group 16) — our per-token activations reproduce their baselines (ViDiT-Q/MixDQ, which fail catastrophically exactly like our cells). So per-group acts isn't an enhancement, it's the missing piece of the actual SVDQuant W4A4 recipe → implemented below; step change confirmed.

✅ W4A4 per-group activations (2026-06-10) — the fix; W4A4 becomes viable

Implemented a_group (per-group dynamic act scales along channels; AGROUP env on scripts/12, recorded in quant_config.json). Unit test: one 60× outlier channel → per-token A4 rel-err 0.59, g64 0.11 (5×). 2×2 grid {r64,r128} × {SMOOTH=0, α=0.5}, AGROUP=64, plain+refine, 300-calib. Dirs outputs/abl_c300_r{64,128}_w4a4g64_{nosmooth,a50}.

| W4A4 g64 acts | SMOOTH=0 | α=0.5 | (per-token, for scale) |
|---|---|---|---|
| r64 | 0.0742 | 0.0759 | 0.5103 (ns) / 0.3885 (a50) |
| r128 | 0.0610 ★ W4A4 best | 0.0620 | n/a / 0.3060 (a50) |

Findings:

  1. Per-group acts (g64) is THE W4A4 fix: −85% (0.5103 → 0.0742 at r64-ns) — far beyond everything the α/rank campaign bought combined. W4A4 best now 0.0610 (r128-ns), ~2× the W4A8 champion (0.0297) instead of 17×.
  2. With per-group acts, smoothing is dead weight again (a50 slightly worse at both ranks) — the outlier problem belongs to quantizer granularity, not weight-side migration, at every bit-width. The W4A4 recipe converges to the same clean form as W4A8: plain SVD + refine, NO smoothing, per-group W and A, rank to taste.
  3. This recipe is calibration-free (no smooth → absmax unused; no whiten → Gram unused; acts dynamic) — calib size/content is irrelevant to the current champions; it only matters if whitening returns or for future QAT. (NB the jasperai/monet calib set is NOT paintings — it's diverse photographic data, 260/400 captions mention text/signs — so content-narrowness was never a confound; "Monet" is just the dataset name.)
  4. Qualitative (24 montages reviewed): per-group fixes gross text destruction, anatomy collapse and composition smearing entirely; residual A4 damage = symbolic precision — single-glyph text errors ("PAGE"→"PAYE"/"PACE"), counting flicker (2 vs 3 eggs, seed-fragile, non-monotone in loss), slight identity drift. no-smooth cells are visually cleanest (hands track the metric).
  5. Act group-size ladder (unit test): per-token 0.60 → g128 0.17 → g64 0.13 → g32 0.10 → g16 0.077 — g16 (NVFP4's native group) is the queued next knob (sim-only for INT4; deployable as NVFP4 on Blackwell).
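
The outlier mechanism behind these findings is easy to reproduce on a toy token. The sketch below is not the repo's unit test (its exact setup is not recorded here, so the absolute numbers will differ); it assumes symmetric dynamic 4-bit quantization with levels −7..7 and one planted outlier channel, and all names are ours.

```python
import math
import random

def quant_int4(values):
    """Symmetric dynamic 4-bit fake-quant with one shared scale (levels -7..7)."""
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return list(values)
    scale = absmax / 7.0
    return [max(-7, min(7, round(v / scale))) * scale for v in values]

def quant_per_group(values, group):
    """Same quantizer, but each block of `group` channels gets its own scale."""
    out = []
    for start in range(0, len(values), group):
        out.extend(quant_int4(values[start:start + group]))
    return out

def rel_err(ref, approx):
    """Relative L2 error of approx against ref."""
    num = math.sqrt(sum((r - a) ** 2 for r, a in zip(ref, approx)))
    den = math.sqrt(sum(r * r for r in ref))
    return num / den

rng = random.Random(0)
token = [rng.gauss(0.0, 1.0) for _ in range(256)]
token[0] = 60.0  # one outlier channel, ~60x the typical magnitude

err_token = rel_err(token, quant_int4(token))           # per-token scale
err_g64 = rel_err(token, quant_per_group(token, 64))    # group 64
err_g16 = rel_err(token, quant_per_group(token, 16))    # group 16 (NVFP4-sized)
print(f"per-token {err_token:.3f}  g64 {err_g64:.3f}  g16 {err_g16:.3f}")
```

With a per-token scale the outlier sets the step size for all 256 channels and the rest round to zero; with groups, only the outlier's own block pays that cost.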

ACTIVE TRACK — W4A8 SVDQuant (fake-quant quality study; A100-era grid below)

Quantize all 100 block Linears: smooth → SVD rank-r low-rank (16-bit) → 4-bit residual + 8-bit per-token activations. smaller = quantized-weight bytes vs bf16 (a real low-bit kernel realizes this; fake-quant here measures quality only — no wall/flop yet on A100).

Three composable knobs (all W4A8, α=0.5, group=64, same fixed eval batch):

  • plain = SVD of smoothed weight (base SVDQuant paper's headline derivation)
  • whiten = activation-aware SVD minimizing OUTPUT error ‖X̂(Ŵ−L)‖ (ASVD/SVD-LLM idea; our add)
  • +refine = iterative low-rank refinement, re-fit L to absorb 4-bit quant error (SVDQuant §4.2)
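
A compact NumPy sketch of the plain and +refine knobs, under stated assumptions: smoothing is left out (s=1), activations are not quantized, the weight layout is (out, in) with groups along the input dimension, and every function name is hypothetical. It illustrates the decomposition W ≈ L + Q(W − L); it is not the implementation in flux2distill/svdquant.py.

```python
import numpy as np

def quant_int4_groups(w, group=64):
    """Symmetric 4-bit fake-quant, one scale per group of `group` input channels."""
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group, group)
    scale = np.abs(g).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)
    q = np.clip(np.round(g / scale), -7, 7) * scale
    return q.reshape(out_f, in_f)

def low_rank(w, rank):
    """Best rank-`rank` approximation of w via truncated SVD."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def svdquant(w, rank, refine_iters=0, group=64):
    """Return (low, res_q): high-precision low-rank branch + 4-bit residual."""
    low = low_rank(w, rank)                    # "plain": SVD of the weight
    res_q = quant_int4_groups(w - low, group)
    for _ in range(refine_iters):              # "+refine": re-fit L to the quant error
        low = low_rank(w - res_q, rank)
        res_q = quant_int4_groups(w - low, group)
    return low, res_q

def wrecon(w, approx):
    """Weight-reconstruction relative error."""
    return np.linalg.norm(w - approx) / np.linalg.norm(w)

rng = np.random.default_rng(0)
# Toy weight: small dense noise plus one strong rank-1 outlier direction.
w = 0.02 * rng.standard_normal((128, 128)) + 2.0 * np.outer(
    rng.standard_normal(128), rng.standard_normal(128)
)

err_rtn = wrecon(w, quant_int4_groups(w))
low, res_q = svdquant(w, rank=8)
err_plain = wrecon(w, low + res_q)
low, res_q = svdquant(w, rank=8, refine_iters=4)
err_refine = wrecon(w, low + res_q)
print(f"RTN {err_rtn:.4f}  plain r8 {err_plain:.4f}  plain+refine r8 {err_refine:.4f}")
```

The low-rank branch absorbs the outlier direction, so the 4-bit grid only has to cover the small residual. Refinement alternates between the two pieces; it targets weight error, which the grid results show is not always aligned with the eval loss.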

FULL 4×3 GRID (2026-06-01) — every method × every rank, fixed 300-img calib

Closes the old "L-shape, not a grid" gap: each cell built one-at-a-time on the SAME 300-image calib from data/monet_cache (latents), so all 12 are directly comparable. eval velocity-loss (lower=closer to teacher); wrecon = mean weight-recon rel-err. Dirs: outputs/abl_c300_r{R}_{variant}.

| rank | smaller | plain | plain+refine | whiten | whiten+refine |
|---|---|---|---|---|---|
| 16 | 3.67× | 0.0620 | 0.0655 | 0.0656 | 0.0556 |
| 32 | 3.59× | 0.0586 | 0.0574 | 0.0545 | 0.0476 |
| 64 | 3.43× | 0.0487 | 0.0446 ← grid best | 0.0588 | 0.0451 |

Best per rank: r16 whiten+refine 0.0556; r32 whiten+refine 0.0476; r64 plain+refine 0.0446 (whiten+refine 0.0451 ≈ tie). Overall best = r64 plain+refine 0.0446 @ 3.43×.

Full per-cell metrics (all 12, real measured)

eval-loss = held-out velocity-matching loss · vel-relerr = velocity field rel-L2 vs teacher · wrecon = mean weight-recon rel-err (100 layers) · orecon = mean output-recon rel-err (what whitening optimizes; only computed for the whitened cells).

| rank | variant | smaller | eval-loss | vel-relerr | wrecon-relerr | orecon-relerr |
|---|---|---|---|---|---|---|
| 16 | plain | 3.67× | 0.0620 | 0.2235 | 0.1269 | |
| 16 | plain+refine | 3.67× | 0.0655 | 0.2297 | 0.1251 | |
| 16 | whiten | 3.67× | 0.0656 | 0.2299 | 0.1290 | 0.0733 |
| 16 | whiten+refine | 3.67× | 0.0556 | 0.2117 | 0.1273 | 0.0680 |
| 32 | plain | 3.59× | 0.0586 | 0.2174 | 0.1224 | |
| 32 | plain+refine | 3.59× | 0.0574 | 0.2151 | 0.1193 | |
| 32 | whiten | 3.59× | 0.0545 | 0.2095 | 0.1257 | 0.0719 |
| 32 | whiten+refine | 3.59× | 0.0476 | 0.1959 | 0.1226 | 0.0646 |
| 64 | plain | 3.43× | 0.0487 | 0.1980 | 0.1163 | |
| 64 | plain+refine | 3.43× | 0.0446 | 0.1896 | 0.1110 | |
| 64 | whiten | 3.43× | 0.0588 | 0.2177 | 0.1209 | 0.0695 |
| 64 | whiten+refine | 3.43× | 0.0451 | 0.1907 | 0.1155 | 0.0595 |

All three metrics track together (lower wrecon/orecon ↔ lower eval-loss) within a rank, with the notable exception that plain+refine attains the lowest wrecon at each rank yet only the lowest eval-loss at r64 — confirming weight-recon ≠ the eval objective (refine minimizes weight error, which only aligns with the velocity loss once rank is high enough). Build/eval logs: tmp/abl_r*_*.{build,eval}.log.

Findings (these OVERTURN the prior L-shape conclusion that "each upgrade compounds"):

  1. Refine is the reliable workhorse — it helps at every rank/metric EXCEPT r16-plain (0.0620→0.0655, where minimizing weight error overfits and drifts from the output-optimal point). Every rank's best variant uses refine.
  2. Whitening ALONE is unreliable at 300-calib — non-monotonic in rank: hurts r16 (0.0656>0.0620), helps r32 (0.0545<0.0586), hurts r64 (0.0588>0.0487). It's overfitting the noisy 300-image Gram; at high rank it fits more directions to bad stats and generalizes worse than plain.
  3. Strong whiten×refine interaction — refine runs IN the whitened metric, correcting whitening's overfitting. At r16 neither upgrade alone helps yet together −10% (0.0620→0.0556). At r32 they stack to the row best (0.0476). At r64 whitening adds nothing over plain+refine (0.0451 vs 0.0446).
  4. At high rank, skip whitening — r64 plain+refine (0.0446) beats/ties everything and needs no Gram (simpler + faster build). Whitening only earns its keep at moderate rank (r32) or paired w/ refine.
  5. Sweet spot is a choice: r64 plain+refine 0.0446 @ 3.43× (max quality, simplest) vs r32 whiten+refine 0.0476 @ 3.59× (a bit more compression). Both ~4–5× below the surgery frontier (0.231).

These 300-calib numbers land close to the prior 100-calib report (r32 wr 0.0476 vs 0.0494; r64 wr 0.0451 vs 0.0454) but the per-knob story is different — whitening's instability is the headline. Montages for all 12 cells (8 probe prompts each) under outputs/abl_c300_*/eval/.

OPEN — whitening needs a higher-calib re-test. Its non-monotonic, often-harmful behavior at 300 calib is consistent with Gram under-estimation. The deferred 2000-image calib re-sweep (plan.md §5) is a follow-up: does whitening become reliably beneficial with richer activation statistics? Full methodology + math (whitening, Cholesky→eigh, refinement) in report/QUANT_REPORT.md.
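
For readers without report/QUANT_REPORT.md at hand, the core of the whitening step can be sketched in a few lines. Assumptions, all ours: the layer computes y = x @ W with W of shape (in, out), the Gram matrix is factored by Cholesky with a small diagonal damping term, and the function names are hypothetical. Since ‖X M‖_F = ‖Cᵀ M‖_F when XᵀX = C Cᵀ, the best rank-r fit in output error is the truncated SVD of Cᵀ W mapped back through C⁻ᵀ.

```python
import numpy as np

def truncate(m, rank):
    """Best rank-`rank` approximation of m via truncated SVD."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def whitened_low_rank(w, x, rank, damp=1e-6):
    """Rank-r L minimizing ||X (W - L)||_F for calibration activations X."""
    gram = x.T @ x
    n = gram.shape[0]
    gram = gram + damp * np.trace(gram) / n * np.eye(n)  # keep Cholesky stable
    c = np.linalg.cholesky(gram)                         # gram = c @ c.T
    # Truncate in the whitened space, then map back: L = c^-T trunc(c^T W)
    return np.linalg.solve(c.T, truncate(c.T @ w, rank))

def output_err(w, low, x):
    """Relative output error ||X (W - L)|| / ||X W||."""
    return np.linalg.norm(x @ (w - low)) / np.linalg.norm(x @ w)

rng = np.random.default_rng(0)
# Toy calibration set with strongly anisotropic channels (ranges 1x..100x).
x = rng.standard_normal((512, 32)) * np.logspace(0, 2, 32)
w = rng.standard_normal((32, 16))

err_plain_out = output_err(w, truncate(w, 4), x)
err_whiten = output_err(w, whitened_low_rank(w, x, 4), x)
print(f"plain SVD {err_plain_out:.4f}  whitened {err_whiten:.4f}")
```

On the calibration activations themselves the whitened fit is optimal by construction. The instability reported above concerns how well a Gram estimated from 300 images transfers to held-out data, which this toy does not model.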

Mechanism ablation — what each piece buys (2026-06-01) · ⚠️ SmoothQuant HURTS at W4A8

Stripping the pipeline down to isolate each mechanism (300-calib, same eval). smaller ≈ 3.76× for the rank-0 rows (no low-rank bytes).

| config | eval-loss | vel-relerr | wrecon mean / max | note |
|---|---|---|---|---|
| RTN W4A8 (no smooth, no SVD, s=1) | 0.0573 | 0.2149 | 0.1112 / 0.1504 | naive floor, yet beats the next two |
| SmoothQuant W4A8 (rank-0, α=0.5, no SVD) | 0.0729 | 0.2424 | 0.1356 / 0.2633 | smoothing makes it WORSE |
| + SVD rank-16 plain (α=0.5) | 0.0620 | 0.2235 | 0.1269 / 0.198 | grid plain r16 |
| + SVD rank-64 plain | 0.0487 | 0.1980 | 0.1163 / 0.155 | grid plain r64 |
| grid best: r64 plain+refine (α=0.5) | 0.0446 | 0.1896 | 0.1110 / n/a | best so far |

Headline finding: SmoothQuant at α=0.5 is actively harmful for W4A8. Removing it (RTN, s=1) beats SmoothQuant rank-0 by −21% (0.0729→0.0573) AND beats the smoothed plain and plain+refine cells at r16/r32 (0.0620/0.0655 and 0.0586/0.0574); some whitened cells (e.g. r32 whiten+refine 0.0476) still go lower. Mechanism: SmoothQuant migrates outliers out of activations into weights, a win only when activations are the hard part (low-bit A4). At W4A8 the 8-bit activations are already easy, so migration buys nothing there and widens the weight distribution, making the 4-bit weight quant harder (worst-layer wrecon 0.15→0.26). Implication: the entire α=0.5 grid is mis-tuned, and the SVD branch was partly compensating for smoothing damage. Re-running with no-smooth / low-α should beat 0.0446. Next: run the best config with SMOOTH=0, then an α sweep {0, 0.25, 0.5}. Knob: SMOOTH=0 env on scripts/12 (s=1). Montages: outputs/abl_c300_r0_nosvd{,_nosmooth}/eval/.

SHELVED TRACK — block surgery (depth-prune single blocks → surrogates → distill)

| config | params | smaller | wall | flop | eval-loss | status |
|---|---|---|---|---|---|---|
| teacher 4B (baseline) | 3.876B | | 1.00x | 1.00x | | reference |
| v1 per-token drop-12 (SVD-energy) | 2.441B | 37% | 1.45x | 1.64x | | COLLAPSED (non-functional) |
| per-token drop-6 (importance) | 3.158B | 19% | 1.19x | 1.24x | 0.308 | ok, soft |
| linattn drop-6 (simple elu+1) | 3.177B | 18% | 1.15x | 1.23x | 0.253 | ok |
| linattn drop-6 +RoPE+conv+warmstart | 3.177B | 18% | 1.15x | 1.23x | 0.231 | BEST QUALITY |
| linattn drop-8 +focused+FFN(all) | 2.995B | 23% | 1.20x | 1.28x | 0.269 | best colors/local |
| linattn drop-10 mixed (4 FFN+6 light) | 2.737B | 29% | 1.26x | 1.42x | 0.322 | KILLED ~step200 |
