Results — FLUX.2 klein 4B -> compressed student
Eval = held-out velocity-matching loss vs teacher (lower=closer; same fixed first-16 batch across all rows, so quant and surgery sit on ONE axis). wall=measured s/img @512/4-step batch-1 on A100; flop=estimated transformer-only ratio.
★★ 2026-06-14 — NVFP4 HEAD-TO-HEAD (image-space metrics, N=512) — separate axis
A matched, paired comparison on N=512 MJHQ-30k prompts, 512px, 4 steps, guidance 1.0, seed=idx,
using image-space fidelity-to-teacher metrics (LPIPS/PSNR/FID vs the teacher's own outputs) + a
PickScore/CLIP semantic check. This is NOT the velocity-loss axis above — do not compare the
numbers across sections; compare only within this table. All models share the same Qwen3 TE + VAE
(only the transformer quant varies). Full write-up: report/HEADTOHEAD_klein4b_nvfp4.md; raw numbers
outputs/eval/h2h/metrics.json; speed outputs/nvfp4/benchmark_headtohead.json.
| model | bits | PickScore↑ | CLIP↑ | LPIPS↓ | PSNR↑ | FID↓(teacher) | FID(real) | real-kernel speed |
|---|---|---|---|---|---|---|---|---|
| A teacher (bf16) | 16 | 21.64 | 30.95 | — | — | — | 89.6 | 1.0× (0.464 s/img @512) |
| D plain NVFP4 r0 | W4A4 | 21.62 | 30.95 | 0.2076 | 17.44 | 39.31 | 88.5 | — (fake-q) |
| ours r128 (fake-q) | W4A4 | 21.61 | 30.99 | 0.1732 | 18.50 | 33.37 | 89.6 | — (fake-q) |
| C ours r128 (REAL Nunchaku kernel) | W4A4 | 21.62 | 30.94 | 0.1668 | 18.71 | 33.54 | 90.1 | 1.76×@512 / 1.87×@1024, 12.6 GB |
| E BFL official FP8 | W8A8 | 21.65 | 30.94 | 0.0798 | 23.02 | 18.81 | 89.6 | — (needs TensorRT) |
| Δ low-rank branch (r0→r128) | W4A4 | ≈0 | ≈0 | −19.7% | +1.27 dB | −14.7% | ~flat | — |
Findings. (1) The SVDQuant low-rank branch helps at NVFP4 W4A4 — r0→r128: LPIPS −19.7%,
PSNR +1.27 dB, FID-vs-teacher −14.7% (fake-q-vs-fake-q ablation agrees: −16.6% / +1.06 dB / −15.1%);
reproduces the prior N=256 result at N=512. (2) The real Nunchaku FP4 kernel reproduces (slightly
beats) the fake-quant (LPIPS 0.167 vs 0.173) → the gain holds on the deployed model. (3) No
semantic loss — PickScore/CLIP flat across all incl. teacher. (4) BFL official FP8 is closest to
teacher (8-bit; high-precision/low-speedup point) vs our 4-bit/2.5×-kernel point. (5) FID-vs-real is
flat (88–90 incl. teacher) — tracks the klein-vs-MJHQ style gap, not quant; reported, not
discriminating. (6) BFL official NVFP4 (model B) could not be run — cutlass tensor-core swizzled
layout, needs BFL's TensorRT runtime (see report/HEADTOHEAD_klein4b_nvfp4.md §5); D (our plain r0) is
the labeled controlled "plain NVFP4" stand-in, E is the real BFL baseline. No BFL number was fabricated.
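To make the PSNR column easier to read, here is a minimal sketch of the metric and of what a dB gain means in MSE terms. It assumes 8-bit pixel values passed as flat lists; the helper names `psnr` and `mse_ratio_from_db_gain` are illustrative and are not taken from the evaluation code behind the table.

```python
import math

def psnr(img_a, img_b, max_val=255.0):
    """PSNR in dB between two equally sized images given as flat pixel lists."""
    if len(img_a) != len(img_b):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_ratio_from_db_gain(db_gain):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-db_gain / 10.0)

# Toy 4-pixel example (hypothetical values, not real images).
teacher = [10, 200, 35, 90]
student = [12, 198, 35, 87]
print(f"PSNR = {psnr(teacher, student):.2f} dB")

# The +1.27 dB from the delta row, expressed as an MSE ratio.
print(f"MSE ratio for +1.27 dB = {mse_ratio_from_db_gain(1.27):.3f}")
```

Read this way, the +1.27 dB of the low-rank branch corresponds to roughly a quarter less pixel-space MSE against the teacher's outputs.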
★ 2026-06-13 — NVFP4 (Blackwell-native FP4) + first REAL kernel speed (Nunchaku)
Two things landed this day: (1) NVFP4 added to our fake-quant SVDQuant (flux2distill/svdquant.py)
and swept; (2) the first real low-bit kernel speed numbers, by calling Nunchaku's compiled
NVFP4 W4A4 GEMM directly. NVFP4 beats INT4 on BOTH quality and speed — it's the format for this
box (and any Blackwell / RTX 50 / B200). Same eval axis as the 2026-06-10 cells below.
NVFP4 format: E2M1 elements (the 8 magnitudes {0,.5,1,1.5,2,3,4,6}·sign) + group-16 blocks +
FP8(E4M3) block scales, applied to the 4-bit residual weights (and optionally activations). The
low-rank branch stays bf16 — it is the high-precision error-/outlier-absorbing path. Knobs:
WFMT={int,nvfp4}, AFMT={int,nvfp4,fp8} on scripts/12 (driver scripts/run_nvfp4_cell.sh).
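The format description above can be sketched as a small fake-quant routine. This is an illustrative stand-in, not the code in flux2distill/svdquant.py: the function name `fake_quant_nvfp4` is hypothetical, the per-group scale is kept as a Python float instead of being rounded to FP8 (E4M3), and ties round toward the smaller magnitude rather than following hardware rounding.

```python
import math

# The 8 E2M1 magnitudes listed in the text; the sign is a separate bit.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quant_nvfp4(values, group=16):
    """Round each group of `group` values to scale * (signed E2M1 grid point)."""
    out = []
    for start in range(0, len(values), group):
        block = values[start:start + group]
        absmax = max(abs(v) for v in block)
        # Map the block's largest magnitude onto the top grid point (6.0).
        # Simplification: the scale stays a float (real NVFP4 stores FP8 E4M3).
        scale = absmax / E2M1_GRID[-1] if absmax > 0 else 1.0
        for v in block:
            target = abs(v) / scale
            mag = min(E2M1_GRID, key=lambda g: abs(target - g))
            out.append(math.copysign(mag * scale, v))
    return out

def rel_err(ref, approx):
    """Relative L2 error of `approx` against `ref`."""
    num = math.sqrt(sum((r - a) ** 2 for r, a in zip(ref, approx)))
    return num / math.sqrt(sum(r * r for r in ref))

# Synthetic weights with one outlier: finer groups confine the damage.
weights = [0.05 * math.sin(i) for i in range(64)]
weights[3] = 2.0
err_g16 = rel_err(weights, fake_quant_nvfp4(weights, group=16))
err_g64 = rel_err(weights, fake_quant_nvfp4(weights, group=64))
print(f"group-16 rel-err {err_g16:.4f}  vs  group-64 rel-err {err_g64:.4f}")
```

With a single scale over all 64 values the outlier pushes every small weight to zero; with group-16 only the outlier's own block pays that cost.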
Quality — NVFP4 sweep (klein-4B, plain+refine, no-smooth, same held-out velocity loss)
| # | weights | acts | rank | wrecon mean | eval-loss | vs reference |
|---|---|---|---|---|---|---|
| 1 | NVFP4 g16 | NVFP4 g16 | 32 | 0.0865 | 0.0390 | — |
| 2 | NVFP4 g16 | NVFP4 g16 | 64 | 0.0817 | 0.0364 | INT4 W4A4 r64 = 0.0742 (2.0× worse) |
| 3 | NVFP4 g16 | NVFP4 g16 | 128 | 0.0742 | 0.0303 | INT4 W4A4 r128 = 0.0610 (2.0×); ≈ INT4 W4A8 champ 0.0297 |
| 4 | NVFP4 g16 | FP8 E4M3 | 64 | 0.0817 | 0.0204 | INT4 W4A8 r64 = 0.0297 (prior overall best) |
| 5 | NVFP4 g16 | FP8 E4M3 | 128 | 0.0742 | 0.0169 ★ | −43% vs the 0.0297 INT4 W4A8 champion |
NEW OVERALL QUANT CHAMPION: NVFP4-weights + FP8-acts, r128 = 0.0169. Dirs outputs/nvfp4_*.
Findings: (1) NVFP4 weights ≫ INT4 weights — ~2× lower loss at matched rank; driver is the finer group-16 + E2M1 float grid (unit test: outlier column 0.064 vs INT4-g64 0.115). (2) NVFP4 W4A4 r128 (0.0303) matches the INT4 W4A8 champion (0.0297) while keeping activations at 4-bit — full W4A4 at 8-bit-act quality. (3) FP8 acts buy more quality (0.0169) but cost speed (below). Visually the champions are teacher-indistinguishable on the text + hand probes (the quant-sensitive ones).
Speed — REAL kernels, klein-4B layer shapes, T=1536, RTX PRO 4500 Blackwell (sm_120)
scripts/23_nvfp4_kernel_bench.py calls Nunchaku's compiled svdq_gemm_w4a4_cuda (NVFP4 W4A4) at
each of klein-4B's 5 distinct Linear shapes, summed over all 100 Linears. FP8 row = torch._scaled_mm
(NOT Nunchaku — a proxy for the W4+FP8-act variant, which has no fused kernel on Blackwell).
| path | ms/step | speedup vs bf16 | deploys |
|---|---|---|---|
| bf16 | 73.8 | 1.00× | baseline |
| NVFP4 W4A4 r64 (real Nunchaku kernel) | 26.8 | 2.75× | cell 2 (0.0364) |
| NVFP4 W4A4 r128 (real Nunchaku kernel) | 29.7 | 2.49× | cell 3 (0.0303) |
| FP8 proxy r64 | 60.3 | 1.22× | cell 4 (0.0204) |
| FP8 proxy r128 | 60.8 | 1.21× | cell 5 (0.0169) |
Matches the 9B end-to-end Nunchaku number (FP4 254 ms/step = 2.69× vs bf16 684; full pipeline 1.29 s/img @1024²/4-step, 24.95 GB). Rank tax (real W4A4 kernel): r64→r128 ≈ 11% (2.75×→2.49×) for the 0.0364→0.0303 quality gain — rank is a quality knob with a small, real speed cost.
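For readers who want to re-derive the speedup column and the rank tax, a short sketch follows. It only uses the rounded ms/step values printed in the table, so results can differ from the table in the last digit (the table was presumably computed from unrounded timings); the dictionary keys are illustrative labels.

```python
BF16_MS = 73.8  # bf16 baseline, ms/step, from the table

ms_per_step = {
    "NVFP4 W4A4 r64": 26.8,
    "NVFP4 W4A4 r128": 29.7,
    "FP8 proxy r64": 60.3,
    "FP8 proxy r128": 60.8,
}

# Speedup is simply baseline time divided by the path's time.
speedups = {name: BF16_MS / ms for name, ms in ms_per_step.items()}
for name, s in speedups.items():
    print(f"{name:18s} {s:.2f}x")

# Rank tax expressed as extra time per step when going from r64 to r128.
rank_tax = ms_per_step["NVFP4 W4A4 r128"] / ms_per_step["NVFP4 W4A4 r64"] - 1.0
print(f"rank tax r64 -> r128: {rank_tax:.1%} more time per step")
```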
End-to-end (real Nunchaku kernel, klein-4B Linears swapped to NVFP4 W4A4, rest bf16): bf16 →
NVFP4 is 1.24× @512px / 1.18× @1024px with ~28% less VRAM (16.9→12.0 GB @512). Far below the
2.5× per-layer GEMM because attention is bf16 + O(N²) and dominates (more so at 1024px), VAE +
text-encode are fixed bf16 overhead, and a Linears-only swap is unfused. The 9B Nunchaku FULL pipeline
hit 2.69× precisely because it ALSO fuses attention/quant and is more GEMM-heavy — so the lever for a
real 4B speedup is the fully-fused NunchakuFlux2 model (fused attention), not just quantized Linears.
Rank tax is negligible end-to-end here (low-rank branch tiny vs the rest). scripts/24_nunchaku_e2e_speed.py.
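The gap between the per-layer GEMM speedup and the Linears-only end-to-end speedup can be read through Amdahl's law. The sketch below is a back-of-envelope model under a strong assumption: only the Linear GEMMs get faster, by exactly 2.5×, and everything else is unchanged. The fractions it prints are implied by that model, not measured profiles.

```python
def amdahl_speedup(accel_fraction, accel_factor):
    """Overall speedup when `accel_fraction` of the runtime gets `accel_factor` faster."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_factor)

def implied_fraction(overall_speedup, accel_factor):
    """Invert Amdahl's law: runtime share that must have been accelerated."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / accel_factor)

GEMM_SPEEDUP = 2.5  # per-layer figure quoted in the text

for label, observed in [("512px", 1.24), ("1024px", 1.18)]:
    share = implied_fraction(observed, GEMM_SPEEDUP)
    print(f"{label}: observed {observed:.2f}x implies ~{share:.0%} of time in accelerated GEMMs")
```

Under this model the accelerated share shrinks at 1024px, which is consistent with the text's point that bf16 attention dominates more at higher resolution.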
★ FULLY-FUSED end-to-end (the real lever): converting klein-4B to the NunchakuFlux2Transformer2DModel
fused path (fused_qkv_norm_rottary + attention_fp16 + W4A4 GEMM, no bf16 SDPA) gives 1.76× @512px
and 1.87× @1024px (per-step 2.2×/2.29×, VRAM −26%: 18.6→13.7 GB @1024). This is the real number —
fusing attention lifts the Linears-only 1.2× to 1.9×, and unlike the Linears-only swap it improves
with resolution. The transformer step (2.3×) ≈ the per-layer GEMM (2.5×); end-to-end is capped by the
4B's VAE+text-encode (30% of its small total). scripts/25_fused_4b_speed.py (dummy weights — speed is
value-independent; built from source via /workspace/build_nunchaku). Deployable single artifact (fast + correct weights) — ✅ DONE.
★ DEPLOYABLE NVFP4-fused klein-4B (correct weights + real kernel). Wrote our own NVFP4 weight
exporter (flux2distill/nunchaku_export.py): per-Linear SVD low-rank + iterative refine → NVFP4 residual (E2M1 codes + per-group-16 FP8 wscales + per-tensor alpha, wcscales=1), packed into
Nunchaku's MMA layout via their pack_weight/pack_micro_scale/pack_lowrank_weight. Convention
validated (scripts/26): with pre-quantized acts the real kernel reconstructs our intended weight to
2.99%. scripts/27_convert_full_4b.py converts all 120 fused Linears (handling the qkv fusion +
the single-block to_qkv_mlp_proj/to_out splits), loads them into NunchakuFlux2Transformer2DModel,
and generates correct images on the real FP4 kernel — teacher-indistinguishable on the text
("THE OPEN PAGE" legible) and hand probes (montages outputs/nvfp4/deploy/). This is the full
deployable model: NVFP4 quality (≈0.0303) + the fused ~1.9× speed + −26% VRAM in one artifact.
(NB the fused packed-rotary path requires batch=1 generation.)
The quality↔speed fork (both deployable on Blackwell, pick one):
- Speed: NVFP4 W4A4 r128 → 0.0303 @ 2.49× (real Nunchaku kernel, loads today). r64 → 0.0364 @ 2.75×.
- Quality: NVFP4-W + FP8-A r128 → 0.0169, but FP8 compute caps at ~2× bf16 (FP8 tensor cores are half FP4 throughput) and has no fused kernel (1.2× measurable today). The tradeoff is physical.
Why INT4 is the wrong format on Blackwell (hardware, sourced): the 5th-gen tensor cores natively do
FP4/FP6/FP8/INT8/BF16/… but dropped INT4 (Turing/Ampere/Ada had INT4 IMMA; sm_120 doesn't). Nunchaku
ships INT4 for Turing/Ampere/Ada and NVFP4 for Blackwell; get_precision() returns fp4 here.
Forcing svdq-int4 on this card → 1677 ms/step (slower than bf16). So INT4 W4A4 stays the deployable
format for the huge RTX 20/30/40 base; NVFP4 is for Blackwell/B200 — complementary, one per generation.
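The one-format-per-generation rule can be written down as a small lookup. This is a hypothetical helper, not Nunchaku's `get_precision()`: the function name and the compute-capability list are assumptions for illustration, and the list is not exhaustive.

```python
def pick_weight_format(major, minor):
    """Map a CUDA compute capability to the 4-bit format described in the text."""
    int4_archs = {
        (7, 5),          # Turing
        (8, 0), (8, 6),  # Ampere
        (8, 9),          # Ada
    }
    if (major, minor) in int4_archs:
        return "int4"
    if major in (10, 12):  # Blackwell, e.g. sm_100 (B200) and sm_120 (RTX 50 / RTX PRO)
        return "nvfp4"
    # Anything else is outside what the text covers, so refuse to guess.
    raise ValueError(f"no 4-bit format mapped for sm_{major}{minor}")

print(pick_weight_format(12, 0))  # the sm_120 box used for the speed numbers
print(pick_weight_format(8, 9))   # an Ada card
```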
⚠️ 2026-06-10 — box rebuilt AGAIN; eval axis SHIFTED; old numbers not comparable
The box was rebuilt (A100-80GB → RTX PRO 4500 Blackwell 32 GB, system python, torch
2.12+cu130, transformers 5.9→5.10.2, diffusers git Jun-1→Jun-10). Re-evaluating the UNCHANGED
grid-best checkpoint (r64 plain+refine, zero missing/unexpected keys on load) gives
0.0325 (vel-relerr 0.1661) vs the recorded 0.0446 (0.1896) — a −27% instrument shift,
NOT a model change. Cause: the 16-sample eval ran through different SDPA kernels, different
Qwen3 prompt embeddings (transformers bump), and possibly different seeded σ/noise draws; with
N=16 that easily moves the mean. Rule: numbers are only comparable WITHIN one box era. The
4×3 grid + mechanism-ablation tables below are the OLD axis (still internally consistent);
every 2026-06-10+ experiment is compared against same-box re-evals. Re-anchored baseline:
r64 plain+refine (α=0.5, smoothed) = 0.0325 (outputs/recheck_r64_plain_refine, montages
in its eval/).
Montage read of that baseline on the new box (teacher|quant, 8 probes): texture/scene/large-text probes (storefront, lake, neon, spiderweb, fisherman) ≈ indistinguishable; the quant misses the counting probe (2 fried eggs vs the teacher's correct 3) and mangles the hand probe (folded/merged fingers); chalkboard small-print is gibberish in BOTH (teacher limitation). Counting + hand are the sensitive probes to watch across the SMOOTH=0 ablation.
✅ SMOOTH=0 ablation (2026-06-10, new box) — CONFIRMED: drop SmoothQuant at W4A8
The #1 queued experiment, run at 3 ranks (plain+refine, 300-calib, all new-axis; each SMOOTH=0
build compared against a same-box re-eval of its smoothed α=0.5 twin). Dirs:
outputs/abl_c300_r{16,32,64}_plain_refine_nosmooth vs outputs/recheck_r{16,32,64}_plain_refine.
| rank | smoothed α=0.5 (re-eval) | SMOOTH=0 | Δ eval-loss | wrecon mean (sm → ns) | ns wrecon max |
|---|---|---|---|---|---|
| 16 | 0.0405 / rel 0.1855 | 0.0348 / 0.1719 | −14.1% | 0.1251 → 0.1041 | 0.1379 |
| 32 | 0.0362 / rel 0.1753 | 0.0331 / 0.1675 | −8.6% | 0.1193 → 0.1003 | 0.1286 |
| 64 | 0.0325 / rel 0.1661 | 0.0297 / 0.1588 | −8.6% | 0.1110 → 0.0944 | 0.1124 |
Findings:
- SMOOTH=0 wins at every rank; new overall best: r64 plain+refine no-smooth = 0.0297. SMOOTH=0 is now the DEFAULT for all W4A8 builds.
- The win grows as rank shrinks (−8.6% at r64/r32 → −14.1% at r16): the SVD branch was partly compensating smoothing damage, and with less capacity more damage shows through. Confirms the mechanism from the rank-0 ablation (smoothing widens the 4-bit weight spread).
- No-smooth buys ~one rank tier for free: ns-r16 (0.0348) beats smoothed-r32 (0.0362); ns-r32 (0.0331) ≈ smoothed-r64 (0.0325). I.e. same quality at ~4% more compression.
- Visual (8-probe montages): the hand probe — the most quant-sensitive — is visibly FIXED vs the smoothed baseline (plausible spread fingers at all 3 ranks vs mangled/merged at the smoothed r64 baseline). The counting probe (3 eggs) is still missed by EVERY quant (2 eggs), smoothed or not, all ranks — a rank/smooth-independent semantic drift. Large in-image text is always preserved.
- wrecon improves ~15% at every rank and the worst layer drops to 0.11–0.14 (vs 0.15–0.26 smoothed) — exactly the predicted mechanism.
✅ α sweep (2026-06-10) — the dial has NO good setting; low α is the WORST
α ∈ {off, 0.1, 0.25, 0.5} × r{32,64}, plain+refine, 300-calib, new axis
(dirs outputs/abl_c300_r{32,64}_plain_refine_a{10,25}; off/0.5 cells from above):
| α | r64 eval-loss | r32 eval-loss | r64 wrecon | r32 wrecon |
|---|---|---|---|---|
| off | 0.0297 | 0.0331 | 0.0944 | 0.1003 |
| 0.1 | 0.0380 ⚠ worst | 0.0408 ⚠ worst | 0.0999 | 0.1057 |
| 0.25 | 0.0317 | 0.0349 | 0.1007 | 0.1069 |
| 0.5 | 0.0325 | 0.0362 | 0.1110 | 0.1193 |
Identical ordering at both ranks (off < 0.25 < 0.5 ≪ 0.1) — a replicated U-shape:
- No α beats off: every dose of migration hurts at W4A8; SMOOTH=0 is the permanent default.
- α=0.1 is the worst point, not the safest: at low α the factor ≈ 1/max|W|^(1-α) (the weight-equalizing extreme). wrecon barely moves (0.0999 vs 0.1007 at r64) yet eval-loss jumps +20% (0.0317 → 0.0380, relative to α=0.25); the rescale rotates residual quant error into output-relevant directions. The sharpest demonstration yet that weight-recon ≠ the eval objective; never tune α by wrecon.
- Montages agree: the hand probe (cleanly rendered at SMOOTH=0) regresses to merged fingers at α=0.1. Counting (3 eggs) fails everywhere, α-independent.
- Determinism note: a repeat eval of an unchanged cell reproduces its loss exactly (0.0331 → 0.0331), and rebuilt cells reproduce wrecon bit-for-bit — cell deltas on this box are real, not run-to-run noise.
The old-axis grid below retains its per-knob story (refine reliable, whitening unstable at 300-calib) but all its α=0.5 absolute numbers are superseded by no-smooth.
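The low-α observation above is easier to see with the migration factor written out. The sketch assumes the standard SmoothQuant per-channel factor s_j = max|X_j|^α / max|W_j|^(1−α), with activations divided by s and weights multiplied by s; whether scripts/12 uses exactly this form is an assumption, and the per-channel maxima are made-up numbers.

```python
def smooth_scales(act_absmax, weight_absmax, alpha):
    """Per-channel SmoothQuant factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    return [
        (a ** alpha) / (w ** (1.0 - alpha))
        for a, w in zip(act_absmax, weight_absmax)
    ]

# Hypothetical per-input-channel maxima; channel 2 carries an activation outlier.
act_absmax = [8.0, 0.5, 60.0]
weight_absmax = [0.2, 0.4, 0.1]

for alpha in (0.0, 0.1, 0.5, 1.0):
    s = smooth_scales(act_absmax, weight_absmax, alpha)
    new_act = [a / sj for a, sj in zip(act_absmax, s)]       # X / s
    new_w = [w * sj for w, sj in zip(weight_absmax, s)]      # W * s
    print(f"alpha={alpha:<4} act max {[round(x, 3) for x in new_act]} "
          f"weight max {[round(x, 3) for x in new_w]}")
```

At α=0 every weight channel is rescaled to the same maximum (the weight-equalizing extreme), and at α=1 every activation channel is; small α therefore still rescales the weights aggressively even though it migrates almost nothing out of the activations.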
W4A4 ablation (2026-06-10, 3 cells) — the smoothing flip CONFIRMED; naive A4 not viable yet
All plain+refine, 300-calib, new axis. ABITS=4 (per-token dynamic 4-bit acts — same scheme as
A8, just 16 levels). Dirs outputs/abl_c300_r{64,128}_w4a4_plain_refine_{nosmooth,a50}.
| cell | eval-loss | vel-relerr | wrecon mean | vs W4A8 twin |
|---|---|---|---|---|
| r64 SMOOTH=0 (A8 recipe) | 0.5103 | 0.6582 | 0.0944 | 0.0297 → 17× worse |
| r64 α=0.5 | 0.3885 | 0.5743 | 0.1111 | smoothing helps +24% |
| r128 α=0.5 | 0.3060 | 0.5097 | 0.0992 | rank helps +21% |
Findings:
- The smoothing flip is real and symmetric. At A8 smoothing raised the loss by ~9% (0.0297 → 0.0325 at r64); at A4 it lowers it by 24% (0.5103 → 0.3885). Its value is purely a function of activation bit-width: at A4 the per-token 4-bit quant is destroyed by channel outliers (one outlier forces the whole token's scale → small channels round to 0), which is exactly what migration mitigates. Mechanism fully closed.
- Rank matters more at A4 than A8 (r64→r128: −21%): the low-rank branch runs on FULL-precision activations, so extra rank routes more computation around the 4-bit bottleneck; at A4 rank is an activation shield, not just a weight-outlier absorber.
- Naive W4A4 is NOT viable: best cell 0.3060 is ~10× the W4A8 best (0.0297). Visually: no-smooth A4 shatters in-image text ("THE OPEN PAGE" → "PINE OPEEN I AAGE"); α=0.5 restores readable text; r128 renders the storefront cleanly — but composition/anatomy stay broken (hand probe: two wrong-gesture hands). (No cross-axis comparison to the old surgery numbers.)
- Next lever (queued): per-group activation quant — give each group of ~64 channels its own dynamic scale (the weight side already does this; Atom/QServe-style). Attacks the outlier problem per-token/per-group with zero weight-side cost — expected to beat the whole α dial.
W4A4 α-up sweep (2026-06-10, 3 more cells) — A4 has an INTERIOR optimum at α≈0.75
plain+refine, 300-calib, new axis. Dirs outputs/abl_c300_r{64,128}_w4a4_plain_refine_a{75,100}.
| α | r64 eval-loss | r128 eval-loss | wrecon mean/max (r64 · r128) |
|---|---|---|---|
| off | 0.5103 | — | 0.0944/0.1124 · — |
| 0.5 | 0.3885 | 0.3060 | 0.1111/0.1486 · 0.0992/0.1344 |
| 0.75 | 0.2819 | 0.2080 ★ W4A4 best | 0.1345/0.2585 · 0.1168/0.1960 |
| 1.0 | — | 0.2397 | — · 0.1469/0.3423 |
Findings:
- Optimal α ≈ 0.75 at (per-token) A4: the curve descends to 0.75 then turns at 1.0; full flattening wrecks the weights (worst-layer 0.34) faster than it relieves the activations, even with r128+refine absorbing. Campaign symmetry: the optimal α tracks the bottleneck (A8 → off; per-token A4 → 0.75; nothing in between wins at either).
- W4A4 improved 0.5103 → 0.2080 (−59%) via smoothing+rank alone — still ~7× the W4A8 champion (0.0297). Visual at the best cell: in-image text nearly clean (one corrupted glyph), coherent compositions return, but counting/gesture still wrong. Not deployable yet.
- ⚠ Paper-spec correction (from re-reading SVDQuant): the paper's W4A4 uses per-GROUP activations (group 64, like its weights; NVFP4 = group 16) — our per-token activations reproduce their baselines (ViDiT-Q/MixDQ, which fail catastrophically exactly like our cells). So per-group acts isn't an enhancement, it's the missing piece of the actual SVDQuant W4A4 recipe → implemented below; step change confirmed.
✅ W4A4 per-group activations (2026-06-10) — the fix; W4A4 becomes viable
Implemented a_group (per-group dynamic act scales along channels; AGROUP env on scripts/12,
recorded in quant_config.json). Unit test: one 60× outlier channel → per-token A4 rel-err 0.59,
g64 0.11 (5×). 2×2 grid {r64,r128} × {SMOOTH=0, α=0.5}, AGROUP=64, plain+refine, 300-calib.
Dirs outputs/abl_c300_r{64,128}_w4a4g64_{nosmooth,a50}.
| W4A4 g64 acts | SMOOTH=0 | α=0.5 | (per-token, for scale) |
|---|---|---|---|
| r64 | 0.0742 | 0.0759 | 0.5103 (ns) / 0.3885 (a50) |
| r128 | 0.0610 ★ W4A4 best | 0.0620 | — / 0.3060 (a50) |
Findings:
- Per-group acts (g64) is THE W4A4 fix: −85% (0.5103 → 0.0742 at r64-ns) — far beyond everything the α/rank campaign bought combined. W4A4 best now 0.0610 (r128-ns), ~2× the W4A8 champion (0.0297) instead of 17×.
- With per-group acts, smoothing is dead weight again (a50 slightly worse at both ranks) — the outlier problem belongs to quantizer granularity, not weight-side migration, at every bit-width. The W4A4 recipe converges to the same clean form as W4A8: plain SVD + refine, NO smoothing, per-group W and A, rank to taste.
- This recipe is calibration-free (no smooth → absmax unused; no whiten → Gram unused; acts dynamic): calib size/content is irrelevant to the current champions; it only matters if whitening returns or for future QAT. (NB the jasperai/monetcalib set is NOT paintings: it's diverse photographic data, 260/400 captions mention text/signs, so content-narrowness was never a confound; "Monet" is just the dataset name.)
- Qualitative (24 montages reviewed): per-group fixes gross text destruction, anatomy collapse and composition smearing entirely; residual A4 damage = symbolic precision: single-glyph text errors ("PAGE"→"PAYE"/"PACE"), counting flicker (2 vs 3 eggs, seed-fragile, non-monotone in loss), slight identity drift. no-smooth cells are visually cleanest (hands track the metric).
- Act group-size ladder (unit test): per-token 0.60 → g128 0.17 → g64 0.13 → g32 0.10 → g16 0.077 — g16 (NVFP4's native group) is the queued next knob (sim-only for INT4; deployable as NVFP4 on Blackwell).
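The outlier mechanism behind the per-group fix can be reproduced in a few lines. This is a toy sketch in the spirit of the unit test described above, not the a_group implementation: the helper names, the 256-channel synthetic token, and the symmetric absmax/7 scaling are assumptions, so the printed errors will not match the 0.59 / 0.11 figures quoted in the text.

```python
import math
import random

def quant_int4_symmetric(block):
    """Symmetric 4-bit fake-quant with one dynamic absmax scale for the block."""
    absmax = max(abs(v) for v in block)
    if absmax == 0:
        return list(block)
    scale = absmax / 7.0  # integer levels -7..7
    return [max(-7, min(7, round(v / scale))) * scale for v in block]

def quant_token(token, group=None):
    """Quantize one token: per-token if group is None, else per-group along channels."""
    group = group or len(token)
    out = []
    for start in range(0, len(token), group):
        out.extend(quant_int4_symmetric(token[start:start + group]))
    return out

def rel_err(ref, approx):
    num = math.sqrt(sum((r - a) ** 2 for r, a in zip(ref, approx)))
    return num / math.sqrt(sum(r * r for r in ref))

# One synthetic token: unit-variance channels plus a single 60x outlier channel.
random.seed(0)
token = [random.gauss(0.0, 1.0) for _ in range(256)]
token[5] = 60.0

errs = {
    name: rel_err(token, quant_token(token, g))
    for name, g in [("per-token", None), ("g64", 64), ("g16", 16)]
}
for name, e in errs.items():
    print(f"{name:9s} rel-err {e:.3f}")
```

With one scale per token the outlier sets the step size and every ordinary channel rounds to zero; with per-group scales only the channels that share the outlier's group are lost.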
ACTIVE TRACK — W4A8 SVDQuant (fake-quant quality study; A100-era grid below)
Quantize all 100 block Linears: smooth → SVD rank-r low-rank (16-bit) → 4-bit residual +
8-bit per-token activations. smaller = quantized-weight bytes vs bf16 (a real low-bit
kernel realizes this; fake-quant here measures quality only — no wall/flop yet on A100).
Three composable knobs (all W4A8, α=0.5, group=64, same fixed eval batch):
- plain = SVD of smoothed weight (base SVDQuant paper's headline derivation)
- whiten = activation-aware SVD minimizing OUTPUT error ‖X̂(Ŵ−L)‖ (ASVD/SVD-LLM idea; our add)
- +refine = iterative low-rank refinement, re-fit L to absorb 4-bit quant error (SVDQuant §4.2)
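The plain and +refine knobs can be sketched on a toy matrix. This is a simplified stand-in under several assumptions: rank 1 via power iteration instead of a full rank-r SVD, one absmax scale per matrix instead of group-64, no smoothing or whitening, and a refinement loop that alternates "quantize the residual" with "re-fit the low-rank part to W minus the quantized residual" while keeping the best iterate. All function names are illustrative.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def fro(M):
    return math.sqrt(sum(x * x for row in M for x in row))

def rank1_approx(M, iters=50):
    """Leading singular direction by power iteration; returns sigma * u v^T."""
    Mt = [list(col) for col in zip(*M)]
    v = [1.0] * len(M[0])
    for _ in range(iters):
        u = matvec(M, v)
        nu = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / nu for x in u]
        v = matvec(Mt, u)
        sigma = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / sigma for x in v]
    return [[sigma * ui * vj for vj in v] for ui in u]

def quant4(M):
    """Symmetric 4-bit fake-quant with a single absmax scale (real builds use groups)."""
    absmax = max(abs(x) for row in M for x in row)
    if absmax == 0:
        return [row[:] for row in M]
    s = absmax / 7.0
    return [[round(x / s) * s for x in row] for row in M]

def svdquant_rank1(W, refine_steps=4):
    """Return (error, L, R) with W ~ L + R, L rank-1 (high precision), R 4-bit."""
    L = rank1_approx(W)
    R = quant4(sub(W, L))
    best = (fro(sub(sub(W, L), R)), L, R)
    for _ in range(refine_steps):
        L = rank1_approx(sub(W, R))  # re-fit L to absorb the quantization error
        R = quant4(sub(W, L))
        err = fro(sub(sub(W, L), R))
        if err < best[0]:
            best = (err, L, R)
    return best

# Toy weight: small entries plus one outlier column.
W = [[0.1 * math.sin(3 * i + 7 * j) for j in range(8)] for i in range(8)]
for i in range(8):
    W[i][2] += 4.0 + 0.5 * i

rtn_err = fro(sub(W, quant4(W))) / fro(W)
plain_err = svdquant_rank1(W, refine_steps=0)[0] / fro(W)
refined_err = svdquant_rank1(W, refine_steps=4)[0] / fro(W)
print(f"RTN {rtn_err:.4f} | plain {plain_err:.4f} | plain+refine {refined_err:.4f}")
```

The low-rank branch soaks up the outlier column, so the 4-bit residual sees a much narrower range than round-to-nearest on the raw weight does.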
FULL 4×3 GRID (2026-06-01) — every method × every rank, fixed 300-img calib
Closes the old "L-shape, not a grid" gap: each cell built one-at-a-time on the SAME 300-image
calib from data/monet_cache (latents), so all 12 are directly comparable. eval velocity-loss
(lower=closer to teacher); wrecon = mean weight-recon rel-err. Dirs: outputs/abl_c300_r{R}_{variant}.
| rank | smaller | plain | plain+refine | whiten | whiten+refine |
|---|---|---|---|---|---|
| 16 | 3.67× | 0.0620 | 0.0655 | 0.0656 | 0.0556 |
| 32 | 3.59× | 0.0586 | 0.0574 | 0.0545 | 0.0476 |
| 64 | 3.43× | 0.0487 | 0.0446 ← grid best | 0.0588 | 0.0451 |
Best per rank: r16 whiten+refine 0.0556; r32 whiten+refine 0.0476; r64 plain+refine 0.0446 (whiten+refine 0.0451 ≈ tie). Overall best = r64 plain+refine 0.0446 @ 3.43×.
Full per-cell metrics (all 12, real measured)
eval-loss = held-out velocity-matching loss · vel-relerr = velocity field rel-L2 vs teacher ·
wrecon = mean weight-recon rel-err (100 layers) · orecon = mean output-recon rel-err (what
whitening optimizes; only computed for the whitened cells).
| rank | variant | smaller | eval-loss | vel-relerr | wrecon-relerr | orecon-relerr |
|---|---|---|---|---|---|---|
| 16 | plain | 3.67× | 0.0620 | 0.2235 | 0.1269 | — |
| 16 | plain+refine | 3.67× | 0.0655 | 0.2297 | 0.1251 | — |
| 16 | whiten | 3.67× | 0.0656 | 0.2299 | 0.1290 | 0.0733 |
| 16 | whiten+refine | 3.67× | 0.0556 | 0.2117 | 0.1273 | 0.0680 |
| 32 | plain | 3.59× | 0.0586 | 0.2174 | 0.1224 | — |
| 32 | plain+refine | 3.59× | 0.0574 | 0.2151 | 0.1193 | — |
| 32 | whiten | 3.59× | 0.0545 | 0.2095 | 0.1257 | 0.0719 |
| 32 | whiten+refine | 3.59× | 0.0476 | 0.1959 | 0.1226 | 0.0646 |
| 64 | plain | 3.43× | 0.0487 | 0.1980 | 0.1163 | — |
| 64 | plain+refine | 3.43× | 0.0446 | 0.1896 | 0.1110 | — |
| 64 | whiten | 3.43× | 0.0588 | 0.2177 | 0.1209 | 0.0695 |
| 64 | whiten+refine | 3.43× | 0.0451 | 0.1907 | 0.1155 | 0.0595 |
All three metrics track together (lower wrecon/orecon ↔ lower eval-loss) within a rank, with the
notable exception that plain+refine attains the lowest wrecon at each rank yet only the lowest
eval-loss at r64 — confirming weight-recon ≠ the eval objective (refine minimizes weight error,
which only aligns with the velocity loss once rank is high enough). Build/eval logs: tmp/abl_r*_*.{build,eval}.log.
Findings (these OVERTURN the prior L-shape conclusion that "each upgrade compounds"):
- Refine is the reliable workhorse — it helps at every rank/metric EXCEPT r16-plain (0.0620→0.0655, where minimizing weight error overfits and drifts from the output-optimal point). Every rank's best variant uses refine.
- Whitening ALONE is unreliable at 300-calib — non-monotonic in rank: hurts r16 (0.0656>0.0620), helps r32 (0.0545<0.0586), hurts r64 (0.0588>0.0487). It's overfitting the noisy 300-image Gram; at high rank it fits more directions to bad stats and generalizes worse than plain.
- Strong whiten×refine interaction — refine runs IN the whitened metric, correcting whitening's overfitting. At r16 neither upgrade alone helps yet together −10% (0.0620→0.0556). At r32 they stack to the row best (0.0476). At r64 whitening adds nothing over plain+refine (0.0451 vs 0.0446).
- At high rank, skip whitening — r64 plain+refine (0.0446) beats/ties everything and needs no Gram (simpler + faster build). Whitening only earns its keep at moderate rank (r32) or paired w/ refine.
- Sweet spot is a choice: r64 plain+refine 0.0446 @ 3.43× (max quality, simplest) vs r32 whiten+refine 0.0476 @ 3.59× (a bit more compression). Both ~4–5× below the surgery frontier (0.231).
These 300-calib numbers land close to the prior 100-calib report (r32 whiten+refine 0.0476 vs 0.0494; r64 whiten+refine 0.0451 vs 0.0454), but the per-knob story is different: whitening's instability is the headline. Montages
for all 12 cells (8 probe prompts each) are under outputs/abl_c300_*/eval/.
OPEN — whitening needs a higher-calib re-test. Its non-monotonic, often-harmful behavior at 300
calib is consistent with Gram under-estimation. The deferred 2000-image calib re-sweep (plan.md §5)
is a follow-up: does whitening become reliably beneficial with richer activation statistics?
Full methodology + math (whitening, Cholesky→eigh, refinement) in report/QUANT_REPORT.md.
Mechanism ablation — what each piece buys (2026-06-01) · ⚠️ SmoothQuant HURTS at W4A8
Stripping the pipeline down to isolate each mechanism (300-calib, same eval). smaller ≈ 3.76× for
the rank-0 rows (no low-rank bytes).
| config | eval-loss | vel-relerr | wrecon mean / max | note |
|---|---|---|---|---|
| RTN W4A8 (no smooth, no SVD, s=1) | 0.0573 | 0.2149 | 0.1112 / 0.1504 | naive floor — yet beats the next two |
| SmoothQuant W4A8 (rank-0, α=0.5, no SVD) | 0.0729 | 0.2424 | 0.1356 / 0.2633 | smoothing makes it WORSE |
| + SVD rank-16 plain (α=0.5) | 0.0620 | 0.2235 | 0.1269 / 0.198 | grid plain r16 |
| + SVD rank-64 plain | 0.0487 | 0.1980 | 0.1163 / 0.155 | grid plain r64 |
| grid best: r64 plain+refine (α=0.5) | 0.0446 | 0.1896 | 0.1110 / — | best so far |
Headline finding: SmoothQuant at α=0.5 is actively harmful for W4A8. Removing it (RTN, s=1) beats
SmoothQuant rank-0 by −21% (0.0729→0.0573) AND beats the smoothed SVD cells at r16/r32. Mechanism:
SmoothQuant migrates outliers out of activations into weights — a win only when activations are the
hard part (low-bit A4). At W4A8 the 8-bit activations are already easy, so migration buys nothing
there and widens the weight distribution, making the 4-bit weight quant harder (worst-layer wrecon
0.15→0.26). Implication: the entire α=0.5 grid is mis-tuned — the SVD branch was partly
compensating for smoothing damage. Re-running with no-smooth / low-α should beat 0.0446. Next: run the
best config with SMOOTH=0, then an α sweep {0, 0.25, 0.5}. Knob: SMOOTH=0 env on scripts/12
(s=1). Montages: outputs/abl_c300_r0_nosvd{,_nosmooth}/eval/.
SHELVED TRACK — block surgery (depth-prune single blocks → surrogates → distill)
| config | params | smaller | wall | flop | eval-loss | status |
|---|---|---|---|---|---|---|
| teacher 4B (baseline) | 3.876B | — | 1.00x | 1.00x | — | reference |
| v1 per-token drop-12 (SVD-energy) | 2.441B | 37% | 1.45x | 1.64x | — | COLLAPSED (non-functional) |
| per-token drop-6 (importance) | 3.158B | 19% | 1.19x | 1.24x | 0.308 | ok, soft |
| linattn drop-6 (simple elu+1) | 3.177B | 18% | 1.15x | 1.23x | 0.253 | ok |
| linattn drop-6 +RoPE+conv+warmstart | 3.177B | 18% | 1.15x | 1.23x | 0.231 | BEST QUALITY |
| linattn drop-8 +focused+FFN(all) | 2.995B | 23% | 1.20x | 1.28x | 0.269 | best colors/local |
| linattn drop-10 mixed (4 FFN+6 light) | 2.737B | 29% | 1.26x | 1.42x | 0.322 | KILLED ~step200 |