Buckets:
| # Results — FLUX.2 klein 4B -> compressed student | |
| Eval = held-out velocity-matching loss vs teacher (lower=closer; same fixed first-16 batch | |
| across all rows, so quant and surgery sit on ONE axis). wall=measured s/img @512/4-step | |
| batch-1 on A100; flop=estimated transformer-only ratio. | |
| ## ★★ 2026-06-14 — NVFP4 HEAD-TO-HEAD (image-space metrics, N=512) — separate axis | |
| A matched, paired comparison on **N=512 MJHQ-30k prompts, 512px, 4 steps, guidance 1.0, seed=idx**, | |
| using **image-space fidelity-to-teacher** metrics (LPIPS/PSNR/FID vs the teacher's own outputs) + a | |
| PickScore/CLIP semantic check. **This is NOT the velocity-loss axis above** — do not compare the | |
| numbers across sections; compare only within this table. All models share the same Qwen3 TE + VAE | |
| (only the transformer quant varies). Full write-up: `report/HEADTOHEAD_klein4b_nvfp4.md`; raw numbers | |
| `outputs/eval/h2h/metrics.json`; speed `outputs/nvfp4/benchmark_headtohead.json`. | |
| | model | bits | PickScore↑ | CLIP↑ | LPIPS↓ | PSNR↑ | FID↓(teacher) | FID(real) | real-kernel speed | | |
| |-------|------|-----------|-------|--------|-------|---------------|-----------|-------------------| | |
| | A teacher (bf16) | 16 | 21.64 | 30.95 | — | — | — | 89.6 | 1.0× (0.464 s/img @512) | | |
| | D plain NVFP4 r0 | W4A4 | 21.62 | 30.95 | 0.2076 | 17.44 | 39.31 | 88.5 | — (fake-q) | | |
| | ours r128 (fake-q) | W4A4 | 21.61 | 30.99 | 0.1732 | 18.50 | 33.37 | 89.6 | — (fake-q) | | |
| | **C ours r128 (REAL Nunchaku kernel)** | W4A4 | 21.62 | 30.94 | **0.1668** | **18.71** | **33.54** | 90.1 | **1.76×@512 / 1.87×@1024, 12.6 GB** | | |
| | E **BFL official FP8** | W8A8 | 21.65 | 30.94 | 0.0798 | 23.02 | 18.81 | 89.6 | — (needs TensorRT) | | |
| | **Δ low-rank branch (r0→r128)** | | ≈0 | ≈0 | **−19.7%** | **+1.27 dB** | **−14.7%** | ~flat | — | | |
| **Findings.** (1) **The SVDQuant low-rank branch helps at NVFP4 W4A4** — r0→r128: LPIPS −19.7%, | |
| PSNR +1.27 dB, FID-vs-teacher −14.7% (fake-q-vs-fake-q ablation agrees: −16.6% / +1.06 dB / −15.1%); | |
| reproduces the prior N=256 result at N=512. (2) **The real Nunchaku FP4 kernel reproduces (slightly | |
| beats) the fake-quant** (LPIPS 0.167 vs 0.173) → the gain holds on the deployed model. (3) **No | |
| semantic loss** — PickScore/CLIP flat across all incl. teacher. (4) **BFL official FP8** is closest to | |
| teacher (8-bit; high-precision/low-speedup point) vs our 4-bit/2.5×-kernel point. (5) **FID-vs-real is | |
| ~flat (~88–90 incl. teacher)** — tracks the klein-vs-MJHQ style gap, not quant; reported, not | |
| discriminating. (6) **BFL official NVFP4 (model B) could not be run** — cutlass tensor-core swizzled | |
| layout, needs BFL's TensorRT runtime (see `report/HEADTOHEAD_klein4b_nvfp4.md §5`); D (our plain r0) is | |
| the labeled controlled "plain NVFP4" stand-in, E is the real BFL baseline. No BFL number was fabricated. | |
| ## ★ 2026-06-13 — NVFP4 (Blackwell-native FP4) + first REAL kernel speed (Nunchaku) | |
| Two things landed this day: (1) NVFP4 added to our fake-quant SVDQuant (`flux2distill/svdquant.py`) | |
| and swept; (2) the first **real** low-bit kernel speed numbers, by calling Nunchaku's compiled | |
| NVFP4 W4A4 GEMM directly. **NVFP4 beats INT4 on BOTH quality and speed** — it's the format for this | |
| box (and any Blackwell / RTX 50 / B200). Same eval axis as the 2026-06-10 cells below. | |
| **NVFP4 format:** E2M1 elements (the 8 magnitudes {0,.5,1,1.5,2,3,4,6}·sign) + group-16 blocks + | |
| FP8(E4M3) block scales, applied to the **4-bit residual weights** (and optionally activations). The | |
| low-rank branch stays **bf16** — it is the high-precision error-/outlier-absorbing path. Knobs: | |
| `WFMT={int,nvfp4}`, `AFMT={int,nvfp4,fp8}` on `scripts/12` (driver `scripts/run_nvfp4_cell.sh`). | |
| ### Quality — NVFP4 sweep (klein-4B, plain+refine, no-smooth, same held-out velocity loss) | |
| | # | weights | acts | rank | wrecon mean | **eval-loss** | vs reference | | |
| |---|---------|------|------|-------------|---------------|--------------| | |
| | 1 | NVFP4 g16 | NVFP4 g16 | 32 | 0.0865 | 0.0390 | — | | |
| | 2 | NVFP4 g16 | NVFP4 g16 | 64 | 0.0817 | 0.0364 | INT4 W4A4 r64 = 0.0742 (**2.0× worse**) | | |
| | 3 | NVFP4 g16 | NVFP4 g16 | 128 | 0.0742 | **0.0303** | INT4 W4A4 r128 = 0.0610 (**2.0×**); ≈ INT4 W4A8 champ 0.0297 | | |
| | 4 | NVFP4 g16 | FP8 E4M3 | 64 | 0.0817 | 0.0204 | INT4 W4A8 r64 = 0.0297 (prior overall best) | | |
| | 5 | NVFP4 g16 | FP8 E4M3 | 128 | 0.0742 | **0.0169** ★ | **−43% vs the 0.0297 INT4 W4A8 champion** | | |
| **NEW OVERALL QUANT CHAMPION: NVFP4-weights + FP8-acts, r128 = 0.0169.** Dirs `outputs/nvfp4_*`. | |
| Findings: (1) **NVFP4 weights ≫ INT4 weights** — ~2× lower loss at matched rank; driver is the finer | |
| group-16 + E2M1 float grid (unit test: outlier column 0.064 vs INT4-g64 0.115). (2) **NVFP4 W4A4 r128 | |
| (0.0303) matches the INT4 W4A8 champion (0.0297)** while keeping activations at 4-bit — full W4A4 at | |
| 8-bit-act quality. (3) FP8 acts buy more quality (0.0169) but cost speed (below). Visually the | |
| champions are teacher-indistinguishable on the text + hand probes (the quant-sensitive ones). | |
| ### Speed — REAL kernels, klein-4B layer shapes, T=1536, RTX PRO 4500 Blackwell (sm_120) | |
| `scripts/23_nvfp4_kernel_bench.py` calls Nunchaku's compiled `svdq_gemm_w4a4_cuda` (NVFP4 W4A4) at | |
| each of klein-4B's 5 distinct Linear shapes, summed over all 100 Linears. FP8 row = `torch._scaled_mm` | |
| (NOT Nunchaku — a proxy for the W4+FP8-act variant, which has no fused kernel on Blackwell). | |
| | path | ms/step | **speedup vs bf16** | deploys | | |
| |------|---------|---------------------|---------| | |
| | bf16 | 73.8 | 1.00× | baseline | | |
| | **NVFP4 W4A4 r64** (real Nunchaku kernel) | 26.8 | **2.75×** | cell 2 (0.0364) | | |
| | **NVFP4 W4A4 r128** (real Nunchaku kernel) | 29.7 | **2.49×** | cell 3 (0.0303) | | |
| | FP8 proxy r64 | 60.3 | 1.22× | cell 4 (0.0204) | | |
| | FP8 proxy r128 | 60.8 | 1.21× | cell 5 (0.0169) | | |
| Matches the 9B end-to-end Nunchaku number (FP4 254 ms/step = **2.69×** vs bf16 684; full pipeline | |
| 1.29 s/img @1024²/4-step, 24.95 GB). **Rank tax (real W4A4 kernel): r64→r128 ≈ 11%** (2.75×→2.49×) | |
| for the 0.0364→0.0303 quality gain — rank is a quality knob with a small, real speed cost. | |
| **End-to-end (real Nunchaku kernel, klein-4B Linears swapped to NVFP4 W4A4, rest bf16):** bf16 → | |
| NVFP4 is **1.24× @512px / 1.18× @1024px** with **~28% less VRAM** (16.9→12.0 GB @512). Far below the | |
| 2.5× per-layer GEMM because **attention is bf16 + O(N²) and dominates** (more so at 1024px), VAE + | |
| text-encode are fixed bf16 overhead, and a Linears-only swap is unfused. The 9B Nunchaku FULL pipeline | |
| hit 2.69× precisely because it ALSO fuses attention/quant and is more GEMM-heavy — so the lever for a | |
| real 4B speedup is the fully-fused NunchakuFlux2 model (fused attention), not just quantized Linears. | |
| Rank tax is negligible end-to-end here (low-rank branch tiny vs the rest). `scripts/24_nunchaku_e2e_speed.py`. | |
| **★ FULLY-FUSED end-to-end (the real lever):** converting klein-4B to the `NunchakuFlux2Transformer2DModel` | |
| fused path (`fused_qkv_norm_rottary` + `attention_fp16` + W4A4 GEMM, no bf16 SDPA) gives **1.76× @512px | |
| and 1.87× @1024px** (per-step **2.2×/2.29×**, VRAM −26%: 18.6→13.7 GB @1024). This is the real number — | |
| fusing attention lifts the Linears-only 1.2× to ~1.9×, and unlike the Linears-only swap it *improves* | |
| with resolution. The transformer step (2.3×) ≈ the per-layer GEMM (2.5×); end-to-end is capped by the | |
| 4B's VAE+text-encode (~30% of its small total). `scripts/25_fused_4b_speed.py` (dummy weights — speed is | |
| value-independent; built from source via `/workspace/build_nunchaku`). **Deployable single artifact (fast + correct weights) — ✅ DONE.** | |
| **★ DEPLOYABLE NVFP4-fused klein-4B (correct weights + real kernel).** Wrote our own NVFP4 weight | |
| exporter (`flux2distill/nunchaku_export.py`): per-Linear `SVD low-rank + iterative refine → NVFP4 | |
| residual (E2M1 codes + per-group-16 FP8 wscales + per-tensor alpha, wcscales=1)`, packed into | |
| Nunchaku's MMA layout via their `pack_weight`/`pack_micro_scale`/`pack_lowrank_weight`. **Convention | |
| validated** (`scripts/26`): with pre-quantized acts the real kernel reconstructs our intended weight to | |
| **2.99%**. `scripts/27_convert_full_4b.py` converts all 120 fused Linears (handling the qkv fusion + | |
| the single-block `to_qkv_mlp_proj`/`to_out` splits), loads them into `NunchakuFlux2Transformer2DModel`, | |
| and **generates correct images on the real FP4 kernel** — teacher-indistinguishable on the text | |
| ("THE OPEN PAGE" legible) and hand probes (montages `outputs/nvfp4/deploy/`). This is the full | |
| deployable model: **NVFP4 quality (≈0.0303) + the fused ~1.9× speed + −26% VRAM in one artifact.** | |
| (NB the fused packed-rotary path requires batch=1 generation.) | |
| **The quality↔speed fork (both deployable on Blackwell, pick one):** | |
| - **Speed:** NVFP4 **W4A4** r128 → 0.0303 @ **2.49×** (real Nunchaku kernel, loads today). r64 → 0.0364 @ 2.75×. | |
| - **Quality:** NVFP4-W + **FP8-A** r128 → **0.0169**, but FP8 compute caps at ~2× bf16 (FP8 tensor cores | |
| are half FP4 throughput) and has **no fused kernel** (1.2× measurable today). The tradeoff is physical. | |
| **Why INT4 is the wrong format on Blackwell (hardware, sourced):** the 5th-gen tensor cores natively do | |
| FP4/FP6/FP8/INT8/BF16/… but **dropped INT4** (Turing/Ampere/Ada had INT4 IMMA; sm_120 doesn't). Nunchaku | |
| ships INT4 for Turing/Ampere/Ada and **NVFP4 for Blackwell**; `get_precision()` returns `fp4` here. | |
| Forcing `svdq-int4` on this card → 1677 ms/step (slower than bf16). **So INT4 W4A4 stays the deployable | |
| format for the huge RTX 20/30/40 base; NVFP4 is for Blackwell/B200** — complementary, one per generation. | |
| ## ⚠️ 2026-06-10 — box rebuilt AGAIN; eval axis SHIFTED; old numbers not comparable | |
| The box was rebuilt (A100-80GB → **RTX PRO 4500 Blackwell 32 GB**, system python, torch | |
| 2.12+cu130, transformers 5.9→5.10.2, diffusers git Jun-1→Jun-10). Re-evaluating the UNCHANGED | |
| grid-best checkpoint (`r64 plain+refine`, zero missing/unexpected keys on load) gives | |
| **0.0325** (vel-relerr 0.1661) vs the recorded **0.0446** (0.1896) — a −27% instrument shift, | |
| NOT a model change. Cause: the 16-sample eval ran through different SDPA kernels, different | |
| Qwen3 prompt embeddings (transformers bump), and possibly different seeded σ/noise draws; with | |
| N=16 that easily moves the mean. **Rule: numbers are only comparable WITHIN one box era.** The | |
| 4×3 grid + mechanism-ablation tables below are the OLD axis (still internally consistent); | |
| every 2026-06-10+ experiment is compared against same-box re-evals. Re-anchored baseline: | |
| **r64 plain+refine (α=0.5, smoothed) = 0.0325** (`outputs/recheck_r64_plain_refine`, montages | |
| in its `eval/`). | |
| Montage read of that baseline on the new box (teacher|quant, 8 probes): texture/scene/large-text | |
| probes (storefront, lake, neon, spiderweb, fisherman) ≈ indistinguishable; the quant **misses the | |
| counting probe** (2 fried eggs vs the teacher's correct 3) and **mangles the hand probe** (folded/ | |
| merged fingers); chalkboard small-print is gibberish in BOTH (teacher limitation). Counting + hand | |
| are the sensitive probes to watch across the SMOOTH=0 ablation. | |
| ### ✅ SMOOTH=0 ablation (2026-06-10, new box) — CONFIRMED: drop SmoothQuant at W4A8 | |
| The #1 queued experiment, run at 3 ranks (plain+refine, 300-calib, all new-axis; each SMOOTH=0 | |
| build compared against a same-box re-eval of its smoothed α=0.5 twin). Dirs: | |
| `outputs/abl_c300_r{16,32,64}_plain_refine_nosmooth` vs `outputs/recheck_r{16,32,64}_plain_refine`. | |
| | rank | smoothed α=0.5 (re-eval) | **SMOOTH=0** | Δ eval-loss | wrecon mean (sm → ns) | ns wrecon max | | |
| |------|--------------------------|--------------|-------------|------------------------|---------------| | |
| | 16 | 0.0405 / rel 0.1855 | **0.0348** / 0.1719 | **−14.1%** | 0.1251 → 0.1041 | 0.1379 | | |
| | 32 | 0.0362 / rel 0.1753 | **0.0331** / 0.1675 | **−8.6%** | 0.1193 → 0.1003 | 0.1286 | | |
| | 64 | 0.0325 / rel 0.1661 | **0.0297** / 0.1588 | **−8.6%** | 0.1110 → 0.0944 | 0.1124 | | |
| **Findings:** | |
| 1. **SMOOTH=0 wins at every rank — new overall best: r64 plain+refine no-smooth = 0.0297.** | |
| `SMOOTH=0` is now the DEFAULT for all W4A8 builds. | |
| 2. **The win grows as rank shrinks** (−8.6% at r64/r32 → −14.1% at r16): the SVD branch was | |
| partly compensating smoothing damage, and with less capacity more damage shows through. | |
| Confirms the mechanism from the rank-0 ablation (smoothing widens the 4-bit weight spread). | |
| 3. **No-smooth buys ~one rank tier for free:** ns-r16 (0.0348) beats smoothed-r32 (0.0362); | |
| ns-r32 (0.0331) ≈ smoothed-r64 (0.0325). I.e. same quality at ~4% more compression. | |
| 4. **Visual (8-probe montages):** the hand probe — the most quant-sensitive — is visibly FIXED | |
| vs the smoothed baseline (plausible spread fingers at all 3 ranks vs mangled/merged at the | |
| smoothed r64 baseline). The counting probe (3 eggs) is still missed by EVERY quant (2 eggs), | |
| smoothed or not, all ranks — a rank/smooth-independent semantic drift. Large in-image text | |
| is always preserved. | |
| 5. wrecon improves ~15% at every rank and the worst layer drops to 0.11–0.14 (vs 0.15–0.26 | |
| smoothed) — exactly the predicted mechanism. | |
| ### ✅ α sweep (2026-06-10) — the dial has NO good setting; low α is the WORST | |
| α ∈ {off, 0.1, 0.25, 0.5} × r{32,64}, plain+refine, 300-calib, new axis | |
| (dirs `outputs/abl_c300_r{32,64}_plain_refine_a{10,25}`; off/0.5 cells from above): | |
| | α | r64 eval-loss | r32 eval-loss | r64 wrecon | r32 wrecon | | |
| |---------|---------------|---------------|------------|------------| | |
| | **off** | **0.0297** | **0.0331** | 0.0944 | 0.1003 | | |
| | 0.1 | 0.0380 ⚠ worst| 0.0408 ⚠ worst| 0.0999 | 0.1057 | | |
| | 0.25 | 0.0317 | 0.0349 | 0.1007 | 0.1069 | | |
| | 0.5 | 0.0325 | 0.0362 | 0.1110 | 0.1193 | | |
| **Identical ordering at both ranks (off < 0.25 < 0.5 ≪ 0.1) — a replicated U-shape:** | |
| 1. **No α beats off** — every dose of migration hurts at W4A8; `SMOOTH=0` is the permanent default. | |
| 2. **α=0.1 is the worst point, not the safest**: at low α the factor ≈ `1/max|W|^(1-α)` (the | |
| weight-equalizing extreme). wrecon barely moves (0.0999 vs 0.1007 at r64) yet eval-loss jumps | |
| +20% — the rescale rotates residual quant error into output-relevant directions. The sharpest | |
| demonstration yet that **weight-recon ≠ the eval objective**; never tune α by wrecon. | |
| 3. Montages agree: the hand probe (cleanly rendered at SMOOTH=0) regresses to merged fingers at | |
| α=0.1. Counting (3 eggs) fails everywhere, α-independent. | |
| 4. Determinism note: a repeat eval of an unchanged cell reproduces its loss exactly (0.0331 → | |
| 0.0331), and rebuilt cells reproduce wrecon bit-for-bit — cell deltas on this box are real, | |
| not run-to-run noise. | |
| The old-axis grid below retains its per-knob story (refine reliable, whitening unstable at | |
| 300-calib) but all its α=0.5 absolute numbers are superseded by no-smooth. | |
| ### W4A4 ablation (2026-06-10, 3 cells) — the smoothing flip CONFIRMED; naive A4 not viable yet | |
| All plain+refine, 300-calib, new axis. ABITS=4 (per-token dynamic 4-bit acts — same scheme as | |
| A8, just 16 levels). Dirs `outputs/abl_c300_r{64,128}_w4a4_plain_refine_{nosmooth,a50}`. | |
| | cell | eval-loss | vel-relerr | wrecon mean | vs W4A8 twin | | |
| |----------------------------|-----------|------------|-------------|--------------| | |
| | r64 SMOOTH=0 (A8 recipe) | 0.5103 | 0.6582 | 0.0944 | 0.0297 → **17× worse** | | |
| | r64 α=0.5 | 0.3885 | 0.5743 | 0.1111 | smoothing **helps +24%** | | |
| | r128 α=0.5 | **0.3060**| 0.5097 | 0.0992 | rank **helps +21%** | | |
| **Findings:** | |
| 1. **The smoothing flip is real and symmetric.** At A8 smoothing hurt −9%; at A4 it helps +24%. | |
| Its value is purely a function of activation bit-width — at A4 the per-token 4-bit quant is | |
| destroyed by channel outliers (one outlier forces the whole token's scale → small channels | |
| round to 0), which is exactly what migration mitigates. Mechanism fully closed. | |
| 2. **Rank matters more at A4 than A8** (r64→r128: −21%): the low-rank branch runs on FULL- | |
| precision activations, so extra rank routes more computation around the 4-bit bottleneck — | |
| at A4 rank is an activation shield, not just a weight-outlier absorber. | |
| 3. **Naive W4A4 is NOT viable**: best cell 0.3060 is ~10× the W4A8 best (0.0297). Visually: | |
| no-smooth A4 shatters in-image text ("THE OPEN PAGE" → "PINE OPEEN I AAGE"); α=0.5 restores | |
| readable text; r128 renders the storefront cleanly — but composition/anatomy stay broken | |
| (hand probe: two wrong-gesture hands). (No cross-axis comparison to the old surgery numbers.) | |
| 4. **Next lever (queued): per-group activation quant** — give each group of ~64 channels its own | |
| dynamic scale (the weight side already does this; Atom/QServe-style). Attacks the outlier | |
| problem per-token/per-group with zero weight-side cost — expected to beat the whole α dial. | |
| ### W4A4 α-up sweep (2026-06-10, 3 more cells) — A4 has an INTERIOR optimum at α≈0.75 | |
| plain+refine, 300-calib, new axis. Dirs `outputs/abl_c300_r{64,128}_w4a4_plain_refine_a{75,100}`. | |
| | α | r64 eval-loss | r128 eval-loss | wrecon mean/max (r64 · r128) | | |
| |-------|---------------|----------------|-----------------------------------| | |
| | off | 0.5103 | — | 0.0944/0.1124 · — | | |
| | 0.5 | 0.3885 | 0.3060 | 0.1111/0.1486 · 0.0992/0.1344 | | |
| | 0.75 | 0.2819 | **0.2080** ★ W4A4 best | 0.1345/0.2585 · 0.1168/0.1960 | | |
| | 1.0 | — | 0.2397 | — · 0.1469/**0.3423** | | |
| **Findings:** | |
| 1. **α* ≈ 0.75 at (per-token) A4** — the curve descends to 0.75 then turns at 1.0: full | |
| flattening wrecks the weights (worst-layer 0.34) faster than it relieves the activations, | |
| even with r128+refine absorbing. Campaign symmetry: **the optimal α tracks the bottleneck** | |
| (A8 → off; per-token A4 → 0.75; nothing in between wins at either). | |
| 2. W4A4 improved 0.5103 → 0.2080 (−59%) via smoothing+rank alone — still ~7× the W4A8 champion | |
| (0.0297). Visual at the best cell: in-image text nearly clean (one corrupted glyph), coherent | |
| compositions return, but counting/gesture still wrong. Not deployable yet. | |
| 3. **⚠ Paper-spec correction (from re-reading SVDQuant):** the paper's W4A4 uses **per-GROUP | |
| activations** (group 64, like its weights; NVFP4 = group 16) — our per-token activations | |
| reproduce their *baselines* (ViDiT-Q/MixDQ, which fail catastrophically exactly like our | |
| cells). So per-group acts isn't an enhancement, it's the missing piece of the actual | |
| SVDQuant W4A4 recipe → implemented below; step change confirmed. | |
| ### ✅ W4A4 per-group activations (2026-06-10) — the fix; W4A4 becomes viable | |
| Implemented `a_group` (per-group dynamic act scales along channels; `AGROUP` env on `scripts/12`, | |
| recorded in `quant_config.json`). Unit test: one 60× outlier channel → per-token A4 rel-err 0.59, | |
| g64 0.11 (5×). 2×2 grid {r64,r128} × {SMOOTH=0, α=0.5}, AGROUP=64, plain+refine, 300-calib. | |
| Dirs `outputs/abl_c300_r{64,128}_w4a4g64_{nosmooth,a50}`. | |
| | W4A4 g64 acts | SMOOTH=0 | α=0.5 | (per-token, for scale) | | |
| |---------------|-----------|--------|-------------------------------| | |
| | r64 | **0.0742**| 0.0759 | 0.5103 (ns) / 0.3885 (a50) | | |
| | r128 | **0.0610** ★ W4A4 best | 0.0620 | — / 0.3060 (a50) | | |
| **Findings:** | |
| 1. **Per-group acts (g64) is THE W4A4 fix: −85%** (0.5103 → 0.0742 at r64-ns) — far beyond | |
| everything the α/rank campaign bought combined. W4A4 best now **0.0610** (r128-ns), ~2× the | |
| W4A8 champion (0.0297) instead of 17×. | |
| 2. **With per-group acts, smoothing is dead weight again** (a50 slightly worse at both ranks) — | |
| the outlier problem belongs to quantizer granularity, not weight-side migration, at every | |
| bit-width. The W4A4 recipe converges to the same clean form as W4A8: | |
| **plain SVD + refine, NO smoothing, per-group W and A, rank to taste.** | |
| 3. This recipe is **calibration-free** (no smooth → absmax unused; no whiten → Gram unused; | |
| acts dynamic) — calib size/content is irrelevant to the current champions; it only matters | |
| if whitening returns or for future QAT. (NB the `jasperai/monet` calib set is NOT paintings — | |
| it's diverse photographic data, 260/400 captions mention text/signs — so content-narrowness | |
| was never a confound; "Monet" is just the dataset name.) | |
| 4. Qualitative (24 montages reviewed): per-group fixes gross text destruction, anatomy collapse | |
| and composition smearing entirely; residual A4 damage = *symbolic precision* — single-glyph | |
| text errors ("PAGE"→"PAYE"/"PACE"), counting flicker (2 vs 3 eggs, seed-fragile, non-monotone | |
| in loss), slight identity drift. no-smooth cells are visually cleanest (hands track the metric). | |
| 5. Act group-size ladder (unit test): per-token 0.60 → g128 0.17 → g64 0.13 → g32 0.10 → g16 | |
| 0.077 — g16 (NVFP4's native group) is the queued next knob (sim-only for INT4; deployable | |
| as NVFP4 on Blackwell). | |
| ## ACTIVE TRACK — W4A8 SVDQuant (fake-quant quality study; A100-era grid below) | |
| Quantize all 100 block Linears: smooth → SVD rank-r low-rank (16-bit) → 4-bit residual + | |
| 8-bit per-token activations. `smaller` = quantized-weight bytes vs bf16 (a real low-bit | |
| kernel realizes this; fake-quant here measures quality only — no wall/flop yet on A100). | |
| Three composable knobs (all W4A8, α=0.5, group=64, same fixed eval batch): | |
| - **plain** = SVD of smoothed weight (base SVDQuant paper's headline derivation) | |
| - **whiten** = activation-aware SVD minimizing OUTPUT error ‖X̂(Ŵ−L)‖ (ASVD/SVD-LLM idea; our add) | |
| - **+refine** = iterative low-rank refinement, re-fit L to absorb 4-bit quant error (SVDQuant §4.2) | |
| ### FULL 4×3 GRID (2026-06-01) — every method × every rank, fixed 300-img calib | |
| Closes the old "L-shape, not a grid" gap: each cell built one-at-a-time on the SAME 300-image | |
| calib from `data/monet_cache` (latents), so all 12 are directly comparable. **eval velocity-loss** | |
| (lower=closer to teacher); `wrecon` = mean weight-recon rel-err. Dirs: `outputs/abl_c300_r{R}_{variant}`. | |
| | rank | smaller | plain | plain+refine | whiten | **whiten+refine** | | |
| |------|---------|-------|--------------|--------|-------------------| | |
| | 16 | 3.67× | 0.0620 | 0.0655 | 0.0656 | **0.0556** | | |
| | 32 | 3.59× | 0.0586 | 0.0574 | 0.0545 | **0.0476** | | |
| | 64 | 3.43× | 0.0487 | **0.0446** ← grid best | 0.0588 | 0.0451 | | |
| Best per rank: r16 **whiten+refine 0.0556**; r32 **whiten+refine 0.0476**; r64 **plain+refine 0.0446** | |
| (whiten+refine 0.0451 ≈ tie). **Overall best = r64 plain+refine 0.0446 @ 3.43×.** | |
| #### Full per-cell metrics (all 12, real measured) | |
| `eval-loss` = held-out velocity-matching loss · `vel-relerr` = velocity field rel-L2 vs teacher · | |
| `wrecon` = mean weight-recon rel-err (100 layers) · `orecon` = mean output-recon rel-err (what | |
| whitening optimizes; only computed for the whitened cells). | |
| | rank | variant | smaller | eval-loss | vel-relerr | wrecon-relerr | orecon-relerr | | |
| |------|---------------|---------|-----------|------------|---------------|---------------| | |
| | 16 | plain | 3.67× | 0.0620 | 0.2235 | 0.1269 | — | | |
| | 16 | plain+refine | 3.67× | 0.0655 | 0.2297 | 0.1251 | — | | |
| | 16 | whiten | 3.67× | 0.0656 | 0.2299 | 0.1290 | 0.0733 | | |
| | 16 | whiten+refine | 3.67× | **0.0556**| 0.2117 | 0.1273 | 0.0680 | | |
| | 32 | plain | 3.59× | 0.0586 | 0.2174 | 0.1224 | — | | |
| | 32 | plain+refine | 3.59× | 0.0574 | 0.2151 | 0.1193 | — | | |
| | 32 | whiten | 3.59× | 0.0545 | 0.2095 | 0.1257 | 0.0719 | | |
| | 32 | whiten+refine | 3.59× | **0.0476**| 0.1959 | 0.1226 | 0.0646 | | |
| | 64 | plain | 3.43× | 0.0487 | 0.1980 | 0.1163 | — | | |
| | 64 | plain+refine | 3.43× | **0.0446**| 0.1896 | 0.1110 | — | | |
| | 64 | whiten | 3.43× | 0.0588 | 0.2177 | 0.1209 | 0.0695 | | |
| | 64 | whiten+refine | 3.43× | 0.0451 | 0.1907 | 0.1155 | 0.0595 | | |
| All three metrics track together (lower wrecon/orecon ↔ lower eval-loss) within a rank, with the | |
| notable exception that **plain+refine attains the lowest wrecon at each rank yet only the lowest | |
| eval-loss at r64** — confirming weight-recon ≠ the eval objective (refine minimizes weight error, | |
| which only aligns with the velocity loss once rank is high enough). Build/eval logs: `tmp/abl_r*_*.{build,eval}.log`. | |
| **Findings (these OVERTURN the prior L-shape conclusion that "each upgrade compounds"):** | |
| 1. **Refine is the reliable workhorse** — it helps at every rank/metric EXCEPT r16-plain (0.0620→0.0655, | |
| where minimizing *weight* error overfits and drifts from the output-optimal point). Every rank's | |
| best variant uses refine. | |
| 2. **Whitening ALONE is unreliable at 300-calib** — non-monotonic in rank: hurts r16 (0.0656>0.0620), | |
| helps r32 (0.0545<0.0586), hurts r64 (0.0588>0.0487). It's overfitting the noisy 300-image Gram; | |
| at high rank it fits more directions to bad stats and generalizes *worse* than plain. | |
| 3. **Strong whiten×refine interaction** — refine runs IN the whitened metric, correcting whitening's | |
| overfitting. At r16 neither upgrade alone helps yet together −10% (0.0620→0.0556). At r32 they stack | |
| to the row best (0.0476). At r64 whitening adds nothing over plain+refine (0.0451 vs 0.0446). | |
| 4. **At high rank, skip whitening** — r64 **plain+refine (0.0446)** beats/ties everything and needs no | |
| Gram (simpler + faster build). Whitening only earns its keep at moderate rank (r32) or paired w/ refine. | |
| 5. **Sweet spot is a choice:** r64 plain+refine 0.0446 @ 3.43× (max quality, simplest) vs | |
| r32 whiten+refine 0.0476 @ 3.59× (a bit more compression). Both ~4–5× below the surgery frontier (0.231). | |
| These 300-calib numbers land close to the prior 100-calib report (r32 wr 0.0476 vs 0.0494; r64 wr 0.0451 | |
| vs 0.0454) but the *per-knob* story is different — whitening's instability is the headline. Montages | |
| for all 12 cells (8 probe prompts each) under `outputs/abl_c300_*/eval/`. | |
| OPEN — **whitening needs a higher-calib re-test.** Its non-monotonic, often-harmful behavior at 300 | |
| calib is consistent with Gram under-estimation. The deferred **2000-image calib re-sweep** (plan.md §5) | |
| is a follow-up: does whitening become reliably beneficial with richer activation statistics? | |
| Full methodology + math (whitening, Cholesky→eigh, refinement) in `report/QUANT_REPORT.md`. | |
| ### Mechanism ablation — what each piece buys (2026-06-01) · ⚠️ SmoothQuant HURTS at W4A8 | |
| Stripping the pipeline down to isolate each mechanism (300-calib, same eval). `smaller` ≈ 3.76× for | |
| the rank-0 rows (no low-rank bytes). | |
| | config | eval-loss | vel-relerr | wrecon mean / **max** | note | | |
| |------------------------------------------|-----------|------------|-----------------------|------| | |
| | **RTN W4A8** (no smooth, no SVD, s=1) | **0.0573**| 0.2149 | 0.1112 / **0.1504** | naive floor — yet *beats* the next two | | |
| | SmoothQuant W4A8 (rank-0, α=0.5, no SVD) | 0.0729 | 0.2424 | 0.1356 / **0.2633** | smoothing makes it WORSE | | |
| | + SVD rank-16 plain (α=0.5) | 0.0620 | 0.2235 | 0.1269 / 0.198 | grid `plain` r16 | | |
| | + SVD rank-64 plain | 0.0487 | 0.1980 | 0.1163 / 0.155 | grid `plain` r64 | | |
| | grid best: r64 plain+refine (α=0.5) | 0.0446 | 0.1896 | 0.1110 / — | best so far | | |
| **Headline finding: SmoothQuant at α=0.5 is actively harmful for W4A8.** Removing it (RTN, s=1) beats | |
| SmoothQuant rank-0 by **−21%** (0.0729→0.0573) AND beats the *smoothed* SVD cells at r16/r32. Mechanism: | |
| SmoothQuant migrates outliers **out of activations into weights** — a win only when activations are the | |
| hard part (low-bit A4). At **W4A8 the 8-bit activations are already easy**, so migration buys nothing | |
| there and **widens the weight distribution**, making the 4-bit weight quant harder (worst-layer wrecon | |
| 0.15→**0.26**). **Implication: the entire α=0.5 grid is mis-tuned — the SVD branch was partly | |
| compensating for smoothing damage. Re-running with no-smooth / low-α should beat 0.0446.** Next: run the | |
| best config with `SMOOTH=0`, then an α sweep {0, 0.25, 0.5}. Knob: `SMOOTH=0` env on `scripts/12` | |
| (`s=1`). Montages: `outputs/abl_c300_r0_nosvd{,_nosmooth}/eval/`. | |
| ## SHELVED TRACK — block surgery (depth-prune single blocks → surrogates → distill) | |
| | config | params | smaller | wall | flop | eval-loss | status | | |
| |------------------------------------------|---------|---------|-------|-------|-----------|--------| | |
| | teacher 4B (baseline) | 3.876B | — | 1.00x | 1.00x | — | reference | | |
| | v1 per-token drop-12 (SVD-energy) | 2.441B | 37% | 1.45x | 1.64x | — | COLLAPSED (non-functional) | | |
| | per-token drop-6 (importance) | 3.158B | 19% | 1.19x | 1.24x | 0.308 | ok, soft | | |
| | linattn drop-6 (simple elu+1) | 3.177B | 18% | 1.15x | 1.23x | 0.253 | ok | | |
| | linattn drop-6 +RoPE+conv+warmstart | 3.177B | 18% | 1.15x | 1.23x | 0.231 | BEST QUALITY | | |
| | linattn drop-8 +focused+FFN(all) | 2.995B | 23% | 1.20x | 1.28x | 0.269 | best colors/local | | |
| | linattn drop-10 mixed (4 FFN+6 light) | 2.737B | 29% | 1.26x | 1.42x | 0.322 | KILLED ~step200 | | |
Xet Storage Details
- Size:
- 30 kB
- Xet hash:
- ffda64d2dcb49d26d7e27f25bd2c9438de928a13f110d0d7ad97c1d056279bbf
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.