| # Split-Q architecture: synth-scale prototype |
|
|
| ## Motivation |
|
|
| From `v76_math_derivation.md` §7.5: vanilla Q-shift (`Q[t] = W_Q(x[t] + x[t-1])`) |
| collapses the induction bilinear from 4-deep to 2-deep, but the Q-shift conditions |
| all *failed* the Math-G1 gate at synth scale (3-5% copy_acc, 19-21% global IPMR). |
| |
| **Why Q-shift failed**: with `Q[t] = W_Q(x[t] + x[t-1])`, the QK product expands to |
| ``` |
| Q[t]·K[k] = e[t]^T A e[k] ← current-token noise (depth-2) |
| + e[t-1]^T A e[k] ← INDUCTION (depth-2) |
| + (other terms) |
| ``` |
| Both relevant terms have the same depth, so the model can't easily learn `A` to |
| upweight the induction term and downweight the current-token noise term. The |
| "current-vs-current" contamination is the bottleneck, not the bilinear depth. |
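
The expansion above can be checked numerically. The following is a minimal NumPy sketch, not the memo's training code: the toy sizes, the random stand-in representations `x`, the unshifted key `K[k] = W_K x[k]`, and the zero fill for `x[-1]` are all assumptions made for illustration, and the "(other terms)" are absent because the toy has nothing but the two token terms.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_head, T = 16, 8, 6            # toy sizes, chosen only for illustration

x = rng.standard_normal((T, d))    # stand-in token representations e[0..T-1]
W_Q = rng.standard_normal((d_head, d))
W_K = rng.standard_normal((d_head, d))
A = W_Q.T @ W_K                    # effective bilinear form, A = W_Q^T W_K

# Q-shift query: Q[t] = W_Q (x[t] + x[t-1]), with x[-1] treated as zero.
x_prev = np.vstack([np.zeros((1, d)), x[:-1]])
Q = (x + x_prev) @ W_Q.T
K = x @ W_K.T                      # assumed unshifted keys, K[k] = W_K x[k]
scores = Q @ K.T                   # scores[t, k] = Q[t]·K[k]

current_term = x @ A @ x.T         # e[t]^T A e[k]   (current-token noise)
induction_term = x_prev @ A @ x.T  # e[t-1]^T A e[k] (induction)

# Both terms pass through the same A, so no choice of A can scale one
# term without also scaling the other.
assert np.allclose(scores, current_term + induction_term)
```

Because the two terms share a single `A`, the only way to favour the induction term is through the geometry of the representations themselves, which is the situation the paragraph above describes.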
|
|
| ## Split-Q proposal (math memo §7.5) |
|
|
| Split heads into two groups: |
| - **Current heads** (first `(1-f)·H`): receive `Q[t] = W_Q · x[t]` |
| - **Prev heads** (last `f·H`): receive `Q[t] = W_Q · x[t-1]` |
|
|
| Each group has its own residual subspace (the embedding dim of those heads). |
| The prev heads' QK product is `e[t-1]^T A e[k]` — depth-2 AND uncontaminated by |
| current-token info, because those heads simply don't see `x[t]`. |
|
|
| Implementation cost: 2× q_proj forward (one for `x`, one for `x[t-1]`); same |
| parameter count. Could be optimized by splitting `W_Q` itself, but identity-matrix |
| mixing is simpler and matches existing param semantics. |
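
A minimal NumPy sketch of the head split described above, not the code in `synth_induction_train.py`. The function name `split_q`, the single-sequence `(T, d)` layout, the zero fill for `x[-1]`, and rounding `f·H` with `round` are illustrative assumptions. One shared `W_Q` is applied twice, which mirrors the "2× q_proj forward, same parameter count" cost.

```python
import numpy as np

def split_q(x, W_Q, n_heads, frac=0.5):
    """Return per-head queries of shape (T, n_heads, d_head).

    The first (n_heads - n_prev) heads read x[t]; the last n_prev heads
    read x[t-1]. Hypothetical helper, for illustration only.
    """
    T, d = x.shape
    d_head = W_Q.shape[0] // n_heads
    n_prev = int(round(frac * n_heads))
    n_curr = n_heads - n_prev

    # x[t-1], with a zero row standing in for the missing x[-1].
    x_prev = np.vstack([np.zeros((1, d)), x[:-1]])

    # Two forward passes through the same projection (no extra parameters).
    q_curr = (x @ W_Q.T).reshape(T, n_heads, d_head)
    q_prev = (x_prev @ W_Q.T).reshape(T, n_heads, d_head)

    # Current heads come from q_curr, prev heads from q_prev.
    return np.concatenate([q_curr[:, :n_curr], q_prev[:, n_curr:]], axis=1)

rng = np.random.default_rng(0)
T, d, H = 5, 16, 4
x = rng.standard_normal((T, d))
W_Q = rng.standard_normal((d, d))
q = split_q(x, W_Q, n_heads=H, frac=0.5)

# Perturbing the last token leaves the prev heads at that position unchanged:
# they only ever see x[t-1].
x_perturbed = x.copy()
x_perturbed[-1] += 1.0
q_perturbed = split_q(x_perturbed, W_Q, n_heads=H, frac=0.5)
assert np.allclose(q[-1, 2:], q_perturbed[-1, 2:])
```

The final assertion is the "uncontaminated" property in miniature: changing `x[t]` cannot move the prev-head queries at position `t`.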
|
|
| ## Predictions (vs Q-shift baseline, --no-alibi, gumbel+±1) |
|
|
| If split-Q's "clean prev subspace" hypothesis is right: |
|
|
| 1. **E1-SQ0 top per-head IPMR > E1-Q0's 4.89%** at matched training steps. |
| 2. **E1-SQ0 global IPMR > E1-Q0's 18.8%** (more heads find structure). |
| 3. **copy_acc may stay low** at synth scale because the task isn't only induction |
| — needs the rest of the architecture (FFN, output codebook) to actually copy. |
| Best current 1-bit synth E1-LR0 only reaches 71-76% copy_acc at 5K steps. |
| 4. **E1-SQR (split-Q + RWSP)** should beat E1-LR0 (5.98% top) since it combines |
| the val_bpc-winning RWSP with the new split-Q induction primitive. |
| |
| If split-Q fails, the math memo §7.5 hypothesis ("contamination, not depth") is |
| also refuted. That would suggest induction at synth is bounded by something else |
| (embedding orthogonality? capacity?). |
| |
| ## Implementation (this session) |
| |
| - `synth_induction_train.py:198-228` — `TinyAttention.__init__` accepts |
| `use_split_q` and `split_q_frac` (default 0.5). |
| - `synth_induction_train.py:241-269` — `forward()` branches on `use_split_q`: |
| computes `Q_curr_full` from `x` and `Q_prev_full` from `F.pad(x[:, :-1])`, |
| then concatenates first `(1-frac)*H` heads of curr with last `frac*H` of prev. |
| - `synth_induction_train.py:310-330` — `TinyBlock` plumbs flags. |
| - `synth_induction_train.py:335-365` — `TinyLM` plumbs `split_q_layers` (subset |
| of layers to apply split-Q to). |
| - `synth_induction_train.py:539-554` — 5 new conditions: E1-SQ, E1-SQ0, |
| E1-SQ25, E1-SQs, E1-SQR. |
| |
| ## Results |
| |
| ### 500-step smoke (E1-SQ0) |
| | Metric | Value | |
| |---|---| |
| | Top per-head IPMR | 6.52% (L0H2) | |
| | Global IPMR | 22.28% | |
| | copy_acc | 1.0% | |
|
|
| ### 3K-step run (E1-SQ0; settings matched to the prior Q-shift evaluation, baseline column at 5K steps) |
| | Metric | Value | E1-Q0 @ 5K (refuted baseline) | |
| |---|---|---| |
| | Top per-head IPMR | **3.80% (L0H0)** | 4.89% (L1H2) | |
| | Global IPMR | **20.04%** | 18.8% | |
| | copy_acc | 2.9% | 4.6% | |
| |
| **Result**: With split-Q applied only at L0, top per-head IPMR gets WORSE with longer training |
| (6.52% @ 500 → 3.80% @ 3K). The early induction structure visible at 500 steps |
| *degrades* as training progresses. Global IPMR is slightly better than the 5K-step Q-shift baseline |
| (20.04% vs 18.8%), but top per-head IPMR is worse (3.80% vs 4.89%). |
| |
| **Interpretation**: gumbel+sign-STE training finds a local minimum that |
| *doesn't* keep the prev-token Q heads doing induction. The "clean prev |
| subspace" exists architecturally but the optimizer doesn't drive those heads |
| toward induction-aligned `W_Q^T W_K` when it can route capacity elsewhere. |
| |
| This negative result refutes math memo §7.5's "contamination, not depth" |
| hypothesis. The contamination hypothesis predicts cleaner subspace → better |
| induction; instead, both Q-shift and split-Q hover around 4-5% top per-head IPMR with |
| mediocre global IPMR (about 20%). |
| |
| ### Final 3K-step apples-to-apples comparison (--no-alibi, gumbel + ±1) |
| |
| | Condition | Top per-head IPMR | Global IPMR | copy_acc | |
| |---|---|---|---| |
| | E1-Q0 (Q-shift L0) | **4.35%** (L1H0/L0H2) | 20.11% | 3.0% | |
| | E1-SQ0 (split-Q L0) | 3.80% (L0H0) | 20.04% | 2.9% | |
| | E1-SQ (split-Q all) | **4.35%** (L0H3) | 20.38% | (~3%) | |
|
|
| **Verdict: REFUTED.** Split-Q does not improve over Q-shift at synth scale. |
| Both produce indistinguishable induction quality (top ~4%, global ~20%). |
|
|
| This refutes the math memo §7.5 contamination hypothesis. If current-token |
| contamination of the query were the bottleneck, the architecturally clean separation in |
| split-Q should have produced clearly better induction. It didn't. |
|
|
| ## Revised hypothesis: the bottleneck is *embedding orthogonality*, not architecture |
|
|
| All Q-shift family conditions (vanilla, split, K+Q combined) plateau at the |
| same 4-5% top-per-head IPMR. The induction signal `e[t-1]^T A e[k]` requires |
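| `A` such that `e_v^T A e_u ≈ d · 𝟙[v=u]`. But at d=128 with vocab=64 ±1 |
| embeddings (assuming independent random signs), `e_v^T e_u` for v≠u is a sum of d |
| independent ±1 terms, distributed as `2·Binomial(d, 1/2) − d`, with mean 0 and |
| standard deviation √d ≈ 11. Against the diagonal peak `e_v^T A e_v ≈ d = 128`, that is a |
| relative noise level of √d / d = 1/√d ≈ 9%, i.e. a signal-to-noise ratio of only √d ≈ 11. |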
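
The size of this noise floor can be checked with a quick simulation. The sketch below assumes `A` is the identity and that the embeddings are i.i.d. random ±1 vectors; trained embeddings may behave differently, so this only illustrates the counting argument above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 128, 64

# Random ±1 embeddings, one row per token (assumption: i.i.d. signs).
E = rng.choice([-1.0, 1.0], size=(vocab, d))
G = E @ E.T                                # G[v, u] = e_v^T e_u, i.e. A = identity

diag = np.diag(G)                          # exactly d for ±1 rows
off = G[~np.eye(vocab, dtype=bool)]        # all entries with v != u

noise_std = off.std()
print(f"diagonal peak    : {diag.mean():.0f}")
print(f"off-diagonal std : {noise_std:.2f} (sqrt(d) = {np.sqrt(d):.2f})")
print(f"noise / signal   : {noise_std / d:.3f} (1/sqrt(d) = {1 / np.sqrt(d):.3f})")
```

Changing `d` in this script is a cheap way to see how the ratio moves before committing to a larger synth run.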
|
|
| This bounds how peaked the induction probabilities can get regardless of |
| architecture. To break past ~5% per-head IPMR at d=128, we'd need: |
|
|
| 1. **Larger d** (d=256 or d=512 raises the SNR, √d, from ≈11 to 16-23; relative noise 1/√d falls from ≈9% to about 6% and 4%) |
| 2. **Orthogonalized embeddings** (a regularizer that pushes e_v ⊥ e_u for v≠u) |
| 3. **Both** combined |
|
|
| Options 1 and 2 are both testable at synth scale. The §7.5 split-Q architecture is NOT the |
| bottleneck; the noise floor in the embedding-bilinear product is. |
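
One way to quantify option 2 is a mean squared off-diagonal cosine between embedding rows, which could also serve as a regularizer term. The sketch below is an assumption about what such a measure might look like, not something taken from the memo or the training script: `offdiag_penalty` and `sylvester_hadamard` are hypothetical helpers, and the Hadamard rows are used only to show that exactly orthogonal ±1 embeddings exist when vocab ≤ d and d is a power of two.

```python
import numpy as np

def offdiag_penalty(E):
    """Mean squared cosine similarity over all pairs of distinct rows."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    G = En @ En.T
    mask = ~np.eye(len(E), dtype=bool)
    return float(np.mean(G[mask] ** 2))

def sylvester_hadamard(n):
    """n x n ±1 matrix with mutually orthogonal rows; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(0)
d, vocab = 128, 64

random_E = rng.choice([-1.0, 1.0], size=(vocab, d))
ortho_E = sylvester_hadamard(d)[:vocab]    # 64 orthogonal ±1 rows in d=128

print(offdiag_penalty(random_E))  # roughly 1/d for random ±1 rows
print(offdiag_penalty(ortho_E))   # zero up to floating-point rounding
```

Whether a learned model can use such embeddings under gumbel + ±1 training is exactly what the proposed experiment would have to show; the sketch only establishes that the noise floor is a property of random signs, not of ±1 embeddings as such.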
|
|
| ## Implication for 300M v76 work |
|
|
| At 300M, d_model=1536 (12× larger than synth d=128). The expected SNR scales |
| as √d, so the induction signal should be √12 ≈ 3.5× cleaner than at synth. This |
| is consistent with the empirical observation: the 300M v76 cold-start gets 10.21% top |
| per-head with 15 specialists ≥5%, while synth caps at ~5%. |
| |
| The 75M synth-tested Q-shift / split-Q / K-shift are likely all "below the |
| noise floor at d=128"; if so, they would only show benefits when d is large enough. |
| |
| ## Status |
| |
| NEGATIVE RESULT. Split-Q implemented and tested; refuted at synth. Not worth |
| running at 75M scale based on the synth signal. The math memo's §7.5 fix |
| hypothesis is wrong; the right next step is testing embedding orthogonality |
| or running at d=256+ synth scale. |
| |
| ## Open questions |
| |
| 1. Does the gain hold at 3K steps (matched to prior Q-shift evaluation)? Answered in Results: no, top per-head IPMR fell from 6.52% @ 500 to 3.80% @ 3K. |
| 2. Does E1-SQR (split-Q + RWSP) beat E1-LR0 (the current synth 1-bit ceiling |
| without softmax+fp32)? |
| 3. Does the result transfer to 75M FineWeb-Edu? Synth IPMR has been a poor |
| predictor at scale (75M v76 RWSP+K-shift wins val_bpc despite no induction |
| improvement). Need a 75M run to test. |
|
|