# Split-Q architecture: synth-scale prototype
## Motivation
From `v76_math_derivation.md` §7.5: vanilla Q-shift (`Q[t] = W_Q(x[t] + x[t-1])`)
collapses the induction bilinear from 4-deep to 2-deep, but the Q-shift conditions
all *failed* the Math-G1 gate at synth scale (3-5% copy_acc, 19-21% global IPMR).
**Why Q-shift failed**: with `Q[t] = W_Q(x[t] + x[t-1])`, the QK product expands to
```
Q[t]·K[k] = e[t]^T A e[k]      ← current-token noise (depth-2)
          + e[t-1]^T A e[k]    ← INDUCTION (depth-2)
          + (other terms)
```
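The two named terms follow from linearity of the query projection. A minimal NumPy sketch of that step, assuming random stand-in matrices for `W_Q` and `W_K`, a toy width, and `A = W_Q^T W_K`; the "other terms" are omitted and none of these values come from the actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                  # toy width (assumption), not the synth d=128
W_Q = rng.standard_normal((d, d))       # stand-in query projection
W_K = rng.standard_normal((d, d))       # stand-in key projection
A = W_Q.T @ W_K                         # bilinear form: Q·K = e_q^T A e_k

# Stand-in embeddings for the current token, the previous token, and a key.
e_t, e_prev, e_k = rng.standard_normal((3, d))

# Vanilla Q-shift: the query is projected from the sum of both tokens.
q = W_Q @ (e_t + e_prev)
k = W_K @ e_k
score = q @ k

current_term = e_t @ A @ e_k            # current-token noise
induction_term = e_prev @ A @ e_k       # induction signal

# Both terms pass through the same A, so A cannot scale them independently.
print(np.isclose(score, current_term + induction_term))
```
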
Both relevant terms have the same depth, so the model can't easily learn `A` to
upweight the induction term and downweight the current-token noise term. The
"current-vs-current" contamination is the bottleneck, not the bilinear depth.
## Split-Q proposal (math memo §7.5)
Split heads into two groups:
- **Current heads** (first `(1-f)·H`): receive `Q[t] = W_Q · x[t]`
- **Prev heads** (last `f·H`): receive `Q[t] = W_Q · x[t-1]`
Each group has its own residual subspace (the embedding dim of those heads).
The prev heads' QK product is `e[t-1]^T A e[k]` — depth-2 AND uncontaminated by
current-token info, because those heads simply don't see `x[t]`.
Implementation cost: 2× q_proj forward (one for `x`, one for `x[t-1]`); same
parameter count. Could be optimized by splitting `W_Q` itself, but identity-matrix
mixing is simpler and matches existing param semantics.
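The head split can be sketched as follows. This is an illustrative NumPy version, not the code in `synth_induction_train.py`: the function name `split_q_queries`, the zero padding at `t = 0`, and the rounding of `frac * H` are assumptions.

```python
import numpy as np

def split_q_queries(x, W_Q, n_heads, frac=0.5):
    """Return queries of shape (B, T, H, d_head).

    The first (1 - frac) * H heads are projected from x[t]; the last
    frac * H heads are projected from x[t-1] (zeros at t = 0, an assumption).
    """
    B, T, D = x.shape
    d_head = D // n_heads
    n_prev = int(round(frac * n_heads))
    n_curr = n_heads - n_prev

    # Shift the sequence right by one position, zero-padding the first step.
    x_prev = np.concatenate([np.zeros((B, 1, D), dtype=x.dtype), x[:, :-1]], axis=1)

    # Two full projections with the same weights: 2x forward cost, same params.
    q_curr = (x @ W_Q.T).reshape(B, T, n_heads, d_head)
    q_prev = (x_prev @ W_Q.T).reshape(B, T, n_heads, d_head)

    # Keep the current heads from q_curr and the prev heads from q_prev.
    return np.concatenate([q_curr[:, :, :n_curr], q_prev[:, :, n_curr:]], axis=2)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 8))      # (batch, time, width), toy sizes
W_Q = rng.standard_normal((8, 8))
q = split_q_queries(x, W_Q, n_heads=4, frac=0.5)
print(q.shape)
```

Because the prev heads are built only from `x_prev`, their queries carry no information about `x[t]`, which is the property the proposal relies on.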
## Predictions (vs Q-shift baseline, --no-alibi, gumbel+±1)
If split-Q's "clean prev subspace" hypothesis is right:
1. **E1-SQ0 top per-head IPMR > E1-Q0's 4.89%** at matched training steps.
2. **E1-SQ0 global IPMR > E1-Q0's 18.8%** (more heads find structure).
3. **copy_acc may stay low** at synth scale because the task isn't only induction
— needs the rest of the architecture (FFN, output codebook) to actually copy.
Best current 1-bit synth E1-LR0 only reaches 71-76% copy_acc at 5K steps.
4. **E1-SQR (split-Q + RWSP)** should beat E1-LR0 (5.98% top) since it combines
the val_bpc-winning RWSP with the new split-Q induction primitive.
If split-Q fails, the math memo §7.5 hypothesis ("contamination, not depth") is
also refuted. That would suggest induction at synth is bounded by something else
(embedding orthogonality? capacity?).
## Implementation (this session)
- `synth_induction_train.py:198-228` — `TinyAttention.__init__` accepts
`use_split_q` and `split_q_frac` (default 0.5).
- `synth_induction_train.py:241-269` — `forward()` branches on `use_split_q`:
computes `Q_curr_full` from `x` and `Q_prev_full` from `F.pad(x[:, :-1])`,
then concatenates first `(1-frac)*H` heads of curr with last `frac*H` of prev.
- `synth_induction_train.py:310-330` — `TinyBlock` plumbs flags.
- `synth_induction_train.py:335-365` — `TinyLM` plumbs `split_q_layers` (subset
of layers to apply split-Q to).
- `synth_induction_train.py:539-554` — 5 new conditions: E1-SQ, E1-SQ0,
E1-SQ25, E1-SQs, E1-SQR.
## Results
### 500-step smoke (E1-SQ0)
| Metric | Value |
|---|---|
| Top per-head IPMR | 6.52% (L0H2) |
| Global IPMR | 22.28% |
| copy_acc | 1.0% |
### 3K-step run (E1-SQ0, full match to prior Q-shift evaluation)
| Metric | Value | E1-Q0 @ 5K (refuted baseline) |
|---|---|---|
| Top per-head IPMR | **3.80% (L0H0)** | 4.89% (L1H2) |
| Global IPMR | **20.04%** | 18.8% |
| copy_acc | 2.9% | 4.6% |
**Result**: With split-Q applied only at L0 (E1-SQ0), top per-head IPMR gets worse with longer training
(6.52% @ 500 → 3.80% @ 3K). The early induction structure visible at 500 steps
*degrades* as training progresses. Global IPMR is slightly better than Q-shift's 5K-step figure
(20.04% vs 18.8%) but top per-head is worse (3.80% vs 4.89%).
**Interpretation**: gumbel+sign-STE training finds a local minimum that
*doesn't* keep the prev-token Q heads doing induction. The "clean prev
subspace" exists architecturally but the optimizer doesn't drive those heads
toward induction-aligned `W_Q^T W_K` when it can route capacity elsewhere.
This is a negative result for math memo §7.5's "contamination, not depth"
hypothesis. That hypothesis predicts cleaner subspace → better
induction; instead, both Q-shift and split-Q hover around 4-5% top per-head
IPMR with global IPMR near 20%.
### Final 3K-step apples-to-apples comparison (--no-alibi, gumbel + ±1)
| Condition | Top per-head IPMR | Global IPMR | copy_acc |
|---|---|---|---|
| E1-Q0 (Q-shift L0) | **4.35%** (L1H0/L0H2) | 20.11% | 3.0% |
| E1-SQ0 (split-Q L0) | 3.80% (L0H0) | 20.04% | 2.9% |
| E1-SQ (split-Q all) | **4.35%** (L0H3) | 20.38% | (~3%) |
**Verdict: REFUTED.** Split-Q does not improve over Q-shift at synth scale.
Both produce indistinguishable induction quality (top ~4%, global ~20%).
This refutes the math memo §7.5 contamination hypothesis. If current-token
contamination of the induction term were the bottleneck, the architecturally
clean separation in split-Q should have produced clearly better induction. It didn't.
## Revised hypothesis: the bottleneck is *embedding orthogonality*, not architecture
All Q-shift family conditions (vanilla, split, K+Q combined) plateau at the
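same 4-5% top-per-head IPMR. The induction signal `e[t-1]^T A e[k]` requires
`A` such that `e_v^T A e_u ≈ d · 𝟙[v=u]`. But at d=128 with vocab=64 ±1
embeddings, and assuming the embeddings are close to random, the off-diagonal
product `e_v^T e_u` (v≠u) is a sum of d independent ±1 terms, i.e.
`2·Binomial(d, 1/2) − d`, with mean 0 and standard deviation √d ≈ 11. Against
the diagonal peak `e_v^T A e_v ≈ d = 128`, that is a noise-to-signal ratio of
√d / d = 1/√d ≈ 9% per key (equivalently, SNR ≈ √d ≈ 11).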
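The √d noise estimate can be checked with a quick Monte Carlo run. This sketch uses only the standard library and assumes independent uniform ±1 embeddings with `A` equal to the identity. Both are simplifications, and the helper names are hypothetical.

```python
import math
import random
import statistics

def random_sign_vector(d, rng):
    """One embedding with independent, uniformly random ±1 entries."""
    return [rng.choice((-1, 1)) for _ in range(d)]

def off_diagonal_std(d, n_pairs, seed=0):
    """Empirical std of e_v^T e_u over independent random pairs (v != u)."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_pairs):
        u = random_sign_vector(d, rng)
        v = random_sign_vector(d, rng)
        dots.append(sum(a * b for a, b in zip(u, v)))
    return statistics.pstdev(dots)

d = 128
noise = off_diagonal_std(d, n_pairs=2000)
print(f"empirical off-diagonal std: {noise:.2f} (sqrt(d) = {math.sqrt(d):.2f})")
print(f"noise / peak: {noise / d:.3f} (1/sqrt(d) = {1 / math.sqrt(d):.3f})")
```
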
This bounds how peaked the induction probabilities can get regardless of
architecture. To break past ~5% per-head IPMR at d=128, we'd need:
1. **Larger d** (d=256 or d=512 brings SNR = √d to about 16-23, i.e. a noise-to-signal ratio 1/√d of roughly 6% and 4%)
2. **Orthogonalized embeddings** (a regularizer that pushes e_v ⊥ e_u for v≠u)
3. **Both** combined
Both are testable at synth scale. The §7.5 split-Q architecture is NOT the
bottleneck — it's the noise floor in the embedding-bilinear product.
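One possible form of the orthogonality regularizer in item 2 is a penalty on the off-diagonal entries of the normalized Gram matrix. The sketch below only evaluates that quantity in NumPy. The function names are hypothetical, the cosine normalization is one choice among several, and how such a penalty would interact with sign-STE training is not addressed. The Sylvester Hadamard construction is added purely to illustrate that exactly orthogonal ±1 vectors exist when vocab=64 ≤ d=128.

```python
import numpy as np

def offdiag_gram_penalty(E):
    """Mean squared off-diagonal cosine similarity between the rows of E."""
    E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    G = E_unit @ E_unit.T
    off = G - np.diag(np.diag(G))
    V = E.shape[0]
    return float((off ** 2).sum() / (V * (V - 1)))

def sylvester_hadamard(n):
    """n x n Hadamard matrix with ±1 entries; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

vocab, d = 64, 128
rng = np.random.default_rng(0)

E_random = rng.choice([-1.0, 1.0], size=(vocab, d))   # random ±1 embeddings
E_orthogonal = sylvester_hadamard(d)[:vocab]          # mutually orthogonal ±1 rows

random_penalty = offdiag_gram_penalty(E_random)       # expected to be near 1/d
orthogonal_penalty = offdiag_gram_penalty(E_orthogonal)
print(random_penalty, orthogonal_penalty)
```

For random ±1 rows the penalty should sit near 1/d, which is the same noise floor estimated above, while the Hadamard rows drive it to zero.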
## Implication for 300M v76 work
At 300M, d_model=1536 (12× larger than synth d=128). The expected SNR scales
as √d, so the induction signal should be ~3.5× cleaner than at synth. This
is directionally consistent with the empirical observation: the 300M v76
cold-start gets 10.21% top per-head with 15 specialists ≥5%, while synth caps at ~5%.
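The scaling arithmetic can be reproduced with a few lines, assuming SNR = √d (peak d over noise √d) as in the estimate above. The list of widths is chosen for illustration.

```python
import math

d_synth = 128
widths = (128, 256, 512, 1536)

# SNR under the sqrt(d) model, and the gain relative to the synth width.
snr = {d: math.sqrt(d) for d in widths}
gain = {d: math.sqrt(d / d_synth) for d in widths}

for d in widths:
    print(f"d={d:5d}  SNR~{snr[d]:5.1f}  noise/signal~{1 / snr[d]:6.1%}  vs synth x{gain[d]:.2f}")
```
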
The 75M synth-tested Q-shift / split-Q / K-shift are likely all "below the
noise floor at d=128" — they only show benefits when d is large enough.
## Status
NEGATIVE RESULT. Split-Q implemented and tested; refuted at synth. Not worth
running at 75M scale based on the synth signal. The math memo's §7.5 fix
hypothesis is wrong; the right next step is testing embedding orthogonality
or running at d=256+ synth scale.
## Open questions
1. Does the gain hold at 3K steps (matched to prior Q-shift evaluation)? Answered in Results: no. Top per-head IPMR fell from 6.52% @ 500 to 3.80% @ 3K.
2. Does E1-SQR (split-Q + RWSP) beat E1-LR0 (the current synth 1-bit ceiling
without softmax+fp32)?
3. Does the result transfer to 75M FineWeb-Edu? Synth IPMR has been a poor
predictor at scale (75M v76 RWSP+K-shift wins val_bpc despite no induction
improvement). Need a 75M run to test.