Split-Q architecture: synth-scale prototype
Motivation
From v76_math_derivation.md §7.5: vanilla Q-shift (Q[t] = W_Q(x[t] + x[t-1]))
collapses the induction bilinear from 4-deep to 2-deep, but the Q-shift conditions
all failed the Math-G1 gate at synth scale (3-5% copy_acc, 19-21% global IPMR).
Why Q-shift failed: with Q[t] = W_Q(x[t] + x[t-1]), the QK product expands to
Q[t]·K[k] = e[t]^T A e[k] ← current-token noise (depth-2)
+ e[t-1]^T A e[k] ← INDUCTION (depth-2)
+ (other terms)
Both relevant terms have the same depth, so the model can't easily learn A to
upweight the induction term and downweight the current-token noise term. The
"current-vs-current" contamination is the bottleneck, not the bilinear depth.
Split-Q proposal (math memo §7.5)
Split heads into two groups:
- Current heads (first (1-f)·H): receive Q[t] = W_Q · x[t]
- Prev heads (last f·H): receive Q[t] = W_Q · x[t-1]
Each group has its own residual subspace (the embedding dim of those heads).
The prev heads' QK product is e[t-1]^T A e[k] — depth-2 AND uncontaminated by
current-token info, because those heads simply don't see x[t].
Implementation cost: 2× q_proj forward (one for x, one for x[t-1]); same
parameter count. Could be optimized by splitting W_Q itself, but identity-matrix
mixing is simpler and matches existing param semantics.
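The contrast between the Q-shift expansion in the Motivation section and the prev-head product can be checked numerically. The following is a minimal, dependency-free sketch. It assumes x[t] is just the token embedding e[t] (so the "other terms" vanish) and A = W_Q^T W_K. The helper names (`matvec`, `dot`, `logit_terms`) and the small dimension d=8 are illustrative and are not taken from synth_induction_train.py.

```python
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def logit_terms(d=8, seed=0):
    """Compare the Q-shift logit with a split-Q prev-head logit for one (t, k) pair."""
    rng = random.Random(seed)

    def rand_vec():
        return [rng.gauss(0.0, 1.0) for _ in range(d)]

    W_Q = [rand_vec() for _ in range(d)]
    W_K = [rand_vec() for _ in range(d)]
    e_t, e_prev, e_k = rand_vec(), rand_vec(), rand_vec()
    key = matvec(W_K, e_k)

    # Vanilla Q-shift: Q[t] = W_Q (x[t] + x[t-1])
    q_shift = matvec(W_Q, [a + b for a, b in zip(e_t, e_prev)])

    return {
        "qshift": dot(q_shift, key),
        "current": dot(matvec(W_Q, e_t), key),       # e[t]^T A e[k]
        "induction": dot(matvec(W_Q, e_prev), key),  # e[t-1]^T A e[k]
        # Split-Q prev head: Q[t] = W_Q x[t-1], so it never sees x[t]
        "split_q_prev": dot(matvec(W_Q, e_prev), key),
    }


terms = logit_terms()
print(terms["qshift"], terms["current"] + terms["induction"])
print(terms["split_q_prev"], terms["induction"])
```

Under these assumptions the Q-shift logit is the sum of the current-token and induction terms, both passing through the same A, while the prev-head logit is the induction term alone.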
Predictions (vs Q-shift baseline, --no-alibi, gumbel+±1)
If split-Q's "clean prev subspace" hypothesis is right:
- E1-SQ0 top per-head IPMR > E1-Q0's 4.89% at matched training steps.
- E1-SQ0 global IPMR > E1-Q0's 18.8% (more heads find structure).
- copy_acc may stay low at synth scale because the task isn't only induction — needs the rest of the architecture (FFN, output codebook) to actually copy. Best current 1-bit synth E1-LR0 only reaches 71-76% copy_acc at 5K steps.
- E1-SQR (split-Q + RWSP) should beat E1-LR0 (5.98% top) since it combines the val_bpc-winning RWSP with the new split-Q induction primitive.
If split-Q fails, the math memo §7.5 hypothesis ("contamination, not depth") is also refuted. That would suggest induction at synth is bounded by something else (embedding orthogonality? capacity?).
Implementation (this session)
- synth_induction_train.py:198-228: TinyAttention.__init__ accepts use_split_q and split_q_frac (default 0.5).
- synth_induction_train.py:241-269: forward() branches on use_split_q: computes Q_curr_full from x and Q_prev_full from F.pad(x[:, :-1]), then concatenates the first (1-frac)*H heads of curr with the last frac*H heads of prev.
- synth_induction_train.py:310-330: TinyBlock plumbs flags.
- synth_induction_train.py:335-365: TinyLM plumbs split_q_layers (subset of layers to apply split-Q to).
- synth_induction_train.py:539-554: 5 new conditions: E1-SQ, E1-SQ0, E1-SQ25, E1-SQs, E1-SQR.
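The head assembly in the forward() branch can be sketched without PyTorch to make the indexing explicit. This is a hedged illustration, not the code in synth_induction_train.py: the function name `split_q`, the list-of-vectors layout, and the use of `int(frac * n_heads)` to count prev heads are all assumptions. The zero row at t=0 stands in for the padding that F.pad would add.

```python
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def split_q(x, W_Q, n_heads, frac=0.5):
    """Build split-Q queries for one sequence.

    x: list of T input vectors of length d (d divisible by n_heads).
    Returns T query vectors whose first (1-frac)*H head slices come from
    W_Q x[t] and whose last frac*H head slices come from W_Q x[t-1].
    """
    d = len(x[0])
    head_dim = d // n_heads
    n_prev = int(frac * n_heads)          # assumed rounding rule
    cut = (n_heads - n_prev) * head_dim   # first column owned by prev heads

    # Shift the sequence right by one step; position 0 sees a zero vector.
    x_prev = [[0.0] * d] + x[:-1]

    queries = []
    for x_t, x_tm1 in zip(x, x_prev):
        q_curr_full = matvec(W_Q, x_t)    # first q_proj pass
        q_prev_full = matvec(W_Q, x_tm1)  # second q_proj pass
        queries.append(q_curr_full[:cut] + q_prev_full[cut:])
    return queries


rng = random.Random(0)
d, T, H = 8, 5, 4
W_Q = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
q = split_q(x, W_Q, n_heads=H, frac=0.5)
print(len(q), len(q[0]))
```

Both projections reuse the same W_Q, which is why the parameter count is unchanged while the q_proj cost doubles.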
Results
500-step smoke (E1-SQ0)
| Metric | Value |
|---|---|
| Top per-head IPMR | 6.52% (L0H2) |
| Global IPMR | 22.28% |
| copy_acc | 1.0% |
3K-step run (E1-SQ0, full match to prior Q-shift evaluation)
| Metric | Value | E1-Q0 @ 5K (refuted baseline) |
|---|---|---|
| Top per-head IPMR | 3.80% (L0H0) | 4.89% (L1H2) |
| Global IPMR | 20.04% | 18.8% |
| copy_acc | 2.9% | 4.6% |
Result: with split-Q applied at L0 only, top per-head IPMR gets WORSE with longer training (6.52% @ 500 → 3.80% @ 3K). The early induction structure visible at 500 steps degrades as training progresses. Global IPMR is slightly better than Q-shift (20.04% vs 18.8%), but top per-head IPMR is worse (3.80% vs 4.89%). Note that the E1-Q0 figures in this table come from its 5K-step run.
Interpretation: gumbel+sign-STE training finds a local minimum that
doesn't keep the prev-token Q heads doing induction. The "clean prev
subspace" exists architecturally but the optimizer doesn't drive those heads
toward induction-aligned W_Q^T W_K when it can route capacity elsewhere.
This refutes math memo §7.5's "contamination, not depth" hypothesis. That hypothesis predicts cleaner subspace → better induction; instead, both Q-shift and split-Q hover around 4-5% top per-head IPMR with mediocre global IPMR.
Final 3K-step apples-to-apples comparison (--no-alibi, gumbel + ±1)
| Condition | Top per-head IPMR | Global IPMR | copy_acc |
|---|---|---|---|
| E1-Q0 (Q-shift L0) | 4.35% (L1H0/L0H2) | 20.11% | 3.0% |
| E1-SQ0 (split-Q L0) | 3.80% (L0H0) | 20.04% | 2.9% |
| E1-SQ (split-Q all) | 4.35% (L0H3) | 20.38% | (~3%) |
Verdict: REFUTED. Split-Q does not improve over Q-shift at synth scale. Both produce indistinguishable induction quality (top ~4%, global ~20%).
This refutes the math memo §7.5 contamination hypothesis. If "current-vs-prev contamination" were the bottleneck, the architecturally clean separation in split-Q should have produced clearly better induction. It didn't.
Revised hypothesis: the bottleneck is embedding orthogonality, not architecture
All Q-shift family conditions (vanilla, split, K+Q combined) plateau at the
same 4-5% top-per-head IPMR. The induction signal e[t-1]^T A e[k] requires
A such that e_v^T A e_u ≈ d · 𝟙[v=u]. But at d=128 with vocab=64 random ±1
embeddings, e_v^T e_u for v≠u is a sum of d independent ±1 terms
(2·Binomial(d, 1/2) − d), with mean 0 and standard deviation √d ≈ 11. Against
the diagonal peak e_v^T A e_v ≈ d = 128, the relative noise is
√d / d = 1/√d ≈ 9%, i.e. a signal-to-noise ratio of only √d ≈ 11.
This bounds how peaked the induction probabilities can get regardless of architecture. To break past ~5% per-head IPMR at d=128, we'd need:
- Larger d (d=256 or d=512 raises the SNR to 16-23, i.e. lowers the relative noise to 6.3-4.4%)
- Orthogonalized embeddings (a regularizer that pushes e_v ⊥ e_u for v≠u)
- Both combined
Both are testable at synth scale. The §7.5 split-Q architecture is NOT the bottleneck — it's the noise floor in the embedding-bilinear product.
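The noise-floor estimate can be sanity-checked with a small simulation. This sketch assumes i.i.d. uniform ±1 embeddings and takes A to be the identity, which is the idealized case the argument above uses. The function name `gram_stats` and the seed are illustrative.

```python
import math
import random
import statistics


def gram_stats(d=128, vocab=64, seed=0):
    """Return (diagonal values, std of off-diagonal inner products)."""
    rng = random.Random(seed)
    # Random ±1 embedding table: vocab rows, d columns.
    E = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(vocab)]

    # Signal: e_v^T e_v, which is exactly d for ±1 entries.
    diag = [sum(a * a for a in e) for e in E]

    # Noise: e_v^T e_u over all unordered pairs v != u.
    off = [
        sum(a * b for a, b in zip(E[i], E[j]))
        for i in range(vocab)
        for j in range(i + 1, vocab)
    ]
    return diag, statistics.pstdev(off)


diag, off_std = gram_stats()
print("diagonal peak:", diag[0])
print("off-diagonal std:", off_std, "vs sqrt(d):", math.sqrt(128))
print("relative noise:", off_std / diag[0])
```

The same function can be rerun with d=256 or d=512 to see how the off-diagonal spread grows as √d while the peak grows as d.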
Implication for 300M v76 work
At 300M, d_model=1536 (12× larger than synth d=128). The expected SNR scales as √d, so the induction signal should be ~3.5× cleaner than at synth. This matches the empirical observation: the 300M v76 cold-start gets 10.21% top per-head with 15 specialists ≥5%, while synth caps at ~5%.
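The √d scaling behind this comparison is simple arithmetic, tabulated below for the widths mentioned in this note. The sketch assumes the idealized random ±1 embedding model (peak d, noise √d); the function names `snr` and `relative_noise` are illustrative.

```python
import math


def snr(d):
    """Idealized signal-to-noise ratio: peak d over noise sqrt(d)."""
    return math.sqrt(d)


def relative_noise(d):
    """Noise as a fraction of the diagonal peak: sqrt(d) / d."""
    return 1.0 / math.sqrt(d)


for d in (128, 256, 512, 1536):
    print(f"d={d:5d}  SNR={snr(d):5.1f}  relative noise={100 * relative_noise(d):4.1f}%")

# Ratio between 300M width (1536) and synth width (128): sqrt(12).
gain_300m = snr(1536) / snr(128)
print(f"300M vs synth SNR gain: {gain_300m:.2f}x")
```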
The 75M synth-tested Q-shift / split-Q / K-shift are likely all "below the noise floor at d=128" — they only show benefits when d is large enough.
Status
NEGATIVE RESULT. Split-Q implemented and tested; refuted at synth. Not worth running at 75M scale based on the synth signal. The math memo's §7.5 fix hypothesis is wrong; the right next step is testing embedding orthogonality or running at d=256+ synth scale.
Open questions
- Does the gain hold at 3K steps (matched to prior Q-shift evaluation)?
- Does E1-SQR (split-Q + RWSP) beat E1-LR0 (the current synth 1-bit ceiling without softmax+fp32)?
- Does the result transfer to 75M FineWeb-Edu? Synth IPMR has been a poor predictor at scale (75M v76 RWSP+K-shift wins val_bpc despite no induction improvement). Need a 75M run to test.