# Split-Q architecture: synth-scale prototype

## Motivation

From `v76_math_derivation.md` §7.5: vanilla Q-shift (`Q[t] = W_Q(x[t] + x[t-1])`)
collapses the induction bilinear from 4-deep to 2-deep, but the Q-shift conditions
all *failed* the Math-G1 gate at synth scale (3-5% copy_acc, 19-21% global IPMR).

**Why Q-shift failed**: with `Q[t] = W_Q(x[t] + x[t-1])`, the QK product expands to
```
Q[t]·K[k] = e[t]^T A e[k]    ← current-token noise (depth-2)
          + e[t-1]^T A e[k]   ← INDUCTION (depth-2)
          + (other terms)
```
Both relevant terms have the same depth, so the model can't easily learn `A` to
upweight the induction term and downweight the current-token noise term. The
"current-vs-current" contamination is the bottleneck, not the bilinear depth.

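The expansion above can be checked numerically. The following is a minimal NumPy sketch, not the training code: it assumes `x` is the raw embedding (so the "(other terms)" vanish), uses random stand-in matrices for `W_Q` and `W_K`, and takes `A = W_Q^T W_K`. Sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 6                          # toy sizes, not the synth config

# Hypothetical stand-ins: random token representations and projections.
x = rng.standard_normal((T, d))
W_Q = rng.standard_normal((d, d))
W_K = rng.standard_normal((d, d))
A = W_Q.T @ W_K                       # bilinear form: q.k = x_q^T A x_k

t, k = 4, 2
q_shift = W_Q @ (x[t] + x[t - 1])     # Q[t] = W_Q (x[t] + x[t-1])
key = W_K @ x[k]

score = q_shift @ key
current_term = x[t] @ A @ x[k]        # current-token term
induction_term = x[t - 1] @ A @ x[k]  # induction term

# Both terms pass through the same A, so they cannot be reweighted separately.
assert np.isclose(score, current_term + induction_term)
```

Because the two terms share a single `A`, any change that amplifies the induction term also rescales the current-token term by the same bilinear map.
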
## Split-Q proposal (math memo §7.5)

Split heads into two groups:
- **Current heads** (first `(1-f)·H`): receive `Q[t] = W_Q · x[t]`
- **Prev heads** (last `f·H`): receive `Q[t] = W_Q · x[t-1]`

Each group has its own residual subspace (the embedding dim of those heads).
The prev heads' QK product is `e[t-1]^T A e[k]` — depth-2 AND uncontaminated by
current-token info, because those heads simply don't see `x[t]`.

Implementation cost: 2× q_proj forward (one for `x`, one for `x[t-1]`); same
parameter count. Could be optimized by splitting `W_Q` itself, but identity-matrix
mixing is simpler and matches existing param semantics.

## Predictions (vs Q-shift baseline, --no-alibi, gumbel+±1)

If split-Q's "clean prev subspace" hypothesis is right:

1. **E1-SQ0 top per-head IPMR > E1-Q0's 4.89%** at matched training steps.
2. **E1-SQ0 global IPMR > E1-Q0's 18.8%** (more heads find structure).
3. **copy_acc may stay low** at synth scale because the task isn't only induction
   — needs the rest of the architecture (FFN, output codebook) to actually copy.
   Best current 1-bit synth E1-LR0 only reaches 71-76% copy_acc at 5K steps.
4. **E1-SQR (split-Q + RWSP)** should beat E1-LR0 (5.98% top) since it combines
   the val_bpc-winning RWSP with the new split-Q induction primitive.

If split-Q fails, the math memo §7.5 hypothesis ("contamination, not depth") is
also refuted. That would suggest induction at synth is bounded by something else
(embedding orthogonality? capacity?).

## Implementation (this session)

- `synth_induction_train.py:198-228` — `TinyAttention.__init__` accepts
  `use_split_q` and `split_q_frac` (default 0.5).
- `synth_induction_train.py:241-269` — `forward()` branches on `use_split_q`:
  computes `Q_curr_full` from `x` and `Q_prev_full` from `F.pad(x[:, :-1])`,
  then concatenates first `(1-frac)*H` heads of curr with last `frac*H` of prev.
- `synth_induction_train.py:310-330` — `TinyBlock` plumbs flags.
- `synth_induction_train.py:335-365` — `TinyLM` plumbs `split_q_layers` (subset
  of layers to apply split-Q to).
- `synth_induction_train.py:539-554` — 5 new conditions: E1-SQ, E1-SQ0,
  E1-SQ25, E1-SQs, E1-SQR.
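
The head concatenation described for `forward()` can be sketched as follows. This is an illustrative NumPy version, not the code in `synth_induction_train.py`: the function name `split_q`, the row-vector convention `x @ W_Q`, and zero padding at `t = 0` are assumptions.

```python
import numpy as np

def split_q(x, W_Q, n_heads, split_q_frac=0.5):
    """Hypothetical split-Q sketch.

    x: (B, T, D), W_Q: (D, D). Returns Q with shape (B, T, n_heads, D // n_heads).
    """
    B, T, D = x.shape
    head_dim = D // n_heads
    n_prev = int(round(split_q_frac * n_heads))
    n_curr = n_heads - n_prev

    # x_prev[:, t] = x[:, t-1]; position 0 is zero-padded (assumed).
    x_prev = np.zeros_like(x)
    x_prev[:, 1:] = x[:, :-1]

    # Two projections through the same W_Q: same parameter count, 2x forward cost.
    q_curr_full = (x @ W_Q).reshape(B, T, n_heads, head_dim)
    q_prev_full = (x_prev @ W_Q).reshape(B, T, n_heads, head_dim)

    # First n_curr heads see x[t]; last n_prev heads see only x[t-1].
    return np.concatenate(
        [q_curr_full[:, :, :n_curr], q_prev_full[:, :, n_curr:]], axis=2
    )

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 8))
W_Q = rng.standard_normal((8, 8))
Q = split_q(x, W_Q, n_heads=4)
print(Q.shape)
```

With `n_heads=4` and the default fraction, heads 0-1 carry current-token queries and heads 2-3 carry previous-token queries.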

## Results

### 500-step smoke (E1-SQ0)
| Metric | Value |
|---|---|
| Top per-head IPMR | 6.52% (L0H2) |
| Global IPMR | 22.28% |
| copy_acc | 1.0% |

### 3K-step run (E1-SQ0, full match to prior Q-shift evaluation)
| Metric | Value | E1-Q0 @ 5K (refuted baseline) |
|---|---|---|
| Top per-head IPMR | **3.80% (L0H0)** | 4.89% (L1H2) |
| Global IPMR | **20.04%** | 18.8% |
| copy_acc | 2.9% | 4.6% |

**Result**: With split-Q applied at L0 only, top per-head IPMR gets *worse* with
longer training (6.52% @ 500 → 3.80% @ 3K). The early induction structure
visible at 500 steps degrades as training progresses. Global IPMR is slightly
better than Q-shift (20.04% vs 18.8%), but top per-head is worse (3.80% vs 4.89%).

**Interpretation**: gumbel+sign-STE training finds a local minimum that
*doesn't* keep the prev-token Q heads doing induction. The "clean prev
subspace" exists architecturally but the optimizer doesn't drive those heads
toward induction-aligned `W_Q^T W_K` when it can route capacity elsewhere.

This is a negative result that refutes math memo §7.5's "contamination, not
depth" hypothesis. The contamination hypothesis predicts that a cleaner
subspace yields better induction; instead, both Q-shift and split-Q hover
around 4-5% top per-head IPMR with mediocre global IPMR.

### Final 3K-step apples-to-apples comparison (--no-alibi, gumbel + ±1)

| Condition | Top per-head IPMR | Global IPMR | copy_acc |
|---|---|---|---|
| E1-Q0 (Q-shift L0)   | **4.35%** (L1H0/L0H2) | 20.11% | 3.0% |
| E1-SQ0 (split-Q L0)  | 3.80% (L0H0)           | 20.04% | 2.9% |
| E1-SQ  (split-Q all) | **4.35%** (L0H3)       | 20.38% | (~3%) |

**Verdict: REFUTED.** Split-Q does not improve over Q-shift at synth scale.
Both produce indistinguishable induction quality (top ~4%, global ~20%).

This refutes the math memo §7.5 contamination hypothesis. If "current-vs-prev
contamination" were the bottleneck, the architecturally clean separation in
split-Q should have produced clearly better induction. It didn't.

## Revised hypothesis: the bottleneck is *embedding orthogonality*, not architecture

All Q-shift family conditions (vanilla, split, K+Q combined) plateau at the
same 4-5% top-per-head IPMR. The induction signal `e[t-1]^T A e[k]` requires
`A` such that `e_v^T A e_u ≈ d · 𝟙[v=u]`. But at d=128 with vocab=64 random ±1
embeddings, the cross term `e_v^T e_u` (v≠u) is a sum of d independent ±1
signs: mean 0, standard deviation √d ≈ 11. Against the diagonal peak
`e_v^T A e_v ≈ d = 128`, that is a relative noise level of √d / d = 1/√d ≈ 9%,
i.e. a signal-to-noise ratio of only √d ≈ 11.
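
A quick Monte Carlo check of these overlap statistics, as a sketch under two assumptions that are not verified against the trained model: the embeddings are i.i.d. random ±1 vectors, and `A = I`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 128, 64

# Random ±1 embeddings, one row per token (assumed i.i.d. signs).
E = rng.choice([-1.0, 1.0], size=(vocab, d))

G = E @ E.T                               # Gram matrix, i.e. e_v^T A e_u with A = I
diag = np.diag(G)                         # exactly d for ±1 vectors
off = G[~np.eye(vocab, dtype=bool)]       # cross-token overlaps, v != u

print("diagonal peak:", diag.mean())
print("off-diagonal std:", off.std(), "vs sqrt(d) =", np.sqrt(d))
print("relative noise:", off.std() / d, "vs 1/sqrt(d) =", 1 / np.sqrt(d))
```

The off-diagonal standard deviation should land close to √d, which is the noise floor the argument relies on.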

This bounds how peaked the induction probabilities can get regardless of
architecture. To break past ~5% per-head IPMR at d=128, we'd need:

1. **Larger d** (d=256 or d=512 raises the SNR √d to 16 or 23, i.e. relative noise 1/√d of 6.3% or 4.4%)
2. **Orthogonalized embeddings** (a regularizer that pushes e_v ⊥ e_u for v≠u)
3. **Both** combined

Both are testable at synth scale. The §7.5 split-Q architecture is NOT the
bottleneck — it's the noise floor in the embedding-bilinear product.
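
One possible form of the orthogonality regularizer from item 2, sketched in NumPy. This is an assumption for illustration, not something implemented in `synth_induction_train.py`: the helper name `offdiag_penalty` and the choice of mean squared cosine similarity are hypothetical, and how such a penalty would interact with ±1 quantization is left open.

```python
import numpy as np

def offdiag_penalty(E):
    """Mean squared cosine similarity between distinct embedding rows.

    Hypothetical helper: 0 when all rows are mutually orthogonal,
    about 1/d for random ±1 rows of dimension d.
    """
    E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    G = E_unit @ E_unit.T                 # cosine-similarity Gram matrix
    off = G - np.diag(np.diag(G))         # zero out the diagonal
    n = E.shape[0]
    return float((off ** 2).sum() / (n * (n - 1)))

rng = np.random.default_rng(0)
d, vocab = 128, 64
E_random = rng.choice([-1.0, 1.0], size=(vocab, d))
# Orthonormal rows from a reduced QR factorization, for comparison.
Q_mat, _ = np.linalg.qr(rng.standard_normal((d, vocab)))
E_ortho = Q_mat.T

p_random = offdiag_penalty(E_random)
p_ortho = offdiag_penalty(E_ortho)
print(p_random, p_ortho)
```

Since vocab=64 is smaller than d=128, exactly orthogonal embeddings exist, so the penalty can in principle be driven to zero.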

## Implication for 300M v76 work

At 300M, d_model=1536 (12× larger than synth d=128). The expected SNR scales
as √d, so the induction signal should be ~3.5× cleaner than at synth. This
matches the empirical observation: the 300M v76 cold-start gets 10.21% top
per-head with 15 specialists ≥5%, while synth caps at ~5%.
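
The scaling arithmetic, spelled out as a short sketch. It assumes the √d SNR model from the previous section holds at every width, which is the hypothesis rather than a measured fact.

```python
import math

d_synth, d_300m = 128, 1536
ratio = d_300m / d_synth                 # width ratio between 300M and synth
snr_gain = math.sqrt(ratio)              # predicted SNR improvement under the sqrt(d) model

print(f"width ratio: {ratio:.0f}x, predicted SNR gain: {snr_gain:.2f}x")
for d in (128, 256, 512, 1536):
    # SNR = sqrt(d); relative noise = 1 / sqrt(d)
    print(d, round(math.sqrt(d), 1), f"{1 / math.sqrt(d):.1%}")
```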

The 75M synth-tested Q-shift / split-Q / K-shift are likely all "below the
noise floor at d=128" — they only show benefits when d is large enough.

## Status

NEGATIVE RESULT. Split-Q implemented and tested; refuted at synth. Not worth
running at 75M scale based on the synth signal. The math memo's §7.5 fix
hypothesis is wrong; the right next step is testing embedding orthogonality
or running at d=256+ synth scale.

## Open questions

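1. Does the gain hold at 3K steps (matched to prior Q-shift evaluation)?
   Answered by the 3K-step run above: no, top per-head IPMR fell from 6.52%
   at 500 steps to 3.80% at 3K.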
2. Does E1-SQR (split-Q + RWSP) beat E1-LR0 (the current synth 1-bit ceiling
   without softmax+fp32)?
3. Does the result transfer to 75M FineWeb-Edu? Synth IPMR has been a poor
   predictor at scale (75M v76 RWSP+K-shift wins val_bpc despite no induction
   improvement). Need a 75M run to test.