	# Split-Q architecture: synth-scale prototype

	## Motivation

	From `v76_math_derivation.md` §7.5: vanilla Q-shift (`Q[t] = W_Q(x[t] + x[t-1])`)
	collapses the induction bilinear from 4-deep to 2-deep, but the Q-shift conditions
	all failed the Math-G1 gate at synth scale (3-5% copy_acc, 19-21% global IPMR).

	Why Q-shift failed: with `Q[t] = W_Q(x[t] + x[t-1])`, the QK product expands to
	```
	Q[t]·K[k] = e[t]^T A e[k]      ← current-token noise (depth-2)
	          + e[t-1]^T A e[k]    ← INDUCTION (depth-2)
	          + (other terms)
	```
	Both relevant terms have the same depth, so the model can't easily learn `A` to
	upweight the induction term and downweight the current-token noise term. The
	"current-vs-current" contamination is the bottleneck, not the bilinear depth.

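The expansion above is plain bilinearity, which a few lines of Python can confirm numerically. This is a minimal sketch, not code from `synth_induction_train.py`: the matrix `A` stands in for `W_Q^T W_K`, the toy dimension `d = 8` and the helper names are assumptions, and the "(other terms)" are ignored.

```python
import random

def matvec(A, v):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def bilinear(u, A, v):
    """Return the scalar u^T A v."""
    return sum(ui * avi for ui, avi in zip(u, matvec(A, v)))

rng = random.Random(0)
d = 8  # toy dimension (assumption; the synth runs use d=128)

# A stands in for W_Q^T W_K; random Gaussian entries for illustration only.
A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]

# Random ±1 embeddings for the current token, previous token, and a key token.
e_t = [rng.choice((-1.0, 1.0)) for _ in range(d)]
e_prev = [rng.choice((-1.0, 1.0)) for _ in range(d)]
e_k = [rng.choice((-1.0, 1.0)) for _ in range(d)]

# Q-shift feeds x[t] + x[t-1] into the query side.
q_in = [a + b for a, b in zip(e_t, e_prev)]

full = bilinear(q_in, A, e_k)
noise_term = bilinear(e_t, A, e_k)         # current-token term
induction_term = bilinear(e_prev, A, e_k)  # induction term

print(full, noise_term + induction_term)
```

Both terms pass through the same `A`, so nothing in the Q-shift score lets the model scale one without the other.
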
	## Split-Q proposal (math memo §7.5)

	Split heads into two groups:
	- Current heads (first `(1-f)·H`): receive `Q[t] = W_Q · x[t]`
	- Prev heads (last `f·H`): receive `Q[t] = W_Q · x[t-1]`

	Each group has its own residual subspace (the embedding dim of those heads).
	The prev heads' QK product is `e[t-1]^T A e[k]` — depth-2 AND uncontaminated by
	current-token info, because those heads simply don't see `x[t]`.

	Implementation cost: 2× q_proj forward (one for `x`, one for `x[t-1]`); same
	parameter count. Could be optimized by splitting `W_Q` itself, but identity-matrix
	mixing is simpler and matches existing param semantics.
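
One possible reading of "splitting `W_Q` itself" is to apply only the current-head rows of `W_Q` to `x[t]` and only the prev-head rows to `x[t-1]`. The sketch below checks that this gives the same query as two full projections followed by head selection. It assumes heads are laid out contiguously along the output rows of `W_Q`; the toy sizes and variable names are hypothetical.

```python
import random

def matvec(W, v):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

rng = random.Random(0)
H, head_dim, d_model, f = 4, 2, 8, 0.5  # toy sizes (assumptions)
n_prev = int(f * H)                      # number of prev heads
cut = (H - n_prev) * head_dim            # rows [0, cut) belong to current heads

W_Q = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(H * head_dim)]
x_t = [rng.choice((-1.0, 1.0)) for _ in range(d_model)]
x_prev = [rng.choice((-1.0, 1.0)) for _ in range(d_model)]

# Variant 1: two full projections, then keep the relevant heads from each.
q_two_pass = matvec(W_Q, x_t)[:cut] + matvec(W_Q, x_prev)[cut:]

# Variant 2: project each input through only the rows it needs.
q_row_split = matvec(W_Q[:cut], x_t) + matvec(W_Q[cut:], x_prev)

max_diff = max(abs(a - b) for a, b in zip(q_two_pass, q_row_split))
print(max_diff)
```

Under this layout assumption the second variant does the work of one projection instead of two, with the same parameters.
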

	## Predictions (vs Q-shift baseline, --no-alibi, gumbel+±1)

	If split-Q's "clean prev subspace" hypothesis is right:

	1. E1-SQ0 top per-head IPMR > E1-Q0's 4.89% at matched training steps.
	2. E1-SQ0 global IPMR > E1-Q0's 18.8% (more heads find structure).
	3. copy_acc may stay low at synth scale because the task isn't only induction
	— needs the rest of the architecture (FFN, output codebook) to actually copy.
	Best current 1-bit synth E1-LR0 only reaches 71-76% copy_acc at 5K steps.
	4. E1-SQR (split-Q + RWSP) should beat E1-LR0 (5.98% top) since it combines
	the val_bpc-winning RWSP with the new split-Q induction primitive.

	If split-Q fails, the math memo §7.5 hypothesis ("contamination, not depth") is
	also refuted. That would suggest induction at synth is bounded by something else
	(embedding orthogonality? capacity?).

	## Implementation (this session)

	- `synth_induction_train.py:198-228` — `TinyAttention.__init__` accepts
	`use_split_q` and `split_q_frac` (default 0.5).
	- `synth_induction_train.py:241-269` — `forward()` branches on `use_split_q`:
	computes `Q_curr_full` from `x` and `Q_prev_full` from `F.pad(x[:, :-1])`,
	then concatenates the first `(1-frac)·H` heads of curr with the last `frac·H` heads of prev.
	- `synth_induction_train.py:310-330` — `TinyBlock` plumbs flags.
	- `synth_induction_train.py:335-365` — `TinyLM` plumbs `split_q_layers` (subset
	of layers to apply split-Q to).
	- `synth_induction_train.py:539-554` — 5 new conditions: E1-SQ, E1-SQ0,
	E1-SQ25, E1-SQs, E1-SQR.
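
The forward branch described above can be sketched in pure Python. This is an illustration under stated assumptions, not the code in `synth_induction_train.py`: the function names are hypothetical, position 0 of the shifted input is assumed to be zero-padded, and batching and the rest of attention are omitted. The check at the end confirms that the prev heads at position `t` do not see `x[t]`.

```python
import random

def matvec(W, v):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def shift_prev(xs):
    """Return x_prev with x_prev[t] = x[t-1]; position 0 gets zeros (assumed padding)."""
    return [[0.0] * len(xs[0])] + [list(x) for x in xs[:-1]]

def split_q_forward(xs, W_Q, num_heads, frac):
    """Queries per position: first (1-frac)*H heads from x[t], last frac*H from x[t-1]."""
    head_dim = len(W_Q) // num_heads
    n_prev = int(frac * num_heads)
    cut = (num_heads - n_prev) * head_dim
    q_curr_full = [matvec(W_Q, x) for x in xs]
    q_prev_full = [matvec(W_Q, x) for x in shift_prev(xs)]
    return [qc[:cut] + qp[cut:] for qc, qp in zip(q_curr_full, q_prev_full)], cut

rng = random.Random(0)
T, d_model, H = 5, 8, 4  # toy sizes (assumptions)
W_Q = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(d_model)]
xs = [[rng.choice((-1.0, 1.0)) for _ in range(d_model)] for _ in range(T)]

q, cut = split_q_forward(xs, W_Q, H, 0.5)

# Flip the token at position 3 and recompute.
xs2 = [list(x) for x in xs]
xs2[3] = [-v for v in xs2[3]]
q2, _ = split_q_forward(xs2, W_Q, H, 0.5)

prev_unchanged_at_3 = q[3][cut:] == q2[3][cut:]  # prev heads at t=3 ignore x[3]
curr_changed_at_3 = q[3][:cut] != q2[3][:cut]    # current heads at t=3 see x[3]
prev_changed_at_4 = q[4][cut:] != q2[4][cut:]    # prev heads at t=4 see x[3]
print(prev_unchanged_at_3, curr_changed_at_3, prev_changed_at_4)
```
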

	## Results

	### 500-step smoke (E1-SQ0)
	| Metric | Value |
	|---|---|
	| Top per-head IPMR | 6.52% (L0H2) |
	| Global IPMR | 22.28% |
	| copy_acc | 1.0% |

	### 3K-step run (E1-SQ0, full match to prior Q-shift evaluation)
	| Metric | E1-SQ0 @ 3K | E1-Q0 @ 5K (refuted baseline) |
	|---|---|---|
	| Top per-head IPMR | 3.80% (L0H0) | 4.89% (L1H2) |
	| Global IPMR | 20.04% | 18.8% |
	| copy_acc | 2.9% | 4.6% |

	Result: split-Q applied only at L0 gets WORSE on top per-head IPMR with longer
	training (6.52% @ 500 → 3.80% @ 3K). The early induction structure visible at
	500 steps degrades as training progresses. Against the 5K-step Q-shift baseline,
	global IPMR is slightly better (20.04% vs 18.8%) but top per-head is worse
	(3.80% vs 4.89%).

	Interpretation: gumbel+sign-STE training finds a local minimum that
	doesn't keep the prev-token Q heads doing induction. The "clean prev
	subspace" exists architecturally but the optimizer doesn't drive those heads
	toward induction-aligned `W_Q^T W_K` when it can route capacity elsewhere.

	This is a NEGATIVE result for math memo §7.5's "contamination, not depth"
	hypothesis. That hypothesis predicts cleaner subspace → better induction;
	instead, both Q-shift and split-Q hover around 4-5% top per-head IPMR with
	mediocre global IPMR.

	### Final 3K-step apples-to-apples comparison (--no-alibi, gumbel + ±1)

	| Condition | Top per-head IPMR | Global IPMR | copy_acc |
	|---|---|---|---|
	| E1-Q0 (Q-shift L0) | 4.35% (L1H0/L0H2) | 20.11% | 3.0% |
	| E1-SQ0 (split-Q L0) | 3.80% (L0H0) | 20.04% | 2.9% |
	| E1-SQ (split-Q all) | 4.35% (L0H3) | 20.38% | (~3%) |

	Verdict: REFUTED. Split-Q does not improve over Q-shift at synth scale.
	Both produce indistinguishable induction quality (top ~4%, global ~20%).

	This refutes the math memo §7.5 contamination hypothesis. If "current-vs-current"
	contamination were the bottleneck, the architecturally clean separation in
	split-Q should have produced clearly better induction. It didn't.

	## Revised hypothesis: the bottleneck is embedding orthogonality, not architecture

	All Q-shift family conditions (vanilla, split, K+Q combined) plateau at the
	same 4-5% top-per-head IPMR. The induction signal `e[t-1]^T A e[k]` requires
	`A` such that `e_v^T A e_u ≈ d · 𝟙[v=u]`. But at d=128 with vocab=64 ±1
	embeddings, the off-diagonal product `e_v^T e_u` (v≠u) is a sum of d independent
	±1 terms, i.e. `2·Binomial(d, 1/2) − d`, with mean 0 and standard deviation
	√d ≈ 11. Against the diagonal peak `e_v^T A e_v ≈ d = 128`, that is a relative
	noise level of √d / d = 1/√d ≈ 9%, i.e. a signal-to-noise ratio of only √d ≈ 11.
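
The noise-floor estimate can be checked with a small Monte Carlo run. This sketch assumes the best case `A = I` and i.i.d. random ±1 embeddings with vocab=64; the function name and seed are hypothetical, and trained embeddings may behave differently.

```python
import math
import random

def offdiag_stats(d, vocab=64, seed=0):
    """Return (diagonal peak, std of off-diagonal dot products) for random ±1 embeddings."""
    rng = random.Random(seed)
    E = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(vocab)]
    # All pairwise inner products e_v^T e_u for v < u.
    dots = [sum(a * b for a, b in zip(E[v], E[u]))
            for v in range(vocab) for u in range(v + 1, vocab)]
    mean = sum(dots) / len(dots)
    std = math.sqrt(sum((x - mean) ** 2 for x in dots) / len(dots))
    diag = sum(a * a for a in E[0])  # e_v^T e_v, always exactly d for ±1 entries
    return diag, std

for d in (128, 256, 512):
    diag, std = offdiag_stats(d)
    # Columns: d, diagonal peak, empirical noise std, sqrt(d), relative noise std/diag.
    print(d, diag, round(std, 1), round(math.sqrt(d), 1), round(std / diag, 3))
```

The empirical standard deviation should track √d, so the relative noise shrinks only as 1/√d when d grows.
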

	This bounds how peaked the induction probabilities can get regardless of
	architecture. To break past ~5% per-head IPMR at d=128, we'd need:

	1. Larger d (d=256 or d=512 brings the √d SNR to ≈16 or ≈23, i.e. relative noise 1/√d ≈ 6.3% or 4.4%)
	2. Orthogonalized embeddings (a regularizer that pushes e_v ⊥ e_u for v≠u)
	3. Both combined

	Both are testable at synth scale. The §7.5 split-Q architecture is NOT the
	bottleneck — it's the noise floor in the embedding-bilinear product.
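
One candidate form for the orthogonality regularizer in option 2 is the mean squared normalized off-diagonal inner product. The sketch below is an assumption about what such a penalty could look like, not something taken from the codebase. It is written in pure Python for clarity; a training version would need a differentiable implementation and some treatment of the ±1 sign-STE embeddings.

```python
def ortho_penalty(E):
    """Mean of (e_v . e_u / d)^2 over all pairs v < u of rows of E."""
    n = len(E)
    d = len(E[0])
    total = 0.0
    for v in range(n):
        for u in range(v + 1, n):
            dot = sum(a * b for a, b in zip(E[v], E[u]))
            total += (dot / d) ** 2
    return total / (n * (n - 1) / 2)

# Mutually orthogonal ±1 rows (a 4x4 Hadamard matrix): penalty is zero.
hadamard = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]
print(ortho_penalty(hadamard))

# Two identical rows: worst case, penalty is one.
print(ortho_penalty([[1, 1], [1, 1]]))
```

For i.i.d. random ±1 rows the expected penalty is 1/d, which is the same noise floor discussed above expressed as a variance.
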

	## Implication for 300M v76 work

	At 300M, d_model=1536 (12× larger than synth d=128). The expected SNR scales
	as √d, so the induction signal should be ~3.5× cleaner than at synth. This
	matches the empirical observation: the 300M v76 cold-start gets 10.21% top
	per-head with 15 specialists ≥5%, while synth caps at ~5%.
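
The ~3.5× figure follows directly from the √d scaling. A quick arithmetic check, assuming the same random ±1, `A = I` noise model as at synth scale (variable names are illustrative):

```python
import math

d_synth, d_300m = 128, 1536

# SNR scales as sqrt(d), so the gain is the square root of the width ratio.
snr_gain = math.sqrt(d_300m / d_synth)
print(round(snr_gain, 2))  # 3.46
```
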

	The 75M synth-tested Q-shift / split-Q / K-shift are likely all "below the
	noise floor at d=128" — they only show benefits when d is large enough.

	## Status

	NEGATIVE RESULT. Split-Q implemented and tested; refuted at synth. Not worth
	running at 75M scale based on the synth signal. The math memo's §7.5 fix
	hypothesis is wrong; the right next step is testing embedding orthogonality
	or running at d=256+ synth scale.

	## Open questions

	1. Does the 500-step gain hold at 3K steps (matched to prior Q-shift evaluation)?
	Answered in Results: no, top per-head IPMR fell from 6.52% to 3.80%.
	2. Does E1-SQR (split-Q + RWSP) beat E1-LR0 (the current synth 1-bit ceiling
	without softmax+fp32)?
	3. Does the result transfer to 75M FineWeb-Edu? Synth IPMR has been a poor
	predictor at scale (75M v76 RWSP+K-shift wins val_bpc despite no induction
	improvement). Need a 75M run to test.