parrishcorcoran committed
Commit 3d5e0cf · verified · 1 Parent(s): 28fb0c0

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +241 -0
README.md ADDED
@@ -0,0 +1,241 @@
+ ---
+ license: mit
+ tags:
+ - llm-inference
+ - speculative-decoding
+ - medusa
+ - bitnet
+ - adaptive-compute
+ - efficiency
+ - physics-informed
+ datasets:
+ - parrishcorcoran/MedusaBitNet-48seq-cache
+ pipeline_tag: text-generation
+ ---
+
+ # unified-gate
+
+ > **LLM inference is overbudgeted by ~1000×. The per-token difficulty signal lives on a ~7-dimensional manifold. We measured it. This is the gate.**
+
+ - **Code & training pipeline**: [github.com/parrishcorcoran/unified-gate](https://github.com/parrishcorcoran/unified-gate)
+ - **Research apparatus**: [github.com/parrishcorcoran/MedusaBitNet](https://github.com/parrishcorcoran/MedusaBitNet)
+ - **Companion inference efficiency thesis** (theory): `THEORY.md` in the GitHub repo
+ - **26 KB deployment artifact**: `gate_k20.pt` (included here)
+
+ ---
+
+ ## The one-minute pitch
+
+ Every speculative-decoding / early-exit / Medusa / adaptive-compute paper of the last three years is *the same sensor in a different costume* measuring *one underlying signal*: how sharp is the next-token distribution. The field keeps shipping new sensors and never builds the *controller* that fuses them.
+
+ This is the controller. It's a 20-feature, 64×64 MLP (26 KB) that decides, per token, whether to accept a cheap draft or run the full backbone. Held-out measurement on BitNet b1.58 2B: **10.6% skip at 95% fidelity**, 14.1% skip at 90% fidelity (peak K=40-50, replicated ±0.3% over 5 seeds).
+
+ The *provocative* claim is not the skip rate. It's the dimensionality: the per-token difficulty surface is **~7-dimensional**, measured by TwoNN on final-layer hidden states, across two architectures (BitNet 2B + Llama 3.1 8B). That's a physics-grounded ceiling, not an engineering target. It says per-token decision-making has a compute floor and we're nowhere near it.
+
+ ---
+
+ ## The three claims, each measured
+
+ ### 1. The information is on a thin surface, not in the bulk
+
+ Running a 30-layer × 2560-dim backbone computation for every token is redundant with what Medusa heads already read off the cached hidden state. That's the holographic principle applied to transformer inference: the heads are empirical proof the future tokens were already on the surface. Bulk volume is being recomputed from boundary data per step.
+
+ ### 2. Compute and entropy are inversely correlated
+
+ Conditional next-token entropy *decreases* with context length (the cloud tightens as context locks in plausible completions). Transformer compute per token *increases* with context length (O(N²) attention, bigger KV cache). Current decoders scale compute up exactly when the information requirement scales down. RNNs had the right compute shape; we traded it for capacity.
+
+ ### 3. The gate's dimensionality is set by physics
+
+ Per-sequence intrinsic dim of final-layer hidden states, measured by TwoNN (Facco et al. 2017):
+
+ | Model | Ambient dim | Per-seq intrinsic |
+ |---|---|---|
+ | BitNet b1.58 2B (result_norm) | 2560 | **7.3** |
+ | Llama 3.1 8B Q4_K_M (result_norm) | 4096 | **6.9** |
+
+ Second cross-model metric: raw hidden-state participation ratio (PR) divided by ambient dim:
+
+ | Model | PR | PR / ambient |
+ |---|---|---|
+ | BitNet 2B | 85 | **3.3%** |
+ | Llama 3.1 8B | 151 | **3.7%** |
+
+ Two independent measurements agree that both models concentrate per-token decision-making into ~7 dimensions out of thousands. When we train the gate on top-K features ranked by gradient importance, **K=7 recovers ~70% of the K=50 peak skip**. The engineering knee of the feature-count curve lands exactly at the physics ceiling.
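As a rough illustration of the two estimators named above, here is a minimal standard-library sketch. It assumes the maximum-likelihood form of TwoNN, d = N / Σ log(r2/r1), rather than the line fit described by Facco et al., and it computes the participation ratio as tr(C)² / tr(C²). The function names are hypothetical and the repo's own measurement code may differ.

```python
import math
import random


def twonn_dimension(points):
    """Maximum-likelihood form of the TwoNN estimate: d = N / sum(log(r2 / r1))."""
    log_ratios = []
    for i, p in enumerate(points):
        # Distances from p to every other point; keep the two smallest.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0.0:  # skip exact duplicates, which make the ratio undefined
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


def participation_ratio(points):
    """PR = tr(C)^2 / tr(C^2) for the covariance C of the rows in `points`."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(dim)]
    centered = [[p[k] - mean[k] for k in range(dim)] for p in points]
    cov = [[sum(row[a] * row[b] for row in centered) / n for b in range(dim)]
           for a in range(dim)]
    trace = sum(cov[a][a] for a in range(dim))
    # For a symmetric matrix, tr(C^2) equals the sum of squared entries.
    trace_sq = sum(cov[a][b] ** 2 for a in range(dim) for b in range(dim))
    return trace * trace / trace_sq


# Toy check: a 2-D sheet embedded in 10 ambient dimensions.
rng = random.Random(0)
sheet = [[rng.random(), rng.random()] + [0.0] * 8 for _ in range(400)]
d_hat = twonn_dimension(sheet)
pr = participation_ratio(sheet)
print(f"TwoNN intrinsic dim ~ {d_hat:.2f}, participation ratio ~ {pr:.2f}")
```

On the toy sheet both numbers should land near 2 even though the ambient dimension is 10, which is the same kind of gap the tables above report for real hidden states.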
+
+ ---
+
+ ## The measurement
+
+ 5-seed K-sweep on the BitNet 2B held-out set. Skip at λ=0.95 fidelity (mean ± std):
+
+ ```
+  K   skip@λ=0.95      σ-gap vs K=70
+  7    7.3% (single)                  (matches per-seq intrinsic dim, ~70% of peak)
+ 15    9.2% ± 0.3%     -2.4σ          (lower, expected)
+ 20    9.8% ± 0.2%     +0.1σ          (matches K=70)
+ 25   10.1% ± 0.2%     +1.1σ
+ 30   10.5% ± 0.3%     +2.1σ
+ 40   10.6% ± 0.2%     +3.2σ          ← peak
+ 50   10.7% ± 0.2%     +3.4σ          ← peak
+ 70    9.7% ± 0.3%     baseline
+ ```
+
+ **The K=70 bundle is over-parameterized.** Adding features past ~50 degrades the gate by ~9% in relative terms (10.7% → 9.7% skip), a ~3σ effect replicated across seeds. This is the inference analog of *parameter count ≠ information content*: once you cross the per-seq manifold ceiling, extra features are just overfitting noise.
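The ratios quoted around the sweep can be re-derived from the table's mean skip rates alone. The short sketch below does only that arithmetic; the dictionary is copied from the table and the variable names are illustrative.

```python
# Mean skip rates at λ=0.95, copied from the K-sweep table above.
skip_by_k = {7: 7.3, 15: 9.2, 20: 9.8, 25: 10.1, 30: 10.5, 40: 10.6, 50: 10.7, 70: 9.7}

peak_k = max(skip_by_k, key=skip_by_k.get)
peak = skip_by_k[peak_k]

for k, skip in skip_by_k.items():
    print(f"K={k:>2}  skip={skip:4.1f}%  fraction of peak={skip / peak:.0%}")

# Relative loss from adding features beyond the peak.
drop_at_70 = (peak - skip_by_k[70]) / peak
print(f"K=70 loses {drop_at_70:.1%} relative to K={peak_k}")
```

From these means, K=7 comes out at about 68% of the K=50 peak, and K=70 gives up roughly 9.3% of it.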
+
+ ---
+
+ ## Architecture (gate_k20.pt)
+
+ - **20 input features** selected by gradient importance from a 70-feature physics-aperture bundle
+ - **Two hidden layers** of 64 ReLU units each
+ - **Single sigmoid output** (skip probability)
+ - **~6,500 parameters**, 26 KB on disk
+ - **Calibrated thresholds** for λ ∈ {0.85, 0.90, 0.95, 0.99} bundled in the checkpoint
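To make the shape of the network concrete, here is a minimal pure-Python sketch of a 20 → 64 → 64 → 1 MLP with ReLU hidden layers and a sigmoid output. The weights are random placeholders, the helper names are hypothetical, and the real gate loads trained values from `gate_k20.pt`, so this shows the layer layout only.

```python
import math
import random

LAYER_SIZES = [20, 64, 64, 1]  # input features, two hidden layers, one output


def init_params(sizes, seed=0):
    """Random weights for illustration only; the real gate loads trained values."""
    rng = random.Random(seed)
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.gauss(0.0, 1.0 / math.sqrt(fan_in)) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        biases = [0.0] * fan_out
        params.append((weights, biases))
    return params


def skip_probability(params, features):
    """Forward pass: ReLU on hidden layers, sigmoid on the single output."""
    activations = features
    for index, (weights, biases) in enumerate(params):
        pre = [sum(w * a for w, a in zip(row, activations)) + b
               for row, b in zip(weights, biases)]
        is_output = index == len(params) - 1
        activations = pre if is_output else [max(0.0, v) for v in pre]
    return 1.0 / (1.0 + math.exp(-activations[0]))


params = init_params(LAYER_SIZES)
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in params)
p_skip = skip_probability(params, [0.0] * 20)
print(n_params, p_skip)
```

With biases, these layer sizes hold 5,569 weights. The ~6,500 figure above may therefore count values this sketch leaves out; that is an observation about the sketch, not something the source confirms.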
+
+ ### The 20 features
+
+ Ranked by gradient importance on the held-out set:
+
+ 1. `sup_1`: superposition effective rank (exp(entropy of top-K softmax))
+ 2. `cluster_1`: K-means soft-cluster entropy
+ 3. `logit_gap`: head-0 top1 minus top2 logit
+ 4. `content_conf`: head-0 top-1 softmax
+ 5. `cluster_0`: K-means min-distance-to-center
+ 6. `layer_5`: cos(h_5, h_15) Ryu-Takayanagi layer-wise similarity
+ 7. `layer_9`: layer-wise norm_15 (log)
+ 8. `layer_7`: cos(h_5, h_29)
+ 9. `top10_cov`: head-0 cumulative top-10 probability
+ 10. `treuse_2`: token-reuse rank within recent window (H2O lexical)
+ 11. `agreement_count`: head-0 arg-max matches head-k lagged
+ 12. `fe_1`: entropy-adjusted free-energy analog
+ 13. `rg_2`: renormalization-group divergence at scale 9
+ 14. `mom_0`: head-0 softmax 3rd moment (skewness)
+ 15. `vel_0`: hidden-state velocity ‖h_t − h_{t-1}‖
+ 16. `fe_0`: log(1 + 0.01 · cluster_mindist)
+ 17. `hnorm_0`: log(1 + ‖h_t‖)
+ 18. `layer_1`: log(1 + velocity 15→29)
+ 19. `nbr_0`: distance to nearest recent hidden state (H2O temporal)
+ 20. `sup_0`: top-K token-embedding spread in hidden space
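Several of the listed features are simple functions of one head's logits. The sketch below gives one plausible reading of `logit_gap`, `content_conf`, `top10_cov` and `sup_1`. It assumes that "top-K softmax" means the top-K probabilities renormalized to sum to one, with K=10; both are assumptions, and the repo's `extract_all_features` may define them differently.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_features(logits, top_k=10):
    """Illustrative versions of logit_gap, content_conf, top10_cov and sup_1."""
    ranked_logits = sorted(logits, reverse=True)
    probs = sorted(softmax(logits), reverse=True)

    top = probs[:top_k]
    mass = sum(top)
    renorm = [p / mass for p in top]  # assumption: entropy over renormalized top-K
    entropy = -sum(p * math.log(p) for p in renorm if p > 0.0)

    return {
        "logit_gap": ranked_logits[0] - ranked_logits[1],
        "content_conf": probs[0],
        "top10_cov": sum(probs[:10]),
        "sup_1": math.exp(entropy),  # effective number of live candidates
    }


sharp = confidence_features([8.0] + [0.0] * 99)   # one dominant token
flat = confidence_features([0.0] * 100)           # uniform over 100 tokens
print(sharp)
print(flat)
```

A sharp distribution gives a large gap and an effective rank near 1; a flat one gives zero gap and an effective rank near K. These are the two ends of the "how sharp is the next-token distribution" signal.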
+
+ Six framings from the theory thesis, each contributing:
+ - **Holographic** (cluster, neighborhood, free-energy)
+ - **Electron-cloud / superposition** (sup_spread, sup_eff_rank, moments)
+ - **Ryu-Takayanagi depth projection** (layer-wise 5/15/29 features, the biggest single group)
+ - **H2O heavy-hitters** (token-reuse, neighborhood)
+ - **Renormalization group** (multi-scale coarse-graining divergence)
+ - **Base information-theory** (confidence, logit gap, covers, agreement)
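The hidden-state features in the layer-wise and velocity groups reduce to norms and cosines. As a hedged sketch, the block below computes `vel_0`, `hnorm_0` and a cos(h_5, h_15)-style similarity for a single token from plain lists. The function and key names are illustrative, not the repo's API.

```python
import math


def norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def cosine(a, b):
    denom = norm(a) * norm(b)
    return sum(x * y for x, y in zip(a, b)) / denom if denom > 0.0 else 0.0


def geometry_features(h_prev, h_curr, h_early, h_mid):
    """Illustrative vel_0, hnorm_0 and an early/mid layer cosine for one token."""
    velocity = norm([c - p for c, p in zip(h_curr, h_prev)])
    return {
        "vel_0": velocity,                               # ||h_t - h_{t-1}||
        "hnorm_0": math.log1p(norm(h_curr)),             # log(1 + ||h_t||)
        "layer_cos_early_mid": cosine(h_early, h_mid),   # cos(h_5, h_15)
    }


feats = geometry_features(
    h_prev=[0.0, 0.0, 0.0],
    h_curr=[3.0, 4.0, 0.0],
    h_early=[1.0, 0.0, 0.0],
    h_mid=[1.0, 1.0, 0.0],
)
print(feats)
```

In practice the inputs would be rows of the `[T, H]` arrays passed to the extractor, one token at a time or vectorized.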
+
+ ---
+
+ ## Usage
+
+ ```python
+ import torch
+ from unified_gate import Gate, extract_all_features
+
+ gate = Gate("gate_k20.pt")
+
+ # Per-sequence feature extraction
+ X = extract_all_features(
+     hidden_last=h29,          # [T, H] final-layer result_norm, float32
+     hidden_mid=h15,           # [T, H] middle layer
+     hidden_early=h5,          # [T, H] early layer
+     head_logits=logits,       # [T, K_heads, V] Medusa head logits
+     lm_head=lm_head_np,       # [V, H] output embeddings
+     tokens=tokens,            # [T] token ids
+     period_ids=period_ids,    # precomputed from tokenizer
+     newline_ids=newline_ids,
+     cluster_centers=centers,  # K=32 pre-fit centers
+ )                             # returns [T-8, 70] float32
+
+ # Skip decision
+ scores = gate.score(X)        # skip probability per token
+ skip_mask = gate.skip_mask(X, fidelity=0.95)
+ # Accept Medusa draft where skip_mask is True; re-run backbone where False.
+ ```
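The source does not show how `gate.skip_mask` turns a fidelity target into a score cutoff. One plausible calibration is sketched below, under the assumption that fidelity means the fraction of skipped tokens whose draft matched the full backbone. The function name and the toy data are hypothetical.

```python
def calibrate_threshold(scores, draft_correct, fidelity):
    """Lowest score cutoff whose skipped set is still >= `fidelity` correct.

    Returns (threshold, skip_rate); threshold is None if no cutoff qualifies.
    """
    ranked = sorted(zip(scores, draft_correct), key=lambda pair: -pair[0])
    best_threshold, best_skip = None, 0.0
    correct = 0
    for count, (score, ok) in enumerate(ranked, start=1):
        correct += bool(ok)
        # Tied scores must be skipped together, so only cut between distinct values.
        at_boundary = count == len(ranked) or ranked[count][0] < score
        if at_boundary and correct / count >= fidelity:
            best_threshold, best_skip = score, count / len(ranked)
    return best_threshold, best_skip


# Toy held-out set: gate scores and whether the cheap draft matched the backbone.
scores = [0.99, 0.97, 0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
draft_correct = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

threshold, skip_rate = calibrate_threshold(scores, draft_correct, fidelity=0.95)
skip_mask = [s >= threshold for s in scores]
print(threshold, skip_rate, skip_mask)
```

Lowering the fidelity target admits more tokens into the skipped set, which is the same trade-off as the 95% versus 90% figures quoted in the pitch.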
+
+ Install from GitHub:
+
+ ```bash
+ pip install git+https://github.com/parrishcorcoran/unified-gate.git
+ ```
+
+ Reproducibility:
+
+ ```bash
+ git clone https://github.com/parrishcorcoran/unified-gate
+ cd unified-gate
+ python scripts/reproduce.py --medusabitnet-root /path/to/MedusaBitNet
+ ```
+
+ The reproduced frontier matches the stored one to within ±0.001 absolute skip.
+
+ ---
+
+ ## Cross-model scope and limits
+
+ **Validated on**:
+ - BitNet b1.58 2B (primary training + held-out measurement)
+ - Llama 3.1 8B Q4_K_M (cross-model TwoNN intrinsic-dim agreement)
+
+ **Not yet validated on**:
+ - Wall-clock speedup on real hardware (the systems paper follow-up)
+ - Much larger models (70B+)
+ - Non-English / specialized domains
+
+ **Known limits**:
+ - The gate is trained on BitNet-specific Medusa head acceptance. Cross-model *deployment* requires retraining the 64×64 MLP on target-model head acceptances. The *feature extractor* generalizes; the MLP weights don't.
+ - `gate_k20.pt`'s `agreement_count` feature is a 0/1 logical OR (numpy 2.x bool-add semantics in the training pipeline), not a 0-3 count. A corrected retraining is on the v0.3 roadmap. In the measured frontier this is empirically fine, but it's a lurking name/semantics mismatch worth flagging.
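The `agreement_count` mismatch is easy to reproduce in isolation. The training code itself is not shown here, so the snippet below is only a toy reconstruction with made-up agreement flags: adding NumPy boolean arrays keeps the boolean dtype, so `+` behaves as a logical OR, and casting to an integer type first yields the intended count.

```python
import numpy as np

# Whether heads 1..3 agree with head 0 at four token positions (toy values).
head1 = np.array([True, True, False, False])
head2 = np.array([True, False, False, True])
head3 = np.array([True, True, False, False])

# Adding boolean arrays stays boolean, so "+" acts as a logical OR.
as_or = head1 + head2 + head3

# Casting first gives the intended 0-3 count.
as_count = head1.astype(np.int64) + head2.astype(np.int64) + head3.astype(np.int64)

print(as_or.dtype, as_or.astype(int).tolist())   # bool [1, 1, 0, 1]
print(as_count.dtype, as_count.tolist())         # int64 [3, 2, 0, 1]
```

A gate trained on the OR version only sees whether any head agrees, not how many, which is the name/semantics gap flagged above.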
+
+ ---
+
+ ## Theoretical framework
+
+ Six equivalent framings (not six different ideas, but one underlying insight seen from six angles):
+
+ 1. **Holographic principle / black-hole boundary layer**: information about the completion is on a thin surface of the hidden state, not in the bulk compute
+ 2. **Electron cloud / quantum probability**: there is no "correct" next token; the cloud *is* the observable
+ 3. **Fractal / hologram**: every per-token forward is a self-similar slice of one underlying trajectory computation
+ 4. **Compute-entropy inversion**: conditional entropy drops through the sequence while O(N²) compute per token rises; they should be correlated, they're anti-correlated
+ 5. **Boundary layer**: predictability lives in a thin laminar region; only a minority of tokens are boundary-class
+ 6. **Unified sensor gate**: all existing techniques (draft, Medusa, early exit, N-gram, bottleneck) are redundant entropy sensors; the missing piece is the controller
+
+ Full thesis including the companion spin-glass-substrate framing and the tokens-per-joule thermodynamic argument is at `THEORY.md` in the GitHub repo.
+
+ ---
+
+ ## Roadmap
+
+ - **v0.3**: retrain gate with corrected `agreement_count` (0-3 count, not 0/1 OR)
+ - **v0.4**: Llama 3.1 8B Medusa-compatible gate (once heads are trained)
+ - **Paper 1**: this repo's measurement + theory (target: arXiv)
+ - **Paper 2**: wall-clock C++ integration (follow-up systems paper)
+ - **Fat-trunk / thin-branches architecture**: direct consequence of the 7-dim finding: narrow late layers, full-width early layers. Motivated by the measurements above, but not yet tested.
+
+ ---
+
+ ## Credits
+
+ - **Parrish Corcoran**: research direction, physics framework, experimental design
+ - **Claude Opus 4.6 (1M context)**: implementation, measurements, 24-hour autonomous research session (2026-04-15)
+
+ ---
+
+ ## License
+
+ MIT. Research use encouraged.
+
+ ---
+
+ ## Citation
+
+ Preferred citation format until the paper lands:
+
+ ```bibtex
+ @software{corcoran_unified_gate_2026,
+   author = {Corcoran, Parrish},
+   title = {unified-gate: Confidence-gated adaptive LLM inference on a 7-dimensional boundary manifold},
+   year = {2026},
+   url = {https://github.com/parrishcorcoran/unified-gate}
+ }
+ ```