unified-gate

LLM inference is overbudgeted by ~1000Γ—. The per-token difficulty signal lives on a ~7-dimensional manifold. We measured it. This is the gate.


k-sweep: engineering knee = physics ceiling

The one-minute pitch

Every speculative-decoding / early-exit / Medusa / adaptive-compute paper of the last three years is the same sensor in a different costume measuring one underlying signal: how sharp is the next-token distribution. The field keeps shipping new sensors and never builds the controller that fuses them.

This is the controller. It's a 20-feature, 64Γ—64 MLP (26 KB) that decides, per token, whether to accept a cheap draft or run the full backbone. Held-out measurement on BitNet b1.58 2B: 10.6% skip at 95% fidelity, 14.1% skip at 90% fidelity (peak K=40-50, replicated Β±0.3% over 5 seeds).

The provocative claim is not the skip rate. It's the dimensionality: the per-token difficulty surface is ~7-dimensional, measured by TwoNN on final-layer hidden states, across two architectures (BitNet 2B + Llama 3.1 8B). That's a physics-grounded ceiling, not an engineering target. It says per-token decision-making has a compute floor and we're nowhere near it.


The three claims, each measured

1. The information is on a thin surface, not in the bulk

Running 30-layer Γ— 2560-dim backbone computation for every token is redundant with what Medusa heads already read off the cached hidden state. That's the holographic principle applied to transformer inference β€” the heads are empirical proof the future tokens were already on the surface. Bulk volume is being recomputed from boundary data per step.

2. Compute and entropy are inversely correlated

Conditional next-token entropy decreases with context length (cloud tightens as context locks in plausible completions). Transformer compute per token increases with context length (O(NΒ²) attention, bigger KV cache). Current decoders scale compute up exactly when information requirement scales down. RNNs had the right compute shape β€” we traded it for capacity.

3. The gate's dimensionality is set by physics

Per-sequence intrinsic dim of final-layer hidden states, measured by TwoNN (Facco et al. 2017):

Model Ambient dim Per-seq intrinsic
BitNet b1.58 2B (result_norm) 2560 7.3
Llama 3.1 8B Q4_K_M (result_norm) 4096 6.9

Second cross-model metric: raw hidden-state participation ratio divided by ambient dim:

Model PR PR / ambient
BitNet 2B 85 3.3%
Llama 3.1 8B 151 3.7%

Two independent measurements agreeing that both models concentrate per-token decision-making into ~7 dimensions out of thousands. When we train the gate on top-K features ranked by gradient importance, K=7 recovers ~70% of the K=50 peak skip. The engineering knee of the feature-count curve lands exactly at the physics ceiling.


The measurement

5-seed K-sweep on the BitNet 2B held-out set. skip at Ξ»=0.95 fidelity (mean Β± std):

K     skip@Ξ»=0.95      Οƒ-gap vs K=70
 7    7.3% (single)    (matches per-seq intrinsic dim, 80% of peak)
15    9.2% Β± 0.3%      -2.4Οƒ (lower, expected)
20    9.8% Β± 0.2%       0.1Οƒ (matches K=70)
25   10.1% Β± 0.2%      +1.1Οƒ
30   10.5% Β± 0.3%      +2.1Οƒ
40   10.6% Β± 0.2%      +3.2Οƒ  ← peak
50   10.7% Β± 0.2%      +3.4Οƒ  ← peak
70    9.7% Β± 0.3%      baseline

The K=70 bundle is over-parameterized. Adding features past ~50 degrades the gate by ~9%, a ~3Οƒ effect replicated across seeds. This is the inference analog of parameter count β‰  information content: once you cross the per-seq manifold ceiling, extra features are just overfitting noise.


Architecture (gate_k20.pt)

  • 20 input features selected by gradient importance from a 70-feature physics-aperture bundle
  • Two hidden layers of 64 ReLU units each
  • Single sigmoid output (skip probability)
  • ~6,500 parameters, 26 KB on disk
  • Calibrated thresholds for Ξ» ∈ {0.85, 0.90, 0.95, 0.99} bundled in the checkpoint

The 20 features

Ranked by gradient importance on held-out:

  1. sup_1 β€” superposition effective rank (exp(entropy of top-K softmax))
  2. cluster_1 β€” K-means soft-cluster entropy
  3. logit_gap β€” head-0 top1 minus top2 logit
  4. content_conf β€” head-0 top-1 softmax
  5. cluster_0 β€” K-means min-distance-to-center
  6. layer_5 β€” cos(h_5, h_15) Ryu-Takayanagi layer-wise similarity
  7. layer_9 β€” layer-wise norm_15 (log)
  8. layer_7 β€” cos(h_5, h_29)
  9. top10_cov β€” head-0 cumulative top-10 probability
  10. treuse_2 β€” token-reuse rank within recent window (H2O lexical)
  11. agreement_count β€” head-0 arg-max matches head-k lagged
  12. fe_1 β€” entropy-adjusted free-energy analog
  13. rg_2 β€” renormalization-group divergence at scale 9
  14. mom_0 β€” head-0 softmax 3rd moment (skewness)
  15. vel_0 β€” hidden-state velocity β€–h_t βˆ’ h_{t-1}β€–
  16. fe_0 β€” log(1 + 0.01 Β· cluster_mindist)
  17. hnorm_0 β€” log(1 + β€–h_tβ€–)
  18. layer_1 β€” log(1 + velocity 15β†’29)
  19. nbr_0 β€” distance to nearest recent hidden state (H2O temporal)
  20. sup_0 β€” top-K token-embedding spread in hidden space

Five framings from the theory thesis, each contributing:

  • Holographic (cluster, neighborhood, free-energy)
  • Electron-cloud / superposition (sup_spread, sup_eff_rank, moments)
  • Ryu-Takayanagi depth projection (layer-wise 5/15/29 features β€” biggest single group)
  • H2O heavy-hitters (token-reuse, neighborhood)
  • Renormalization group (multi-scale coarse-graining divergence)
  • Base information-theory (confidence, logit gap, covers, agreement)

Usage

import torch
from unified_gate import Gate, extract_all_features

gate = Gate("gate_k20.pt")

# Per-sequence feature extraction
X = extract_all_features(
    hidden_last=h29,      # [T, H]  final-layer result_norm, float32
    hidden_mid=h15,       # [T, H]  middle layer
    hidden_early=h5,      # [T, H]  early layer
    head_logits=logits,   # [T, K_heads, V]  Medusa head logits
    lm_head=lm_head_np,   # [V, H]  output embeddings
    tokens=tokens,        # [T]     token ids
    period_ids=period_ids,        # precomputed from tokenizer
    newline_ids=newline_ids,
    cluster_centers=centers,      # K=32 pre-fit centers
)                           # returns [T-8, 70] float32

# Skip decision
scores = gate.score(X)                     # skip probability per token
skip_mask = gate.skip_mask(X, fidelity=0.95)
# Accept Medusa draft where skip_mask is True; re-run backbone where False.

Install from GitHub:

pip install git+https://github.com/parrishcorcoran/unified-gate.git

Reproducibility:

git clone https://github.com/parrishcorcoran/unified-gate
cd unified-gate
python scripts/reproduce.py --medusabitnet-root /path/to/MedusaBitNet

Matches stored frontier within Β±0.001 absolute skip.


Cross-model scope and limits

Validated on:

  • BitNet b1.58 2B (primary training + held-out measurement)
  • Llama 3.1 8B Q4_K_M (cross-model TwoNN intrinsic-dim agreement)

Not yet validated on:

  • Wall-clock speedup on real hardware (the systems paper follow-up)
  • Much larger models (70B+)
  • Non-English / specialized domains

Known limits:

  • The gate is trained on BitNet-specific Medusa head acceptance. Cross-model deployment requires retraining the 64Γ—64 MLP on target-model head acceptances. The feature extractor generalizes; the MLP weights don't.
  • gate_k20.pt's agreement_count feature is a 0/1 logical OR (numpy 2.x bool-add semantics in training pipeline) not a 0-3 count. A corrected retraining is on the v0.3 roadmap. In the measured frontier this is empirically fine β€” but it's a lurking name/semantics mismatch worth flagging.

Theoretical framework

Six equivalent framings β€” not six different ideas, but one underlying insight seen from six angles:

  1. Holographic principle / black-hole boundary layer β€” information about the completion is on a thin surface of the hidden state, not in the bulk compute
  2. Electron cloud / quantum probability β€” there is no "correct" next token; the cloud is the observable
  3. Fractal / hologram β€” every per-token forward is a self-similar slice of one underlying trajectory computation
  4. Compute-entropy inversion β€” conditional entropy drops through the sequence while O(NΒ²) compute per token rises; they should be correlated, they're anti-correlated
  5. Boundary layer β€” predictability lives in a thin laminar region; only a minority of tokens are boundary-class
  6. Unified sensor gate β€” all existing techniques (draft, Medusa, early exit, N-gram, bottleneck) are redundant entropy sensors; the missing piece is the controller

Full thesis including the companion spin-glass-substrate framing and the tokens-per-joule thermodynamic argument is at THEORY.md in the GitHub repo.


Roadmap

  • v0.3 β€” retrain gate with corrected agreement_count (0-3 count, not 0/1 OR)
  • v0.4 β€” Llama 3.1 8B Medusa-compatible gate (once heads are trained)
  • Paper 1 β€” this repo's measurement + theory (target: arXiv)
  • Paper 2 β€” wall-clock C++ integration (follow-up systems paper)
  • Fat-trunk / thin-branches architecture β€” direct consequence of 7-dim finding: narrow late layers, full-width early layers. Experimentally justified but untested.

Credits

  • Parrish Corcoran β€” research direction, physics framework, experimental design
  • Claude Opus 4.6 (1M context) β€” implementation, measurements, 24-hour autonomous research session (2026-04-15)

License

MIT β€” research use encouraged.


Citation

Preferred citation format until the paper lands:

@software{corcoran_unified_gate_2026,
  author = {Corcoran, Parrish},
  title  = {unified-gate: Confidence-gated adaptive LLM inference on a 7-dimensional boundary manifold},
  year   = {2026},
  url    = {https://github.com/parrishcorcoran/unified-gate}
}
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support