New architecture: TemporalMesh Transformer — dynamic kNN graph attention + per-token exit routing, 29.4 PPL at 48% compute

#33

by vigneshwar234 - opened 27 days ago

TemporalMesh Transformer (TMT) — open-source, 120M params, state-of-the-art efficiency

Sharing a new transformer architecture for the community's feedback and comparison.

TMT reaches 29.4 perplexity (PPL) on WikiText-2 at 48% of the vanilla transformer's compute, a 30.2% reduction from the vanilla baseline's 42.1. At ~120M parameters it also outperforms Mamba (31.8), RWKV (33.1), and Longformer (39.6).

Five innovations unified in one forward pass

Mesh Attention — dynamic kNN graph (k=8) rebuilt per-layer from cosine similarity. O(S·k) vs O(S²). At S=1024: 128× fewer attention ops.
Temporal Decay Encoding — learned per-head multiplicative scalar post-softmax: ã_ij = α_ij × σ(w·|t_i−t_j|)
Adaptive Depth Routing — per-token exit gate, avg 5.76/12 layers used (52% compute saved)
Dual-Stream FFN — syntax + semantic parallel streams with sigmoid fusion gate
EMA Memory Anchors — 16 persistent fast-weight vectors (β=0.99), 32KB params
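
To make the first two components concrete, here is a minimal single-head NumPy sketch of kNN-restricted attention with the post-softmax temporal gate. It is not taken from the TMT repository: the function name `mesh_attention`, the reuse of the input as queries, keys, and values, and the fixed scalar `w = -0.1` (standing in for the learned per-head weight) are assumptions for illustration. It also builds the graph from a dense S×S similarity matrix, so only the attention step itself costs O(S·k) here.

```python
import numpy as np

def mesh_attention(x, t, k=8, w=-0.1):
    """Single-head sketch: each token attends only to its k nearest neighbours.

    x: (S, d) token vectors, reused as queries, keys and values for brevity.
    t: (S,) token positions or timestamps.
    w: scalar standing in for the learned per-head decay weight (assumed value).
    """
    S, d = x.shape
    # Cosine similarity between every pair of tokens (dense, for simplicity).
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = unit @ unit.T
    # kNN graph: indices of the k most similar tokens per row (self included).
    nbrs = np.argpartition(-sim, k - 1, axis=1)[:, :k]          # (S, k)
    # Scaled dot-product scores on graph edges only.
    scores = np.einsum("sd,skd->sk", x, x[nbrs]) / np.sqrt(d)   # (S, k)
    scores -= scores.max(axis=1, keepdims=True)                 # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)                   # softmax over k neighbours
    # Temporal gate after the softmax: a~_ij = alpha_ij * sigmoid(w * |t_i - t_j|)
    dt = np.abs(t[:, None] - t[nbrs])
    gated = alpha / (1.0 + np.exp(-w * dt))
    out = np.einsum("sk,skd->sd", gated, x[nbrs])
    return out, nbrs, gated

rng = np.random.default_rng(0)
S, d, k = 1024, 64, 8
x = rng.standard_normal((S, d))
t = np.arange(S, dtype=float)
out, nbrs, gated = mesh_attention(x, t, k=k)
print(out.shape, nbrs.shape)   # (1024, 64) (1024, 8)
print((S * S) // (S * k))      # 128
```

The last line reproduces the 128× figure quoted for S=1024 and k=8. Because the gate multiplies weights that already sum to 1 by a value below 1, the gated weights in each row sum to less than 1; whether TMT renormalizes afterwards is not stated in the post.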

Results across 8 benchmarks (four shown below)

Model	WT-2 PPL↓	WT-103 PPL↓	LongBench↑	C4 PPL↓	Relative compute
Vanilla	42.1	51.3	41.2	38.4	100%
Longformer	39.6	47.2	49.8	36.1	62%
Mamba	31.8	38.4	51.3	30.1	55%
TMT	29.4	36.1	53.4	27.4	48%
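
The headline percentages can be re-derived from the table with a few lines of Python. This is only a sanity check on the reported numbers, not a reproduction of the experiments. The dictionary name `results` is illustrative, and the final line assumes compute scales linearly with the number of layers a token passes through, which the post implies but does not state.

```python
# WikiText-2 perplexity and relative compute, copied from the results table.
results = {
    "Vanilla": (42.1, 1.00),
    "Longformer": (39.6, 0.62),
    "Mamba": (31.8, 0.55),
    "TMT": (29.4, 0.48),
}

# Relative perplexity reduction of each model against the vanilla baseline.
base_ppl = results["Vanilla"][0]
reductions = {name: 100 * (base_ppl - ppl) / base_ppl for name, (ppl, _) in results.items()}
for name, (ppl, compute) in results.items():
    print(f"{name:<11}{ppl:>5.1f} PPL  {reductions[name]:5.1f}% lower  {compute:.0%} compute")

# Adaptive depth: average layers used per token out of 12 (assumes cost is linear in depth).
depth_fraction = 5.76 / 12
print(f"{depth_fraction:.0%} of layers used")  # 48% of layers used
```

The TMT row works out to a 30.2% perplexity reduction, matching the figure quoted at the top of the post.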

Quick start

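import torch  # assumes TMTModel is a PyTorch module that takes a LongTensor of token IDs
from tmt.model.config import TMTConfig
from tmt.model.model import TMTModel
model = TMTModel(TMTConfig(vocab_size=50257, d_model=512, n_heads=8, n_layers=12))
# Dummy batch of token IDs, shape (batch, seq_len); replace with real tokenizer output.
input_ids = torch.randint(0, 50257, (1, 128))
out = model(input_ids)
# The output exposes out.logits, out.exit_masks, out.graph_edges, and out.confidences.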

📄 Paper: https://zenodo.org/records/20287390 · DOI: 10.5281/zenodo.20287197
💻 Code (226 tests): https://github.com/vignesh2027/TemporalMesh-Transformer
🎮 Live Demo: https://huggingface.co/spaces/vigneshwar234/TemporalMesh-Transformer-Demo
🤗 Model: https://huggingface.co/vigneshwar234/TemporalMesh-Transformer
