TemporalMesh Transformer: dynamic kNN graph attention + adaptive exit gates, 29.4 PPL at 48% compute

#41
by vigneshwar234 - opened

New open-source transformer architecture, directly relevant to this repo

TMT achieves 29.4 PPL on WikiText-2 at 48% compute (−30.2% vs vanilla, 120M params). Directly relevant to users comparing efficient attention and depth-adaptive architectures.

Five innovations: Mesh Attention (O(S·k) dynamic kNN), Temporal Decay (post-softmax multiplicative), Adaptive Exit Gate (per-token depth routing, avg 5.76/12 layers), Dual-Stream FFN, EMA Memory Anchors.
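For readers who want a concrete picture of three of these ideas, here is a rough NumPy sketch. It is not taken from the TMT code: the function names (`mesh_attention`, `run_with_exit_gate`), the exponential decay form, the sigmoid gate, and every parameter (`top_k`, `decay`, `gate_w`, `threshold`) are assumptions made for illustration. It also computes dense scores before picking neighbours, so it does not reach O(S·k) cost, and it omits causal masking.

```python
import numpy as np


def mesh_attention(q, k, v, top_k=4, decay=0.05):
    """Each query attends only to its top_k highest-scoring keys.

    q, k, v have shape (S, d). Hypothetical sketch: scores are computed
    densely, so this shows the selection logic, not the O(S*k) cost.
    """
    S, d = q.shape
    scores = q @ k.T / np.sqrt(d)  # (S, S) scaled dot-product scores
    # kNN step: indices of the top_k largest scores in every row.
    nbr = np.argpartition(-scores, top_k - 1, axis=1)[:, :top_k]
    nbr_scores = np.take_along_axis(scores, nbr, axis=1)  # (S, top_k)
    # Softmax over the selected neighbours only.
    nbr_scores = nbr_scores - nbr_scores.max(axis=1, keepdims=True)
    w = np.exp(nbr_scores)
    w = w / w.sum(axis=1, keepdims=True)
    # Post-softmax multiplicative decay by token distance (assumed form).
    # The weights are not renormalized afterwards, also an assumption.
    dist = np.abs(np.arange(S)[:, None] - nbr)
    w = w * np.exp(-decay * dist)
    out = np.einsum("sk,skd->sd", w, v[nbr])  # weighted sum of neighbours
    return out, w, nbr


def run_with_exit_gate(x, layers, gate_w, threshold=0.5):
    """Per-token depth routing: a token stops once its gate fires.

    layers is a list of callables mapping (n, d) -> (n, d); gate_w is a
    hypothetical (d,) gate vector. Returns the final states and the
    number of layers each token passed through.
    """
    x = np.array(x, dtype=float)  # work on a copy
    active = np.ones(x.shape[0], dtype=bool)
    depth = np.zeros(x.shape[0], dtype=int)
    for layer in layers:
        if not active.any():
            break
        x[active] = layer(x[active])  # only active tokens are updated
        depth[active] += 1
        p_exit = 1.0 / (1.0 + np.exp(-(x @ gate_w)))  # sigmoid exit score
        active &= p_exit < threshold  # exited tokens stay exited
    return x, depth
```

On the depth figure quoted above: 5.76 of 12 layers is 5.76 / 12 = 0.48 of full depth. That lines up with the 48% compute number if cost scales roughly with layers executed per token, though the post does not say how compute was measured.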

vs. models in this category:

  • Beats Mamba: 29.4 vs 31.8 PPL, same 120M params
  • Beats Longformer: 29.4 vs 39.6 PPL, same compute class
  • LongBench: 53.4 vs 51.3 Mamba

📄 Paper (DOI 10.5281/zenodo.20287197): https://zenodo.org/records/20287390
💻 Code + 226 tests: https://github.com/vignesh2027/TemporalMesh-Transformer
🎮 Live demo: https://huggingface.co/spaces/vigneshwar234/TemporalMesh-Transformer-Demo
