TemporalMesh Transformer (TMT v3)

Author: Vigneshwar LK
Paper: DOI 10.5281/zenodo.20287197
Code: github.com/vignesh2027/TemporalMesh-Transformer
Live Demo: HuggingFace Space
Benchmarks: TMT-Benchmarks Dataset

What is TMT?

TMT is a PyTorch transformer architecture that addresses five inefficiencies of standard transformers within a single model:

Problem	Standard Transformer	TMT Solution
Quadratic attention cost	$O(S^2)$ per layer	Mesh Attention: $O(S \cdot k)$ dynamic $k$NN graph
Static attention topology	Fixed fully-connected	Dynamic graph rebuilt per-layer from cosine similarity
Uniform token compute	All tokens use all $N$ layers	Adaptive Depth Routing: exit gate per token, avg 5.8/12 layers
Flat positional encoding	Position only	Temporal Decay: learned multiplicative semantic attenuation
No cross-sequence memory	Stateless	EMA Memory Anchors: 16 persistent fast-weight vectors
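
To make the Mesh Attention row concrete, here is a minimal pure-Python sketch of attention restricted to a cosine-similarity kNN graph. It is an illustration under assumptions, not the repository's implementation: the function names are hypothetical, tokens act as their own queries, keys, and values (no learned projections), and self-edges are excluded.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_edges(tokens, k):
    """For each token i, return the indices of its k most similar other tokens."""
    edges = []
    for i, u in enumerate(tokens):
        sims = [(cosine(u, v), j) for j, v in enumerate(tokens) if j != i]
        sims.sort(key=lambda t: (-t[0], t[1]))  # highest similarity first
        edges.append([j for _, j in sims[:k]])
    return edges

def mesh_attention(tokens, k):
    """Softmax attention over each token's k neighbours: S*k scores, not S*S."""
    d = len(tokens[0])
    out = []
    for i, nbrs in enumerate(knn_edges(tokens, k)):
        scores = [
            sum(a * b for a, b in zip(tokens[i], tokens[j])) / math.sqrt(d)
            for j in nbrs
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * tokens[j][c] for w, j in zip(weights, nbrs))
            for c in range(d)
        ])
    return out

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
edges = knn_edges(tokens, k=1)        # [[1], [0], [3], [2]]
mixed = mesh_attention(tokens, k=1)   # with k=1 each token copies its neighbour
```

Note that the brute-force neighbour search in this sketch compares every pair of tokens, so it is itself quadratic in $S$; only the attention scores are reduced to $S \cdot k$.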

Results

Model	WikiText-2 PPL ↓	WikiText-103 PPL ↓	LongBench ↑	Compute
Vanilla Transformer	42.1	51.3	41.2	100%
Longformer	39.6	47.2	49.8	62%
Mamba	31.8	38.4	51.3	55%
RWKV	33.1	40.9	48.7	50%
Full TMT	29.4	36.1	53.4	48%

All models have ~120M parameters. TMT was trained for 10K steps on WikiText-2 with AdamW and a cosine learning-rate schedule, using seeds 42, 1337, and 2024.

Architecture at a Glance

Input → Token Embedding + RoPE
      → [× 12 layers]
           MeshBuilder (kNN graph, cosine sim, top-k=8)
           Mesh Attention  O(S·k)  + Temporal Decay Encoding
           EMA Memory Anchor Cross-Attention (16 anchors, β=0.99)
           Dual-Stream FFN (syntax stream ‖ semantic stream, sigmoid gate)
           Exit Gate  σ(W_gate · x) > 0.85 → token frozen
      → LayerNorm → Tied Output Projection
      → Logits (B, S, V)
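
The exit-gate step in the diagram above can be sketched in plain Python as follows. This is a toy illustration under assumptions: `route_tokens`, `layer_fn`, and the single shared gate vector `w_gate` are hypothetical names, and the real model may use a separate gate per layer and batched tensor operations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def route_tokens(hidden, w_gate, layer_fn, n_layers=12, threshold=0.85):
    """Run up to n_layers per token, freezing a token once its gate fires.

    hidden:   list of token vectors (lists of floats)
    w_gate:   gate weight vector (assumed shared across layers for brevity)
    layer_fn: callable(vector, layer_index) -> vector, standing in for a layer
    Returns (final_hidden, layers_used), one entry per token.
    """
    hidden = [list(h) for h in hidden]
    frozen = [False] * len(hidden)
    layers_used = [0] * len(hidden)
    for layer in range(n_layers):
        for i, h in enumerate(hidden):
            if frozen[i]:
                continue  # exited tokens skip all remaining layers
            hidden[i] = layer_fn(h, layer)
            layers_used[i] += 1
            confidence = sigmoid(sum(w * x for w, x in zip(w_gate, hidden[i])))
            frozen[i] = confidence > threshold
    return hidden, layers_used

# Toy layer: add 1.0 to every component, so gate confidence rises with depth.
final, used = route_tokens(
    hidden=[[1.0], [-3.0]],
    w_gate=[1.0],
    layer_fn=lambda h, layer: [x + 1.0 for x in h],
)
```

In this toy run the first token exits after one layer and the second after five, so average depth is the mean of `used` rather than the full 12.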

Output fields (TMTOutput dataclass):

logits — (B, S, V) next-token predictions
exit_masks — list of (B, S) booleans, one per layer
confidences — gate confidence per token per layer
graph_edges — sparse kNN edge list from final layer
memory_state — (M, D) final EMA anchor states
decay_scalars — temporal decay weights applied
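
As a rough illustration of what `memory_state` holds, the sketch below applies a standard exponential moving average with β = 0.99 to 16 anchors. How TMT derives the per-anchor summaries is not described here, so `summaries` and `ema_update` are hypothetical, and the anchor width of 4 is a toy size.

```python
def ema_update(anchors, summaries, beta=0.99):
    """Blend each anchor toward its new summary: m <- beta*m + (1-beta)*s."""
    return [
        [beta * m + (1.0 - beta) * s for m, s in zip(anchor, summary)]
        for anchor, summary in zip(anchors, summaries)
    ]

# 16 anchors, all starting at zero.
anchors = [[0.0] * 4 for _ in range(16)]
summaries = [[1.0] * 4 for _ in range(16)]  # hypothetical constant target
for _ in range(100):                         # 100 successive updates
    anchors = ema_update(anchors, summaries)
# After n updates toward a constant target, each entry equals 1 - beta**n.
```

With β = 0.99 the effective memory horizon is roughly 1 / (1 − β) = 100 updates, which is why the anchors change slowly from one sequence to the next.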

Quick Start

git clone https://github.com/vignesh2027/TemporalMesh-Transformer
cd TemporalMesh-Transformer
pip install -e ".[dev]"

from tmt.model.config import TMTConfig
from tmt.model.model import TMTModel
import torch

config = TMTConfig(
    vocab_size=50257,
    d_model=512,
    n_heads=8,
    n_layers=12,
    graph_k=8,
    exit_threshold=0.85,
    memory_anchors=16,
)
model = TMTModel(config)  # ~120M params

tokens = torch.randint(0, 50257, (1, 256))
out = model(tokens)

print(out.logits.shape)      # (1, 256, 50257)
print(out.exit_masks[-1])    # which tokens exited at layer 12
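# Layers run per token = layers before the gate fired + the exit layer itself.
# Assumes exit_masks are cumulative (a frozen token stays True in later layers).
masks = torch.stack(out.exit_masks)  # (n_layers, B, S)
layers_used = ((~masks).sum(dim=0) + 1).clamp(max=len(out.exit_masks))
print(f"Avg exit layer: {layers_used.float().mean():.2f}")  # ~5.8 reported for the trained model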

Training

python scripts/train.py \
  --dataset wikitext-2 \
  --model_size base \
  --steps 10000 \
  --lr 3e-4 \
  --batch_size 16 \
  --seq_len 256 \
  --exit_threshold 0.85 \
  --graph_k 8
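
For reference, a cosine learning-rate schedule matching the flags above (`--lr 3e-4`, `--steps 10000`) might look like the sketch below. Whether `scripts/train.py` adds warmup or a non-zero floor is not stated here, so this assumes plain cosine decay to zero; `cosine_lr` and `min_lr` are hypothetical names.

```python
import math

def cosine_lr(step, total_steps=10_000, peak_lr=3e-4, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [cosine_lr(s) for s in (0, 2_500, 5_000, 7_500, 10_000)]
```

The rate falls slowly at first, fastest around the midpoint (half the peak at step 5000), and flattens out near the end of training.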

Ablation Summary

Config	PPL ↓	Compute	VRAM
Vanilla Transformer	42.1	100%	18.4 GB
+ Mesh Attention only	37.8	62%	11.2 GB
+ Temporal Decay only	40.3	98%	18.4 GB
+ Adaptive Exit only	39.6	51%	18.4 GB
Mesh + Decay	34.2	61%	11.2 GB
Mesh + Exit	35.1	50%	11.2 GB
Full TMT	29.4	48%	11.2 GB

The full combination is superadditive: Full TMT improves perplexity by 4.1 PPL more than the sum of the three single-component gains.
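
The 4.1 PPL figure can be reproduced from the ablation table with a few lines of arithmetic. The snippet uses only numbers from the table; the variable names are illustrative.

```python
vanilla = 42.1
single = {"mesh": 37.8, "decay": 40.3, "exit": 39.6}  # single-component rows
full = 29.4

# Perplexity improvement of each component on its own, relative to vanilla.
gains = {name: vanilla - ppl for name, ppl in single.items()}
additive = sum(gains.values())   # what the gains would total if they simply added
total = vanilla - full           # what Full TMT actually achieves
interaction = total - additive

print(f"sum of single gains: {additive:.1f}")
print(f"full-model gain:     {total:.1f}")
print(f"interaction effect:  {interaction:.1f}")
```

The table has no rows isolating the EMA memory anchors or the dual-stream FFN, so this residual also absorbs whatever those two components contribute.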

Citation

@misc{vigneshwar2026tmt,
  title   = {TemporalMesh Transformer: Dynamic Graph Attention with
             Temporal Semantic Decay and Per-Token Adaptive Depth Routing},
  author  = {Vigneshwar LK},
  year    = {2026},
  doi     = {10.5281/zenodo.20287197},
  url     = {https://zenodo.org/records/20287390}
}

License


Evaluation results

Validation Perplexity on WikiText-2 (self-reported): 29.4
Validation Perplexity on WikiText-103 (self-reported): 36.1