vigneshwar234 committed on
Commit
4c2f9d6
·
verified ·
1 Parent(s): 8023361

Add TMT model card

Files changed (1)
  1. README.md +95 -0
README.md ADDED
@@ -0,0 +1,95 @@
+ ---
+ language:
+ - en
+ license: mit
+ tags:
+ - pytorch
+ - transformers
+ - text-generation
+ - language-model
+ - graph-neural-network
+ - sparse-attention
+ - adaptive-depth
+ - temporal-decay
+ - mesh-attention
+ - efficient-transformer
+ - novel-architecture
+ - causal-lm
+ library_name: pytorch
+ pipeline_tag: text-generation
+ ---
+
+ # TemporalMesh Transformer (TMT)
+
+ **The first architecture to fuse dynamic graph topology, token-level adaptive compute, and temporal semantic decay in a single model.**
+
+ ## Model Description
+
+ TMT relaxes three assumptions built into the standard transformer:
+
+ | Assumption | TMT Solution |
+ |---|---|
+ | All tokens are equally important | Temporal Decay: irrelevant tokens fade |
+ | Attention is flat and fully connected | Mesh Attention: dynamic kNN graph, rebuilt at each layer |
+ | Every token passes through all N layers | Adaptive Depth Routing: easy tokens exit early |
+
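The third row of the table can be illustrated with a toy router. This is a minimal sketch, not the TMT implementation: `route_with_early_exit`, the `gate` callable, and the per-token `layers` are hypothetical names, and the default threshold only mirrors the `exit_threshold=0.85` value shown under Usage. It also treats tokens independently, whereas in a real attention stack an exited token's state would still need to stay visible to the tokens that keep going.

```python
def route_with_early_exit(tokens, layers, gate, exit_threshold=0.85):
    """Run each token through `layers` until `gate` is confident enough.

    tokens: list of per-token states (any type the layers accept)
    layers: list of callables, state -> state
    gate:   callable, state -> confidence in [0, 1] (hypothetical)
    Returns (final_states, layers_used_per_token).
    """
    states = list(tokens)
    exited = [False] * len(states)
    layers_used = [0] * len(states)
    for layer in layers:
        for i in range(len(states)):
            if exited[i]:
                continue  # frozen tokens skip the remaining layers
            states[i] = layer(states[i])
            layers_used[i] += 1
            if gate(states[i]) >= exit_threshold:
                exited[i] = True
    return states, layers_used


# Toy example: each layer adds 1.0, confidence grows with the state value.
toy_layers = [lambda x: x + 1.0] * 4
toy_gate = lambda x: min(1.0, x / 4.0)
states, used = route_with_early_exit([3.0, 0.0], toy_layers, toy_gate)
print(used)  # [1, 4]
```

In this toy run the "easy" token stops after one layer and the "hard" token uses all four, so 5 of 8 possible layer applications are executed (62.5%).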
+ ## Architecture
+
+ - **Mesh Attention**: dynamic kNN graph with k=8 neighbours per token, rebuilt at every layer; attention cost is O(S·k) for sequence length S
+ - **Temporal Decay Encoding**: learned per-head multiplicative decay applied to attention weights
+ - **Adaptive Depth Routing**: per-token exit gate, ~50% compute reduction
+ - **Dual-Stream FFN**: parallel syntax and semantic streams with learned gated fusion
+ - **EMA Memory Anchors**: 16 persistent KV vectors updated by exponential moving average
+
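To make the first two bullets concrete, here is a dependency-free sketch of single-head causal kNN attention with a multiplicative distance decay. Everything in it is an assumption for illustration: the card does not say how neighbours are chosen, whether weights are renormalised after decay, or how the decay is parameterised, so `mesh_attention`, its dot-product neighbour search, and the fixed `decay` factor are hypothetical stand-ins for the learned PyTorch modules.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mesh_attention(q, k, v, graph_k=8, decay=0.9):
    """Single-head causal kNN attention with multiplicative distance decay.

    q, k, v: lists of S vectors (lists of floats) of equal dimension.
    graph_k: neighbours kept per token.
    decay:   per-step decay factor in (0, 1]; fixed here, assumed to be
             learned per head in the real model.
    Returns (outputs, edges); edges[i] lists the neighbours of token i.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs, edges = [], []
    for i, q_i in enumerate(q):
        # Score only positions j <= i so the graph stays causal.
        scores = [(dot(q_i, k[j]) * scale, j) for j in range(i + 1)]
        # Keep the graph_k highest-scoring neighbours (one row of the graph).
        top = sorted(scores, reverse=True)[:graph_k]
        # Numerically stable softmax over the kept neighbours only.
        m = max(s for s, _ in top)
        weights = [math.exp(s - m) for s, _ in top]
        # Multiplicative temporal decay: more distant neighbours fade.
        weights = [w * decay ** (i - j) for w, (_, j) in zip(weights, top)]
        total = sum(weights)
        weights = [w / total for w in weights]
        out = [sum(w * v[j][c] for w, (_, j) in zip(weights, top))
               for c in range(len(v[0]))]
        outputs.append(out)
        edges.append(sorted(j for _, j in top))
    return outputs, edges
```

Once the graph is known, the aggregation step touches at most `graph_k` neighbours per token, which is where an O(S·k) cost comes from. The brute-force neighbour search in this sketch is still quadratic in S; how the actual model avoids or amortises that is not described here.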
+ ## Performance (WikiText-2)
+
+ | Model | Parameters | Val. Perplexity ↓ | Avg. Compute/Token |
+ |---|---|---|---|
+ | Vanilla Transformer | ~120M | 42.1 | 100% |
+ | Full TMT | ~120M | **29.4** | **~48%** |
+
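For readers comparing the perplexity column with their own runs: perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The helper below is a small sketch of that standard definition; `perplexity` is a hypothetical name, and the card does not state the exact evaluation protocol (tokenisation, context length), so numbers computed this way may not be directly comparable.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of each target token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every target token has perplexity 4.
print(perplexity([math.log(0.25)] * 4))
```

Under this definition, the reported perplexities of 42.1 and 29.4 correspond to mean losses of about 3.74 and 3.38 nats per token respectively.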
+ ## Usage
+
+ ```python
+ import torch
+
+ from tmt.model.config import TMTConfig
+ from tmt.model.model import TMTModel
+
+ cfg = TMTConfig(
+     vocab_size=50258,
+     d_model=512,
+     n_heads=8,
+     n_layers=12,
+     graph_k=8,            # k in Mesh Attention
+     exit_threshold=0.85,  # Adaptive Depth Routing gate threshold
+     memory_anchors=16,    # number of EMA Memory Anchors
+ )
+
+ model = TMTModel(cfg)
+
+ # Placeholder batch of random token ids, shape (B, S); replace with tokenized text.
+ input_ids = torch.randint(0, 50258, (1, 128))
+ output = model(input_ids)
+
+ # Structured output
+ output.logits        # (B, S, V): use for generation
+ output.exit_masks    # which tokens exited at each layer
+ output.confidences   # gate confidence per token per layer
+ output.graph_edges   # the live dynamic graph
+ output.memory_state  # 16 EMA anchor states
+ ```
+
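One way to relate `output.exit_masks` to an average compute-per-token figure is sketched below. The layout is an assumption, since the card does not document it: the helper expects nested Python lists where `exit_masks[layer][token]` is true when the token leaves the stack right after that layer, so a real output would first need converting to that form. `average_compute_fraction` is a hypothetical name.

```python
def average_compute_fraction(exit_masks, n_layers):
    """Mean fraction of layers executed per token.

    exit_masks: assumed layout, exit_masks[layer][token] is True when the
                token exits right after that layer (0-based layer index).
    Tokens that never exit are counted as using all n_layers.
    """
    n_tokens = len(exit_masks[0])
    layers_used = [n_layers] * n_tokens
    for token in range(n_tokens):
        for layer, mask in enumerate(exit_masks):
            if mask[token]:
                layers_used[token] = layer + 1
                break
    return sum(layers_used) / (n_layers * n_tokens)


# Three tokens in a 4-layer stack: exits after layer 1, after layer 3, never.
masks = [
    [True, False, False],
    [False, False, False],
    [False, True, False],
    [False, False, False],
]
print(average_compute_fraction(masks, n_layers=4))
```

Here the tokens use 1, 3, and 4 layers, so 8 of 12 possible layer applications run, about 67% of full-depth compute.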
+ ## Paper
+
+ Full 20-page publication: [`paper/TemporalMesh_Transformer_2026.pdf`](paper/TemporalMesh_Transformer_2026.pdf)
+
+ ## Citation
+
+ ```bibtex
+ @misc{tmt2026,
+   title  = {TemporalMesh Transformer: Dynamic Graph Attention with Temporal Decay and Adaptive Depth Routing},
+   author = {Vignesh},
+   year   = {2026},
+   url    = {https://github.com/vignesh2027/TemporalMesh-Transformer}
+ }
+ ```
+
+ ## License
+
+ MIT