OzTianlu commited on
Commit
cf7f85d
·
verified ·
1 Parent(s): 623b961

Upload 5 files

Browse files
.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
 
 
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ paper_figure1_efficiency.png filter=lfs diff=lfs merge=lfs -text
37
+ paper_figure2_longrange.png filter=lfs diff=lfs merge=lfs -text
38
+ paper_figure3_interpretability.png filter=lfs diff=lfs merge=lfs -text
39
+ trained_pointer_heatmap_0.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,194 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language: en
3
+ license: mit
4
+ tags:
5
+ - pointer-networks
6
+ - efficient-transformers
7
+ - long-range-modeling
8
+ - linear-complexity
9
+ - sequence-modeling
10
+ - interpretability
11
+ library_name: pytorch
12
+ pipeline_tag: text-generation
13
+ ---
14
+
15
+ # Pointer: Linear-Complexity Long-Range Modeling without Pre-training
16
+
17
+ <div align="center">
18
+ <img src="paper_figure1_efficiency.png" alt="Efficiency Comparison" width="600"/>
19
+ <p><i>Pointer maintains linear scaling while Transformer shows quadratic growth</i></p>
20
+ </div>
21
+
22
+ ## Model Description
23
+
24
+ **Pointer** is a novel neural architecture that achieves **linear O(NK) complexity** for long-range sequence modeling through explicit layer-wise pointer chaining, eliminating the quadratic bottleneck of standard attention mechanisms.
25
+
26
+ Unlike attention-based approaches that compute O(N²) pairwise interactions, Pointer creates structured long-distance connections via pointer chains where each layer's selection depends on previous layers' pointer positions.
27
+
28
+ ### Key Features
29
+
30
+ - **Linear Complexity**: O(NK) operations where K ≪ N, providing **2-10× speedup** on sequences of length 2048+ compared to standard transformers
31
+ - **No Pre-training Required**: Learns structured patterns from scratch, eliminating reliance on large-scale pre-training
32
+ - **Interpretable Architecture**: Pointer heatmaps reveal hierarchical processing strategies with clear layer specialization
33
+ - **Exact Computation**: Unlike approximation methods, Pointer computes exact structured connections
34
+
35
+ ## Architecture Innovation
36
+
37
+ ### Layer-wise Pointer Chaining
38
+
39
+ Each position `i` selects exactly one target position `p_i^(ℓ)` per layer, with subsequent layers building upon these selections to form dependency paths:
40
+
41
+ ```
42
+ p_i^(ℓ) = argmax_j Score(h_i^(ℓ), h_j^(ℓ), p_i^(ℓ-1))
43
+ ```
44
+
45
+ This creates a dependency chain where each layer's pointer decisions influence subsequent layers, enabling the formation of structured long-range connections.
46
+
47
+ ### Complexity Analysis
48
+
49
+ - **Computational**: O(NK) vs O(N²d) for standard attention
50
+ - **Memory**: O(N) pointer indices vs O(N²) attention weights
51
+ - **Scaling**: For N=8192, d=512: ~4M operations vs ~34B for attention (**~10,000× reduction**)
52
+
53
+ <div align="center">
54
+ <img src="paper_figure2_longrange.png" alt="Long-range Performance" width="500"/>
55
+ <p><i>Consistent accuracy across increasing distances (512-2048 tokens)</i></p>
56
+ </div>
57
+
58
+ ## Performance
59
+
60
+ ### Efficiency Benchmarks
61
+
62
+ | Sequence Length | 256 | 512 | 1024 | 2048 |
63
+ |----------------|-----|-----|------|------|
64
+ | **Training Time (s)** |
65
+ | Pointer | 0.35 | 0.29 | 0.55 | 1.45 |
66
+ | Vanilla Transformer | 0.17 | 0.35 | 1.04 | 3.55 |
67
+ | **Speedup** | 0.48× | 0.83× | 1.89× | **2.45×** |
68
+ | **Throughput (tokens/s)** |
69
+ | Pointer | 14,446 | 34,914 | 37,189 | 28,268 |
70
+ | Vanilla Transformer | 30,320 | 29,427 | 19,703 | 11,549 |
71
+
72
+ ### Long-Range Dependency Modeling
73
+
74
+ Copy task accuracy across variable-length gaps:
75
+
76
+ | Distance | 512 | 1024 | 1536 | 2048 |
77
+ |----------|-----|------|------|------|
78
+ | Pointer | 4.38% | 5.50% | 5.38% | 5.25% |
79
+ | Vanilla Transformer | 5.38% | 4.25% | 4.88% | 4.75% |
80
+
81
+ Training loss decreased from 3.13 to 2.99 across distances, demonstrating effective learning.
82
+
83
+ ## Interpretability
84
+
85
+ <div align="center">
86
+ <img src="paper_figure3_interpretability.png" alt="Interpretability Analysis" width="500"/>
87
+ <p><i>Pointer patterns reveal hierarchical processing across layers</i></p>
88
+ </div>
89
+
90
+ ### Layer Specialization
91
+
92
+ - **Early layers (0-2)**: Focus on local patterns (average hop distance ~47-58 tokens)
93
+ - **Later layers (3-5)**: Establish long-range connections (up to 483 tokens)
94
+ - **Emergent hierarchy**: Local → global processing arises through gradient-based learning
95
+
96
+ <div align="center">
97
+ <img src="trained_pointer_heatmap_0.png" alt="Pointer Heatmap" width="400"/>
98
+ <p><i>Detailed pointer heatmap showing learned attention patterns</i></p>
99
+ </div>
100
+
101
+ ### Structured Patterns
102
+
103
+ - **Self-loops**: Information retention across layers
104
+ - **Local clusters**: Capturing phrasal structure
105
+ - **Long jumps**: Bridging distant contexts
106
+
107
+ ## Use Cases
108
+
109
+ Pointer is particularly effective for:
110
+
111
+ - **Long-context processing**: Sequences beyond attention's practical limits (4K-8K tokens)
112
+ - **Edge deployment**: Reduced memory and compute requirements for on-device inference
113
+ - **Low-resource domains**: No pre-training dependency makes it accessible without massive corpora
114
+ - **Structured reasoning tasks**: Retrieval, copying, explicit dependency modeling
115
+ - **Interpretable AI**: Clear visualization of learned dependency patterns
116
+
117
+ ## Model Configuration
118
+
119
+ ```python
120
+ # Example configuration (3.2M parameters)
121
+ config = {
122
+ "num_layers": 6,
123
+ "num_heads": 8,
124
+ "hidden_dim": 256,
125
+ "vocab_size": 10000,
126
+ "max_seq_length": 2048,
127
+ "pointer_temperature": 1.0, # Gumbel-Softmax temperature
128
+ }
129
+ ```
130
+
131
+ ## Training
132
+
133
+ ### Differentiable Pointer Selection
134
+
135
+ During training, Gumbel-Softmax enables differentiable pointer selection:
136
+
137
+ ```python
138
+ # Gumbel-Softmax for training
139
+ s_tilde = (s + gumbel_noise) / temperature
140
+ alpha = softmax(s_tilde)
141
+
142
+ # Hard selection for inference
143
+ p = argmax(s)
144
+ ```
145
+
146
+ ### Training Tips
147
+
148
+ - Start with higher temperature (τ=1.0) and anneal during training
149
+ - Use teacher forcing for sequence generation tasks
150
+ - Monitor pointer hop distances to ensure long-range learning
151
+ - Visualize pointer heatmaps to verify structured pattern emergence
152
+
153
+ ## Limitations
154
+
155
+ - **Task specificity**: Excels on tasks with clear dependency paths; may underperform on dense semantic composition
156
+ - **Evaluation scope**: Current results focus on controlled synthetic tasks (copy tasks)
157
+ - **Generation quality**: Metrics measure teacher-forcing accuracy rather than autoregressive generation quality
158
+ - **Single pointer per position**: Current implementation selects one target; multi-head variants could capture more complex patterns
159
+
160
+ ## Citation
161
+
162
+ ```bibtex
163
+ @article{li2025pointer,
164
+ title={Pointer: Linear-Complexity Long-Range Modeling without Pre-training},
165
+ author={Li, Zixi},
166
+ journal={arXiv preprint},
167
+ year={2025},
168
+ institution={Noesis Lab, Sun Yat-sen University}
169
+ }
170
+ ```
171
+
172
+ ## Related Work
173
+
174
+ This work is part of broader research on adjacency-structured parallel propagation (ASPP):
175
+
176
+ - **TreeGPT**: Bidirectional TreeFFN processing for visual reasoning
177
+ - **Asterisk Operator**: Formal ASPP framework with universality theorems
178
+ - **Pointer**: Dynamic graph construction through learned pointer chains
179
+
180
+ ## License
181
+
182
+ MIT License
183
+
184
+ ## Contact
185
+
186
+ - **Author**: Zixi Li
187
+ - **Institution**: Noesis Lab (Independent Research Group), Sun Yat-sen University
188
+ - **Email**: lizx93@mail2.sysu.edu.cn
189
+
190
+ ---
191
+
192
+ <div align="center">
193
+ <p><b>Note</b>: Model weights are not currently available. This card documents the architecture and experimental results from the research paper.</p>
194
+ </div>
paper_figure1_efficiency.png ADDED

Git LFS Details

  • SHA256: 6889009c0ee656e4013fe06ceec1a788a481d938c3bd98979ff035f540641216
  • Pointer size: 131 Bytes
  • Size of remote file: 740 kB
paper_figure2_longrange.png ADDED

Git LFS Details

  • SHA256: 3b6ae800f5efe78b366a882348405bdab8c603a0d3eb1231587ee8ceb2b703af
  • Pointer size: 131 Bytes
  • Size of remote file: 485 kB
paper_figure3_interpretability.png ADDED

Git LFS Details

  • SHA256: 8d4254be18bec503cb6d1a442a085fde0273ffc229ecfedb05fe8778c1e6aab9
  • Pointer size: 131 Bytes
  • Size of remote file: 660 kB
trained_pointer_heatmap_0.png ADDED

Git LFS Details

  • SHA256: bfd33385eeea7314fea9312f79230b7e0733b2ce97b59e203face6fd8363cdc8
  • Pointer size: 131 Bytes
  • Size of remote file: 217 kB