OzTianlu committed
Commit 77a5060 · verified · 1 Parent(s): 568d08f

Update README.md

Files changed (1)
  1. README.md +167 -2
README.md CHANGED
@@ -1,6 +1,171 @@
  ---
- license: mit
+ license: apache-2.0
  language:
+ - en
  - zh
+ library_name: pytorch
+ tags:
+ - transformer
+ - decoder-only
+ - pointer-networks
+ - knowledge-distillation
+ - sparse-attention
+ - pytorch
  pipeline_tag: text-generation
- ---
+ ---
+ 
+ # Pointer: Decoder-only Transformer with Relational Routing
+ 
+ Pointer is a decoder-only transformer architecture that implements relational routing through sparse pointer mechanisms. The core innovation is writing relational "edges" into weights while dereferencing node vectors at runtime, combined with FFN blocks for non-linear transformations.
+ 
+ ## Model Architecture
+ 
+ ### Core Innovation: Pointer Block
+ The PointerBlock is the heart of this architecture, implementing the following (a sketch appears after this list):
+ - **Sparse Address Generation**: Creates sparse address distributions through top-k selection
+ - **Multi-head Attention**: Uses multiple attention heads for pointer computation
+ - **Dynamic Vector Aggregation**: Aggregates neighbor vectors based on pointer probabilities
+ - **Pointer-of-Pointer Chaining**: Enables hierarchical knowledge addressing across layers
+ 
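+ A minimal, self-contained sketch of the addressing-and-aggregation step (illustrative only: the function name, shapes, and the dense-then-select scoring are assumptions, not the repo's exact code):
+ 
+ ```python
+ import torch
+ import torch.nn.functional as F
+ 
+ def sparse_pointer_aggregate(q, k, v, top_k=2):
+     """q, k, v: (B, H, T, d_h). Causal top-k pointer read over past positions."""
+     scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5           # (B, H, T, T)
+     causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
+     scores = scores.masked_fill(causal, float("-inf"))              # no future edges
+     topv, topi = scores.topk(top_k, dim=-1)                         # sparse addresses
+     probs = F.softmax(topv, dim=-1)                                 # address distribution
+     # Gather the selected neighbor vectors and mix them by pointer probability.
+     v_exp = v.unsqueeze(2).expand(-1, -1, scores.shape[2], -1, -1)  # (B, H, T, T, d_h)
+     idx = topi.unsqueeze(-1).expand(*topi.shape, v.shape[-1])       # (B, H, T, k, d_h)
+     gathered = torch.gather(v_exp, dim=3, index=idx)
+     return (probs.unsqueeze(-1) * gathered).sum(dim=3), topi        # pooled read + indices
+ ```
+ 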
+ ### Architecture Components
+ 
+ ```
+ TokenEmbedding → [PointerLayer × N] → LayerNorm → LM Head
+ 
+ PointerLayer:
+ ├── LayerNorm
+ ├── PointerBlock (sparse addressing + aggregation)
+ ├── Gate + Residual Connection
+ ├── LayerNorm
+ └── FFN (d → d_ff → d)
+ ```
+ 
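+ Read as code, a PointerLayer forward pass might look like the following (a rough sketch of the diagram above; the sigmoid gate and the FFN activation are assumptions, not the repo's exact code):
+ 
+ ```python
+ import torch
+ import torch.nn as nn
+ 
+ class PointerLayerSketch(nn.Module):
+     """Pre-norm: gated PointerBlock + residual, then FFN + residual."""
+     def __init__(self, d: int, d_ff: int, pointer_block: nn.Module):
+         super().__init__()
+         self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
+         self.pointer_block = pointer_block   # sparse addressing + aggregation
+         self.gate = nn.Linear(d, d)          # assumed gating projection
+         self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))
+ 
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         h = self.pointer_block(self.norm1(x))
+         x = x + torch.sigmoid(self.gate(x)) * h   # Gate + Residual Connection
+         return x + self.ffn(self.norm2(x))        # FFN (d -> d_ff -> d)
+ ```
+ 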
+ ### Key Features
+ - **Relational Routing**: Only "edges" are written into weights; node vectors are dereferenced at runtime
+ - **Sparse Attention**: Top-k selection mechanism for efficient computation
+ - **Knowledge Address Chains**: Higher layers reference increasingly abstract relationship patterns
+ - **KV Caching**: Efficient inference with dynamic cache expansion
+ 
+ ## Model Specifications
+ 
+ | Parameter | Value |
+ |-----------|-------|
+ | Architecture | Decoder-only Transformer |
+ | Vocabulary Size | 50,032 |
+ | Hidden Dimension (d) | 4,096 |
+ | Number of Layers | 48 |
+ | Attention Heads | 32 |
+ | Top-k Selection | 2 |
+ | FFN Expansion Ratio | 2.7 |
+ | Sequence Length | 4,096 |
+ | Parameters | ~6B |
+ 
+ ## Training Details
+ 
+ ### Mix-Distillation Strategy
+ The model was trained with Mix-Distillation, following the "Small Models Struggle to Learn from Strong Reasoners" approach (a sampling sketch follows this list):
+ 
+ - **Teacher Model**: DeepSeek-R1
+ - **Training Data**: Mix-Long strategy, mixing Long-CoT and Short-CoT data in a 0.2 : 0.8 ratio
+ - **Training Steps**: 10,000 steps with gradient accumulation
+ - **Precision**: FP16 with numerical stability protections
+ 
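+ As a sketch, the Mix-Long sampling could be implemented like this (the dataset pools below are hypothetical placeholders):
+ 
+ ```python
+ import random
+ 
+ LONG_COT_RATIO = 0.2   # Long-CoT : Short-CoT = 0.2 : 0.8
+ 
+ def sample_mixed_batch(long_cot_pool, short_cot_pool, batch_size, seed=0):
+     """Draw each example from the long or short CoT pool per the mix ratio."""
+     rng = random.Random(seed)
+     return [
+         rng.choice(long_cot_pool if rng.random() < LONG_COT_RATIO else short_cot_pool)
+         for _ in range(batch_size)
+     ]
+ ```
+ 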
+ ### Training Hyperparameters
+ ```yaml
+ batch_size: 1024
+ learning_rate: 3e-4
+ warmup_ratio: 0.05
+ sequence_length: 4096
+ optimizer: AdamW
+ ```
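+ 
+ One plausible way these settings map onto a PyTorch optimizer and warmup schedule (a sketch assuming the 10,000-step budget above; the post-warmup shape is an assumption):
+ 
+ ```python
+ import torch
+ from torch.optim.lr_scheduler import LambdaLR
+ 
+ TOTAL_STEPS = 10_000
+ WARMUP_STEPS = int(0.05 * TOTAL_STEPS)        # warmup_ratio: 0.05
+ 
+ model = torch.nn.Linear(8, 8)                 # stand-in; use PointerDecoder in practice
+ optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
+ # Linear warmup to the peak LR, then constant.
+ scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / WARMUP_STEPS))
+ ```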
+ 
+ ### Loss Components
+ The training objective combines four terms (sketched after this list):
+ - **Cross-Entropy Loss**: Standard language modeling objective
+ - **Hidden State MSE**: Knowledge distillation from teacher hidden states
+ - **Pointer KL Divergence**: Alignment of pointer attention distributions
+ - **Pointer Cross-Entropy**: Hard distillation for pointer indices
+ 
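+ A hedged sketch of how the four terms might be combined (the output keys, shapes, and equal weights are illustrative assumptions, not the trained configuration):
+ 
+ ```python
+ import torch.nn.functional as F
+ 
+ def distillation_loss(student, teacher, targets, w=(1.0, 1.0, 1.0, 1.0)):
+     """student/teacher: dicts of tensors; targets: (B, T) token ids."""
+     ce = F.cross_entropy(student["logits"].transpose(1, 2), targets)   # LM objective
+     mse = F.mse_loss(student["hidden"], teacher["hidden"])             # hidden-state match
+     kl = F.kl_div(student["ptr_logprobs"], teacher["ptr_probs"],
+                   reduction="batchmean")                               # soft pointer alignment
+     ptr_ce = F.cross_entropy(student["ptr_logits"].transpose(1, 2),
+                              teacher["ptr_indices"])                   # hard pointer indices
+     return w[0] * ce + w[1] * mse + w[2] * kl + w[3] * ptr_ce
+ ```
+ 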
+ ## Key Innovations
+ 
+ ### 1. Pointer-of-Pointer Mechanism
+ Each layer produces pointer indices into previous positions, and the next layer uses these indices to create "pointer-of-pointer" chains, enabling hierarchical knowledge addressing patterns (illustrated below).
+ 
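+ A toy illustration of chain dereferencing (top-1 pointers only; purely conceptual, not the model's code path):
+ 
+ ```python
+ import torch
+ 
+ def chase_pointers(layer_indices):
+     """layer_indices[l][t] = position layer l points to from position t.
+     Returns the position reached by following one pointer per layer."""
+     pos = torch.arange(layer_indices[0].shape[0])
+     for ptr in layer_indices:      # layer l+1 dereferences layer l's targets
+         pos = ptr[pos]
+     return pos
+ 
+ # Layer 0 sends every position to 0; layer 1 sends 0 to 3, so all chains end at 3.
+ print(chase_pointers([torch.tensor([0, 0, 0, 0]), torch.tensor([3, 1, 2, 0])]))
+ ```
+ 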
+ ### 2. Sparse Relational Routing
+ Instead of dense attention, the model uses sparse top-k selection to identify the most relevant connections, making computation more efficient while maintaining expressiveness.
+ 
+ ### 3. Runtime Vector Dereferencing
+ Unlike traditional transformers that compute attention over all positions, Pointer writes relationship patterns into weights and dereferences specific node vectors only when needed.
+ 
+ ### 4. Numerical Stability for FP16
+ The forward pass includes extensive NaN detection and handling (a guard sketch follows this list):
+ - Input validation in embeddings
+ - Attention score clamping
+ - Emergency NaN repairs
+ 
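+ A minimal sketch of such a guard (the helper name and clamp bound are illustrative, not the repo's actual utilities):
+ 
+ ```python
+ import torch
+ 
+ def nan_guard(x: torch.Tensor, bound: float = 1e4) -> torch.Tensor:
+     """Clamp to an FP16-safe range and repair NaN/Inf instead of propagating them."""
+     x = x.clamp(-bound, bound)        # attention-score clamping
+     if not torch.isfinite(x).all():   # NaN/Inf detection
+         x = torch.nan_to_num(x, nan=0.0, posinf=bound, neginf=-bound)  # emergency repair
+     return x
+ ```
+ 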
+ ## Usage
+ 
+ ```python
+ import torch
+ from src.model.pointer_model import PointerDecoder
+ 
+ # Initialize model
+ model = PointerDecoder(
+     vocab_size=50032,
+     d=4096,
+     n_layers=48,
+     n_heads=32,
+     top_k=2,
+     r=2.7
+ )
+ 
+ # Forward pass
+ input_ids = torch.randint(0, 50032, (1, 100))
+ logits = model(input_ids)
+ 
+ # Inference with caching: feed tokens one at a time
+ cache = model.init_cache(batch_size=1)
+ for token in input_ids[0]:
+     logits, cache = model.step(token, cache)
+ ```
+ 
+ ## File Structure
+ 
+ ```
+ src/
+ ├── layers/
+ │   ├── embedding.py        # TokenEmbedding with vocab reduction support
+ │   ├── rotary.py           # Rotary positional encoding
+ │   ├── pointer_block.py    # Core PointerBlock implementation
+ │   ├── ffn.py              # Feed-forward network
+ │   └── pointer_layer.py    # PointerBlock + FFN + Residual connections
+ └── model/
+     └── pointer_model.py    # Complete PointerDecoder implementation
+ ```
+ 
+ ## Supported Languages
+ 
+ - English
+ - Chinese (Simplified)
+ 
+ ## Limitations
+ 
+ - Supports only left-to-right generation (no bidirectional attention)
+ - Requires careful FP16 training due to numerical stability considerations
+ - The top-k selection parameter needs tuning for different tasks
+ 
+ ## Citation
+ 
+ If you use this model in your research, please cite:
+ 
+ ```bibtex
+ @misc{pointer2024,
+   title={Pointer: Decoder-only Transformer with Relational Routing},
+   author={[Your Name]},
+   year={2024},
+   howpublished={\url{https://huggingface.co/[your-username]/pointer}}
+ }
+ ```
+ 
+ ## License
+ 
+ This model is released under the Apache 2.0 License.