OzTianlu committed (verified)
Commit e528314 · Parent(s): 77a5060

Update README.md

Files changed (1):
  1. README.md +60 -23
README.md CHANGED
@@ -51,14 +51,18 @@ PointerLayer:
  | Parameter | Value |
  |-----------|-------|
  | Architecture | Decoder-only Transformer |
- | Vocabulary Size | 50,032 |
- | Hidden Dimension (d) | 4,096 |
- | Number of Layers | 48 |
- | Attention Heads | 32 |
+ | Model Size | Pointer-300M |
+ | Vocabulary Size | Dynamic (based on tokenizer) |
+ | Hidden Dimension (d) | 1,024 |
+ | Number of Layers | 24 |
+ | Attention Heads | 16 |
  | Top-k Selection | 2 |
  | FFN Expansion Ratio | 2.7 |
- | Sequence Length | 4,096 |
- | Parameters | ~6B |
+ | Maximum Sequence Length | 4,096 |
+ | Parameters | ~300M |
+ | Dropout | 0.1 |
+ | FP16 Training | Yes |
+ | Tied Embeddings | Yes |

  ## Training Details

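As a quick sanity check on the `~300M` figure in the updated table, the sketch below estimates the parameter count of a plain decoder-only Transformer from the values above. The 50k vocabulary is an assumed placeholder (the README leaves it tokenizer-dependent), and the pointer/routing weights, layer norms, and biases are ignored, so this is only a ballpark estimate.

```python
def approx_param_count(d=1024, n_layers=24, ffn_ratio=2.7,
                       vocab_size=50_000, tied_embeddings=True):
    """Rough dense-Transformer estimate; pointer/routing weights, norms,
    and biases are not counted."""
    attn_per_layer = 4 * d * d                  # W_q, W_k, W_v, W_o
    ffn_per_layer = 2 * int(ffn_ratio * d) * d  # up- and down-projection
    embeddings = vocab_size * d                 # counted once when tied
    if not tied_embeddings:
        embeddings *= 2
    return n_layers * (attn_per_layer + ffn_per_layer) + embeddings

print(f"~{approx_param_count() / 1e6:.0f}M")  # ≈ 288M with a 50k vocab, consistent with "~300M"
```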
@@ -72,13 +76,40 @@ The model was trained using Mix-Distillation following the "Small Models Struggl

  ### Training Hyperparameters
  ```yaml
- batch_size: 1024
- learning_rate: 3e-4
+ num_epochs: 2
+ per_device_batch_size: 4
+ gradient_accumulation_steps: 4
+ effective_batch_size: 16 # 4 * 4
+ learning_rate: 2e-4
+ lr_scheduler: cosine
  warmup_ratio: 0.05
- sequence_length: 4096
- optimizer: AdamW
+ weight_decay: 0.01
+ save_steps: 1000
+ eval_steps: 500
+ logging_steps: 50
+ fp16: true
  ```

+ ### Distillation Configuration
+ ```yaml
+ temperature: 2.0
+ alpha: 0.5 # KD loss weight
+ beta: 1.0 # CE loss weight
+ gamma: 0.5 # Additional loss weight
+ use_kd_loss: true
+ use_ce_loss: true
+ use_hidden_mse: false
+ use_pointer_kl: false
+ ```
+
+ ### Training Data
+ - **Dataset Size**: 110,000 samples from Chinese-DeepSeek-R1-Distill
+ - **CoT Distribution**:
+   - Long-CoT: 22,000 samples (20%)
+   - Short-CoT: 88,000 samples (80%)
+ - **Sequence Length**: 21-2,048 tokens (mean: 885, median: 721)
+ - **Quality Scores**: 7-10 (mean: 9.09)
+
  ### Loss Components
  - **Cross-Entropy Loss**: Standard language modeling objective
  - **Hidden State MSE**: Knowledge distillation from teacher hidden states
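The `effective_batch_size: 16 # 4 * 4` entry above is simply `per_device_batch_size * gradient_accumulation_steps`. A minimal sketch of that accumulation pattern is shown below; it is not the repository's training loop, and it assumes the micro-batch iterable yields `(input_ids, labels)` pairs and that `model(input_ids)` returns logits as in the usage example.

```python
import torch.nn.functional as F

def train_with_accumulation(model, optimizer, micro_batches, accumulation_steps=4):
    """Accumulate gradients over 4 micro-batches of 4 sequences each,
    so one optimizer step corresponds to an effective batch size of 16."""
    optimizer.zero_grad()
    for step, (input_ids, labels) in enumerate(micro_batches, start=1):
        logits = model(input_ids)  # (batch, seq_len, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               labels.view(-1), ignore_index=-100)
        (loss / accumulation_steps).backward()  # scale so the update matches the full batch
        if step % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```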
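For the distillation configuration above, a common way to combine the weighted terms is a temperature-scaled KL divergence against the teacher plus a weighted cross-entropy on the gold tokens; the sketch below shows that combination under the listed `temperature`, `alpha`, and `beta`. The function name is illustrative rather than taken from this repository, and the `gamma`-weighted term, hidden-state MSE, and pointer KL are omitted because they are disabled or unspecified in this configuration.

```python
import torch.nn.functional as F

def mix_distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5, beta=1.0):
    # Cross-entropy against the gold next tokens (weight: beta)
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1), ignore_index=-100)
    # Temperature-scaled KL divergence to the teacher distribution (weight: alpha)
    t = temperature
    kd = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                  F.softmax(teacher_logits / t, dim=-1),
                  reduction="batchmean") * (t * t)
    return beta * ce + alpha * kd
```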
@@ -108,18 +139,22 @@ Extensive NaN detection and handling throughout the forward pass, including:
  import torch
  from src.model.pointer_model import PointerDecoder

- # Initialize model
+ # Initialize Pointer-300M model with your config
  model = PointerDecoder(
-     vocab_size=50032,
-     d=4096,
-     n_layers=48,
-     n_heads=32,
-     top_k=2,
-     r=2.7
+     vocab_size=tokenizer.vocab_size,  # Dynamic based on tokenizer
+     d=1024,                           # Hidden dimension
+     n_layers=24,                      # Number of layers
+     n_heads=16,                       # Attention heads
+     top_k=2,                          # Pointer selection
+     r=2.7,                            # FFN expansion ratio
+     max_seq_len=4096,                 # Max sequence length
+     dropout=0.1,                      # Dropout rate
+     tie_embeddings=True,              # Tie input/output embeddings
+     fp16=True                         # FP16 training
  )

  # Forward pass
- input_ids = torch.randint(0, 50032, (1, 100))
+ input_ids = torch.randint(0, tokenizer.vocab_size, (1, 100))
  logits = model(input_ids)

  # Inference with caching
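Building on the usage snippet, the following is a sketch of greedy decoding. It assumes the forward pass returns logits of shape `(batch, seq_len, vocab)` as in the example, and that `tokenizer` is any tokenizer exposing `encode`, `decode`, and `eos_token_id`; for simplicity it re-runs the full prefix at every step instead of using the caching interface referenced above.

```python
import torch

@torch.no_grad()
def greedy_generate(model, tokenizer, prompt, max_new_tokens=64):
    model.eval()
    input_ids = torch.tensor([tokenizer.encode(prompt)])  # (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(input_ids)                         # (1, seq_len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0].tolist())
```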
@@ -152,17 +187,19 @@ src/
  - Currently supports only left-to-right generation (no bidirectional)
  - Requires careful FP16 training due to numerical stability considerations
  - Top-k selection parameter needs tuning for different tasks
+ - At ~300M parameters, capacity is limited compared to larger language models
+ - Trained primarily on Chinese data with DeepSeek-R1 distillation

  ## Citation

  If you use this model in your research, please cite:

  ```bibtex
- @misc{pointer2024,
-   title={Pointer: Decoder-only Transformer with Relational Routing},
-   author={[Your Name]},
-   year={2024},
-   howpublished={\url{https://huggingface.co/[your-username]/pointer}}
+ @misc{pointer300m2025,
+   title={Pointer-300M: Decoder-only Transformer with Relational Routing},
+   author={Noesis Lab},
+   year={2025},
+   howpublished={\url{https://huggingface.co/NoesisLab/Pointer-300M}}
  }
  ```