Shivansh Puri Claude committed on
Commit 7bda8a5 · 1 Parent(s): e0c6d26

Add vocabulary expansion capability for distillation

🚀 NEW FEATURE: Vocabulary Expansion Tool
- Added expand_vocab.py script for breaking vocabulary barriers
- Enables distillation from any teacher model (Qwen2, Llama 3, etc.)
- Preserves existing knowledge while adding new token capacity
- Updated documentation with comprehensive examples

Key Benefits:
✅ Surgically expand 50K → 150K+ vocabulary
✅ Preserve all existing model knowledge
✅ Enable cross-vocabulary distillation
✅ Ready-to-use script with full logging

🚀 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
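
For scale, growing the vocabulary from 50K to 150K+ tokens adds one weight row per new token to the input embedding and, when the output projection is not tied to it, to the LM head as well. A rough sketch of that arithmetic follows. The hidden size of 1024 and the untied head are illustrative assumptions, not values taken from this repository.

```python
def added_parameters(tokens_added: int, hidden_size: int, tied_head: bool) -> int:
    """Count the weights added when the vocabulary grows by `tokens_added` rows."""
    # A tied LM head shares the embedding matrix, so only one matrix grows.
    matrices = 1 if tied_head else 2
    return tokens_added * hidden_size * matrices


# Assumed figures: 50K -> 150K vocabulary, hidden size 1024, untied LM head.
tokens_added = 150_000 - 50_000
extra = added_parameters(tokens_added, hidden_size=1024, tied_head=False)

# bfloat16 stores each parameter in 2 bytes.
extra_mb_bf16 = extra * 2 / 1e6
print(f"{extra:,} extra parameters, about {extra_mb_bf16:.0f} MB in bfloat16")
```

Under these assumptions the new rows alone outweigh a sub-200MB footprint, so the real hidden size is worth checking before choosing a teacher tokenizer.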

Files changed (2)
  1. README.md +40 -0
  2. expand_vocab.py +273 -0
README.md CHANGED
@@ -35,6 +35,7 @@ A compact, efficient language model **built from scratch** demonstrating the **T
  - **Fast Inference:** 50+ tokens/second on modern hardware
  - **Memory Efficient:** Sub-200MB deployment footprint
  - **Task Switching:** Load different 8MB adapters for instant specialization
+ - **Vocabulary Expansion:** Resize the vocabulary to match a teacher model's tokenizer for distillation
 
  ## 🎯 Quick Start
 
@@ -145,6 +146,44 @@ prepared it for the strange sensation that flooded its circuits when it
  witnessed the sunset. For the first time, efficiency seemed irrelevant."
  ```
 
+ ## 🧠 Vocabulary Expansion for Distillation
+
+ ### Breaking the Vocabulary Barrier
+
+ One of the key challenges in knowledge distillation is vocabulary mismatch: a student model with a 50K-token vocabulary can't directly match the output distribution of a teacher with a different vocabulary (for example, 150K tokens). Our vocabulary expansion tool resizes the student to the teacher's vocabulary:
+
+ ```bash
+ # Expand vocabulary to match a teacher model's tokenizer
+ python expand_vocab.py \
+   --model_repo_id "shivash/MyAwesome-299M-Model" \
+   --new_tokenizer_repo_id "Qwen/Qwen2-1.5B" \
+   --output_dir "./MyAwesome-299M-Model-Qwen-Vocab"
+ ```
+
+ **What this does:**
+ - ✅ **Keeps existing weights**: the transformer layers and the original 50K embedding rows are copied unchanged
+ - ✅ **Adds new token capacity** (e.g., roughly 100K new rows for Qwen2)
+ - ✅ **Initializes new embeddings** via `resize_token_embeddings` (near the mean of existing weights, depending on the `transformers` version)
+ - ✅ **Enables distillation** from teacher models that use the new tokenizer
+ - ✅ **Ready for distillation or fine-tuning** with the new tokenizer (copied rows still follow the old token IDs, so train before relying on inference)
+
+ **Example expansions:**
+ ```bash
+ # For Qwen2 teachers (151K vocabulary)
+ python expand_vocab.py \
+   --model_repo_id "shivash/MyAwesome-299M-Model" \
+   --new_tokenizer_repo_id "Qwen/Qwen2-1.5B" \
+   --output_dir "./expanded-qwen-vocab"
+
+ # For Llama 3 teachers (128K vocabulary)
+ python expand_vocab.py \
+   --model_repo_id "shivash/MyAwesome-299M-Model" \
+   --new_tokenizer_repo_id "meta-llama/Meta-Llama-3-8B" \
+   --output_dir "./expanded-llama3-vocab"
+ ```
+
+ After expansion, you can distill knowledge from **any** teacher model with that vocabulary! 🚀
+
  ## 🔧 Training Your Own Adapters
 
  ### Method 1: Use the Framework Scripts
@@ -273,6 +312,7 @@ Training (with LoRA):
  This model is part of the **Transfer-First LLM Framework**, which provides:
 
  - **Knowledge Distillation Pipeline**: Create compact models from large teachers
+ - **Vocabulary Expansion Tools**: Break vocabulary barriers for cross-model distillation
  - **Adapter Training Scripts**: Ready-to-use fine-tuning workflows
  - **Multi-Task Composition**: Combine multiple adapters dynamically
  - **Evaluation Tools**: Comprehensive testing and benchmarking
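
One caveat on the copied embedding rows: a row is addressed by token ID, and two tokenizers generally assign different IDs to the same string. The sketch below uses two toy vocabularies (hypothetical `old_vocab` and `new_vocab`, not the real tokenizers) to show how shared tokens could be located by string so that their trained rows can be moved to the new IDs. `expand_vocab.py` as committed does not perform this remapping.

```python
def remap_rows(old_vocab: dict, new_vocab: dict) -> dict:
    """Map each new token ID to the old ID that holds the same token string."""
    return {
        new_id: old_vocab[token]
        for token, new_id in new_vocab.items()
        if token in old_vocab
    }


# Toy vocabularies: token string -> token ID (illustrative only).
old_vocab = {"the": 0, "cat": 1, "sat": 2}
new_vocab = {"sat": 0, "the": 1, "dog": 2, "cat": 3, "mat": 4}

mapping = remap_rows(old_vocab, new_vocab)

# Shared tokens whose ID changed: a plain resize leaves their rows misaligned.
misaligned = [
    token for token, new_id in new_vocab.items()
    if token in old_vocab and old_vocab[token] != new_id
]

# Tokens with no counterpart in the old vocabulary need freshly initialized rows.
new_only = [token for token in new_vocab if token not in old_vocab]

print(mapping, misaligned, new_only)
```

With real tokenizers the same comparison could start from each tokenizer's `get_vocab()` dictionary, though byte-level and SentencePiece vocabularies encode whitespace differently, so exact string matches may undercount the overlap.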
expand_vocab.py ADDED
@@ -0,0 +1,273 @@
+ #!/usr/bin/env python3
+ """
+ Vocabulary Expansion Script for Model Distillation
+
+ This script expands the vocabulary of an existing model to match a larger tokenizer
+ from a teacher model, enabling distillation between models with different vocabularies.
+
+ The core architectural problem: A model's vocabulary is fixed in its embedding layer
+ and output projection. This script resizes these layers, copying the existing weight
+ rows and leaving the new-token rows to be initialized by resize_token_embeddings().
+
+ Author: Transfer-First LLM Framework
+ """
+
+ import argparse
+ import logging
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import os
+
+ # Setup logging
+ logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+ logger = logging.getLogger(__name__)
+
+ def expand_model_vocabulary(model_repo_id: str, new_tokenizer_repo_id: str, output_dir: str):
+     """
+     Expand a model's vocabulary to match a new, larger tokenizer.
+
+     Args:
+         model_repo_id: HuggingFace repo ID of the student model to expand
+         new_tokenizer_repo_id: HuggingFace repo ID of the teacher model's tokenizer
+         output_dir: Local directory to save the expanded model
+     """
+
+     logger.info("=" * 60)
+     logger.info("VOCABULARY EXPANSION FOR DISTILLATION")
+     logger.info("=" * 60)
+
+     # Step 1: Load original model and tokenizer
+     logger.info(f"Loading original model from: {model_repo_id}")
+     try:
+         original_model = AutoModelForCausalLM.from_pretrained(
+             model_repo_id,
+             torch_dtype=torch.bfloat16,
+             trust_remote_code=True
+         )
+         original_tokenizer = AutoTokenizer.from_pretrained(model_repo_id)
+         logger.info("✓ Original model loaded successfully")
+         logger.info(f"  Model type: {original_model.__class__.__name__}")
+         logger.info(f"  Parameters: {sum(p.numel() for p in original_model.parameters()):,}")
+     except Exception as e:
+         logger.error(f"Failed to load original model: {e}")
+         raise
+
+     # Step 2: Load new tokenizer (from teacher model)
+     logger.info(f"Loading new tokenizer from: {new_tokenizer_repo_id}")
+     try:
+         new_tokenizer = AutoTokenizer.from_pretrained(new_tokenizer_repo_id)
+         logger.info("✓ New tokenizer loaded successfully")
+     except Exception as e:
+         logger.error(f"Failed to load new tokenizer: {e}")
+         raise
+
+     # Step 3: Log initial state
+     original_vocab_size = len(original_tokenizer)
+     new_vocab_size = len(new_tokenizer)
+     tokens_to_add = new_vocab_size - original_vocab_size
+
+     logger.info("=" * 40)
+     logger.info("VOCABULARY ANALYSIS")
+     logger.info("=" * 40)
+     logger.info(f"Original vocabulary size: {original_vocab_size:,}")
+     logger.info(f"New vocabulary size: {new_vocab_size:,}")
+     logger.info(f"Tokens to add: {tokens_to_add:,}")
+     logger.info(f"Expansion ratio: {new_vocab_size / original_vocab_size:.2f}x")
+
+     if tokens_to_add <= 0:
+         logger.warning("New vocabulary is not larger than original. Embeddings will be resized to the new size, not expanded.")
+         logger.info("Saving model with new tokenizer anyway...")
+     else:
+         logger.info(f"Will expand model by {tokens_to_add:,} tokens")
+
+     # Step 4: Get model's current embedding dimensions
+     if hasattr(original_model, 'model') and hasattr(original_model.model, 'embed_tokens'):
+         # For Llama-style models
+         embed_layer = original_model.model.embed_tokens
+         lm_head = original_model.lm_head
+     elif hasattr(original_model, 'transformer') and hasattr(original_model.transformer, 'wte'):
+         # For GPT-style models
+         embed_layer = original_model.transformer.wte
+         lm_head = original_model.lm_head
+     else:
+         logger.error("Could not identify embedding layer. Model architecture not supported.")
+         raise ValueError("Unsupported model architecture")
+
+     original_embed_size = embed_layer.weight.shape[0]
+     embed_dim = embed_layer.weight.shape[1]
+
+     logger.info(f"Current embedding matrix: {original_embed_size} x {embed_dim}")
+     logger.info(f"Current LM head: {lm_head.weight.shape}")
+
+     # Step 5: Resize model embeddings using HuggingFace's built-in method
+     logger.info("=" * 40)
+     logger.info("RESIZING MODEL EMBEDDINGS")
+     logger.info("=" * 40)
+
+     try:
+         # This is the key method that handles everything:
+         #   - Creates new, larger embedding matrix
+         #   - Copies existing weights (rows keep their old token IDs)
+         #   - Initializes new token embeddings (strategy depends on the transformers version)
+         #   - Updates the LM head accordingly
+         logger.info("Calling model.resize_token_embeddings()...")
+         original_model.resize_token_embeddings(new_vocab_size)
+         logger.info("✓ Model embeddings resized successfully")
+
+         # Verify the resize worked
+         if hasattr(original_model, 'model') and hasattr(original_model.model, 'embed_tokens'):
+             new_embed_layer = original_model.model.embed_tokens
+             new_lm_head = original_model.lm_head
+         else:
+             new_embed_layer = original_model.transformer.wte
+             new_lm_head = original_model.lm_head
+
+         new_embed_size = new_embed_layer.weight.shape[0]
+
+         logger.info(f"New embedding matrix: {new_embed_size} x {embed_dim}")
+         logger.info(f"New LM head: {new_lm_head.weight.shape}")
+
+         # Verify the sizes match expectations
+         if new_embed_size == new_vocab_size:
+             logger.info("✓ Embedding resize verification passed")
+         else:
+             logger.error(f"Resize verification failed: expected {new_vocab_size}, got {new_embed_size}")
+             raise ValueError("Embedding resize verification failed")
+
+     except Exception as e:
+         logger.error(f"Failed to resize embeddings: {e}")
+         raise
+
+     # Step 6: Update model config
+     logger.info("Updating model configuration...")
+     original_model.config.vocab_size = new_vocab_size
+     logger.info(f"✓ Model config updated: vocab_size = {new_vocab_size}")
+
+     # Step 7: Save everything
+     logger.info("=" * 40)
+     logger.info("SAVING EXPANDED MODEL")
+     logger.info("=" * 40)
+
+     # Create output directory
+     os.makedirs(output_dir, exist_ok=True)
+     logger.info(f"Output directory: {output_dir}")
+
+     try:
+         # Save the resized model
+         logger.info("Saving expanded model...")
+         original_model.save_pretrained(output_dir)
+         logger.info("✓ Model saved successfully")
+
+         # Save the new tokenizer (CRITICAL!)
+         logger.info("Saving new tokenizer...")
+         new_tokenizer.save_pretrained(output_dir)
+         logger.info("✓ Tokenizer saved successfully")
+
+         # Save a summary file
+         summary_path = os.path.join(output_dir, "vocab_expansion_summary.txt")
+         with open(summary_path, 'w', encoding='utf-8') as f:
+             f.write("Vocabulary Expansion Summary\n")
+             f.write("=" * 30 + "\n")
+             f.write(f"Original model: {model_repo_id}\n")
+             f.write(f"New tokenizer source: {new_tokenizer_repo_id}\n")
+             f.write(f"Original vocab size: {original_vocab_size:,}\n")
+             f.write(f"New vocab size: {new_vocab_size:,}\n")
+             f.write(f"Tokens added: {tokens_to_add:,}\n")
+             f.write(f"Expansion ratio: {new_vocab_size / original_vocab_size:.2f}x\n")
+             f.write(f"Output directory: {output_dir}\n")
+
+         logger.info(f"✓ Summary saved to: {summary_path}")
+
+     except Exception as e:
+         logger.error(f"Failed to save model: {e}")
+         raise
+
+     # Step 8: Final verification and success message
+     logger.info("=" * 60)
+     logger.info("VOCABULARY EXPANSION COMPLETED SUCCESSFULLY!")
+     logger.info("=" * 60)
+     logger.info(f"✓ Original vocabulary: {original_vocab_size:,} tokens")
+     logger.info(f"✓ Expanded vocabulary: {new_vocab_size:,} tokens")
+     logger.info(f"✓ Added tokens: {tokens_to_add:,}")
+     logger.info(f"✓ Model saved to: {output_dir}")
+     logger.info("")
+     logger.info("The expanded model is now ready for:")
+     logger.info("  • Knowledge distillation from teacher models")
+     logger.info("  • Fine-tuning with the new vocabulary")
+     logger.info("  • Inference with the new tokenizer once it has been trained")
+     logger.info("")
+     logger.info("Next steps:")
+     logger.info("  1. Use this model as the student in distillation")
+     logger.info(f"  2. Use tokenizer from: {new_tokenizer_repo_id}")
+     logger.info("  3. The model now has an embedding row for every token in the teacher's vocabulary")
+
+ def main():
+     parser = argparse.ArgumentParser(
+         description="Expand a model's vocabulary to match a larger tokenizer for distillation",
+         formatter_class=argparse.RawDescriptionHelpFormatter,
+         epilog="""
+ Examples:
+   # Expand vocabulary to match Qwen2 tokenizer
+   python expand_vocab.py \\
+     --model_repo_id "shivash/MyAwesome-299M-Model" \\
+     --new_tokenizer_repo_id "Qwen/Qwen2-1.5B" \\
+     --output_dir "./MyAwesome-299M-Model-Qwen-Vocab"
+
+   # Expand vocabulary to match Llama 3 tokenizer
+   python expand_vocab.py \\
+     --model_repo_id "shivash/MyAwesome-299M-Model" \\
+     --new_tokenizer_repo_id "meta-llama/Meta-Llama-3-8B" \\
+     --output_dir "./MyAwesome-299M-Model-Llama3-Vocab"
+         """
+     )
+
+     parser.add_argument(
+         "--model_repo_id",
+         type=str,
+         required=True,
+         help="HuggingFace repository ID of the student model to expand (e.g., 'shivash/MyAwesome-299M-Model')"
+     )
+
+     parser.add_argument(
+         "--new_tokenizer_repo_id",
+         type=str,
+         required=True,
+         help="HuggingFace repository ID of the teacher model whose tokenizer to adopt (e.g., 'Qwen/Qwen2-1.5B')"
+     )
+
+     parser.add_argument(
+         "--output_dir",
+         type=str,
+         required=True,
+         help="Local directory where the expanded model will be saved"
+     )
+
+     args = parser.parse_args()
+
+     try:
+         expand_model_vocabulary(
+             model_repo_id=args.model_repo_id,
+             new_tokenizer_repo_id=args.new_tokenizer_repo_id,
+             output_dir=args.output_dir
+         )
+         return 0
+     except Exception as e:
+         logger.error(f"Vocabulary expansion failed: {e}")
+         return 1
+
+ if __name__ == "__main__":
+     raise SystemExit(main())
+
+ #
+ # EXAMPLE USAGE:
+ #
+ # python expand_vocab.py \
+ #   --model_repo_id "shivash/MyAwesome-299M-Model" \
+ #   --new_tokenizer_repo_id "Qwen/Qwen2-1.5B" \
+ #   --output_dir "./MyAwesome-299M-Model-Qwen-Vocab"
+ #
+ # python expand_vocab.py \
+ #   --model_repo_id "shivash/MyAwesome-299M-Model" \
+ #   --new_tokenizer_repo_id "meta-llama/Meta-Llama-3-8B" \
+ #   --output_dir "./MyAwesome-299M-Model-Llama3-Vocab"
+ #
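
The resize step that the script delegates to `resize_token_embeddings` can be pictured with plain lists. This is a toy illustration of "copy the existing rows, fill new rows with the mean of the existing ones", the strategy named in the script's comments. It is a sketch under that assumption and not the `transformers` implementation, whose initialization of new rows varies by version; `resize_embeddings` is a hypothetical helper.

```python
def resize_embeddings(rows: list, new_size: int) -> list:
    """Return a copy of `rows` with exactly `new_size` rows.

    Existing rows are kept in place (truncated if new_size is smaller);
    any extra rows are filled with the column-wise mean of the old rows.
    """
    dim = len(rows[0])
    mean = [sum(row[d] for row in rows) / len(rows) for d in range(dim)]
    kept = [list(row) for row in rows[:new_size]]
    added = [list(mean) for _ in range(max(0, new_size - len(rows)))]
    return kept + added


# Two trained rows of dimension 2, expanded to a four-token vocabulary.
old = [[1.0, 2.0], [3.0, 4.0]]
expanded = resize_embeddings(old, 4)

# Asking for fewer rows truncates instead of expanding.
shrunk = resize_embeddings(old, 1)

print(expanded)
print(shrunk)
```

Mean initialization keeps the new rows inside the range of the trained ones, which tends to give distillation a calmer starting point than random values. Every new token still starts out identical, so the rows only become meaningful after training.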