---
language:
- en
license: apache-2.0
tags:
- text-generation
- transformers
- pytorch
- custom-implementation
- language-model
- educational
library_name: transformers
pipeline_tag: text-generation
---

# SmolLM2-135M-Dissecting

A custom implementation of the SmolLM2-135M language model architecture, trained from scratch for educational purposes. This project demonstrates building a transformer-based language model with 147.8M parameters.

## Model Description

This is a **custom implementation** that mimics the SmolLM2-135M architecture. It was built from scratch to understand the inner workings of small language models and includes:

- Custom transformer blocks with multi-head attention
- Rotary Position Embeddings (RoPE)
- SwiGLU activation functions
- Layer normalization and residual connections

**Note**: This is an educational implementation trained on a small dataset. For production use, consider the official [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M) model.
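
As a concrete illustration of the components listed above, here is a minimal PyTorch sketch of a SwiGLU feed-forward block using this model's sizes. The class and attribute names are illustrative assumptions and may differ from the actual `model.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """SwiGLU MLP: down(silu(gate(x)) * up(x)), the pattern used in SmolLM2-style blocks."""

    def __init__(self, hidden_size: int = 576, intermediate_size: int = 1536):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-gated linear unit, then projection back to the hidden size
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```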

## Model Details

- **Model Type**: Causal Language Model (Decoder-only Transformer)
- **Architecture**: Custom SmolLM2-135M implementation
- **Total Parameters**: 147,821,184
- **Training Dataset**: Custom text dataset (1,115,394 characters)
- **Training Steps**: 5,000
- **Language**: English
- **License**: Apache 2.0

### Architecture Specifications

- **Vocabulary Size**: 49,152
- **Hidden Size**: 576
- **Number of Layers**: 30
- **Attention Heads**: 9
- **Intermediate Size**: 1,536
- **Max Position Embeddings**: 2,048
- **Head Dimension**: 64
- **Activation Function**: SwiGLU
- **Position Embedding**: Rotary Position Embedding (RoPE)
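
The specifications above map directly onto a small configuration object. A minimal sketch is shown below; the field names are illustrative assumptions, not necessarily the exact `ModelConfig` API in `model.py`:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Values from the Architecture Specifications above;
    # field names are assumptions for illustration.
    vocab_size: int = 49152
    hidden_size: int = 576
    num_layers: int = 30
    num_heads: int = 9
    intermediate_size: int = 1536
    max_position_embeddings: int = 2048
    head_dim: int = 64  # hidden_size // num_heads = 576 // 9
```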

## Training Process

### Initialization

The training started with model initialization on CPU:

```
Using device: cpu
Initializing custom model...
Total parameters: 147,821,184
```

### Dataset Preparation

The tokenizer loaded successfully, and the input text was tokenized:

```
Loading tokenizer...
tokenizer_config.json: 3.66kB [00:00, 2.50MB/s]
vocab.json: 801kB [00:00, 5.63MB/s]
merges.txt: 466kB [00:00, 5.45MB/s]
tokenizer.json: 2.10MB [00:00, 7.78MB/s]
```

The training dataset consisted of (a chunking sketch follows the list):

- 666 chunks of 512 tokens each
- Batch size: 4
- Steps per epoch: 167
- Total training steps: 5,000
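
The chunk and step counts follow from simple arithmetic. The sketch below is an illustrative reconstruction of the preprocessing, not necessarily the exact code in `train.py` (details such as remainder handling may differ):

```python
import torch
from transformers import AutoTokenizer

# Illustrative reconstruction of the chunking step.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
text = open("input.txt", encoding="utf-8").read()  # 1,115,394 characters
ids = tokenizer.encode(text)

seq_len = 512
n_chunks = len(ids) // seq_len  # 666 full chunks for this corpus
chunks = torch.tensor(ids[: n_chunks * seq_len]).view(n_chunks, seq_len)

batch_size = 4
steps_per_epoch = -(-n_chunks // batch_size)  # ceil(666 / 4) = 167
# 5,000 total steps / 167 steps per epoch is roughly 30 epochs.
```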

## Training Progress

### Loss Reduction Over Time

The model showed consistent improvement throughout training. Improvement is measured relative to the step-500 loss:

| Step | Loss | Improvement |
|------|------|-------------|
| 0 (initial) | N/A | - |
| 500 | 4.6897 | Baseline |
| 1000 | 4.0074 | -14.6% |
| 1500 | 3.4715 | -26.0% |
| 2000 | 2.8648 | -38.9% |
| 2500 | 2.2658 | -51.7% |
| 3000 | 1.5617 | -66.7% |
| 3500 | 1.0885 | -76.8% |
| 4000 | 0.8004 | -82.9% |
| 4500 | 0.5178 | -89.0% |
| 5000 (final) | 0.3271 | -93.0% |

### Model Generation Quality Improvement

Text generation improved markedly over training, from random tokens at step 0 to recognizable play-script dialogue by step 5,000:

**Step 0 (Before Training)**:
```
Generated: What is English Muscle Kelly flossing towardsimatingćBind outrageroutine dreTClywood loudly brightness hardships
```

**Step 500**:
```
Generated: What is Englishour.
HOLANIO:
My name you
To the king, I'll tell this in theREM;
```

**Step 1000**:
```
Generated: What is English's They knows no their place?
ISABELLA:
Speak me:
I am a grave to the maid and sh son.
```

**Step 2000**:
```
Generated: What is English'd to say theAnd I will come.
KING EDWARD IV:
Go, Warwick, in all my friends, my lords.
```

**Step 5000 (Final)**:
```
Generated: What is English quarter
To frame of the people to himself.
CAMILLO:
God and your noble lord,
She does do much need on't.
```

### Loss Convergence

The loss curve showed a rapid initial drop followed by steady improvement over roughly 30 epochs:

- **Epochs 1-3**: Rapid initial decrease from ~9.6 to ~4.7
- **Epochs 4-10**: Continued improvement to ~3.9
- **Epochs 11-20**: Moderate improvement to ~2.0
- **Epochs 21-30**: Final optimization to ~0.3

## Model Architecture Verification

After training, the custom model's parameter names (state-dict keys) were compared against those of the official SmolLM2-135M:

```
Custom model parameters: 364
Official model parameters: 273
Matching parameters: 1
Only in custom: 363
Only in official: 272
```

The comparison shows that the two implementations use almost entirely different parameter naming schemes: only a single key matches directly, so weights cannot be transferred between them without an explicit key-mapping step.
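
A comparison along these lines can be produced with a short script. The sketch below is a reconstruction, not the exact script used; it assumes the checkpoint layout from the loading example in the Usage section:

```python
import torch
from transformers import AutoModelForCausalLM

# Compare state-dict key names (not weight values) between the two models.
custom_state = torch.load("checkpoints/final_model.pt", map_location="cpu")
custom_keys = set(custom_state["model_state_dict"].keys())

official = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
official_keys = set(official.state_dict().keys())

print(f"Custom model parameters: {len(custom_keys)}")
print(f"Official model parameters: {len(official_keys)}")
print(f"Matching parameters: {len(custom_keys & official_keys)}")
print(f"Only in custom: {len(custom_keys - official_keys)}")
print(f"Only in official: {len(official_keys - custom_keys)}")
```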

## Usage

### Loading the Model

```python
import torch
from model import CustomSmolLM, ModelConfig
from transformers import AutoTokenizer

# Initialize the model configuration
config = ModelConfig()

# Build the model and load the trained weights
model = CustomSmolLM(config)
checkpoint = torch.load('checkpoints/final_model.pt', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load the tokenizer (this model reuses the official SmolLM2 tokenizer)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
```

### Text Generation

```python
import torch
import torch.nn.functional as F

def generate_text(model, tokenizer, prompt, max_length=50, temperature=0.8):
    model.eval()
    device = next(model.parameters()).device

    # Tokenize the prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

    with torch.no_grad():
        for _ in range(max_length):
            outputs = model(input_ids)
            logits = outputs['logits']

            # Sample the next token from the temperature-scaled distribution
            next_token_logits = logits[:, -1, :] / temperature
            probs = F.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            input_ids = torch.cat([input_ids, next_token], dim=1)

            # Stop early at the end-of-sequence token
            if next_token.item() == tokenizer.eos_token_id:
                break

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Generate text
generated = generate_text(model, tokenizer, "Once upon a time", max_length=50)
print(generated)
```

### Resuming Training

```python
from train import load_checkpoint

# Resume from a saved checkpoint
model, checkpoint = load_checkpoint(model, 'checkpoints/checkpoint_step_500.pt')
print(f"Resumed from step {checkpoint['step']}")
```

## Training Configuration

- **Learning Rate**: 1e-4
- **Optimizer**: AdamW with betas (0.9, 0.95)
- **Weight Decay**: 0.1
- **Gradient Clipping**: 1.0
- **Batch Size**: 4
- **Sequence Length**: 512 tokens
- **Checkpoint Frequency**: Every 500 steps
- **Device**: CPU (GPU recommended for faster training)
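
Taken together, these settings correspond to an optimizer and training step roughly like the sketch below. This is an assumption-level reconstruction, not the exact `train.py` code; `model` is the trained `CustomSmolLM` instance from the Usage section:

```python
import torch
import torch.nn.functional as F

# AdamW with the hyperparameters listed above
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.95), weight_decay=0.1
)

def train_step(batch):
    # batch: (4, 512) token ids; the model predicts each next token
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)['logits']
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm at 1.0 before the optimizer step
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item()
```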

## Intended Uses

This model is designed for:

- Educational purposes and understanding transformer architectures
- Experimenting with small-scale language model training
- Learning about PyTorch implementations of modern LLM components
- Demonstrating custom model architecture development

## Limitations

- Trained on a small dataset (1.1M characters), which limits generalization
- Only 5,000 training steps, far fewer than production models
- No evaluation on standardized benchmarks
- Parameter naming diverges from the official SmolLM2-135M, so weights are not interchangeable
- Not suitable for production use cases
- May produce inconsistent or incorrect text
258
+ ## Ethical Considerations
259
+
260
+ This is an educational model trained on a small dataset. Users should:
261
+
262
+ - Not rely on it for factual information
263
+ - Be aware it may generate biased or inappropriate content
264
+ - Use it only for learning and experimentation
265
+ - Consider the official SmolLM2-135M for any serious applications
266
+
267
+ ## Citation
268
+
269
+ If you use this implementation in your research or projects, please cite:
270
+
271
+ ```bibtex
272
+ @misc{smollm2-135m-dissecting,
273
+ title={SmolLM2-135M-Dissecting: A Custom Implementation for Educational Purposes},
274
+ author={agileabhi},
275
+ year={2025},
276
+ howpublished={\url{https://huggingface.co/spaces/agileabhi/SmolLM2-135M-Model}}
277
+ }
278
+ ```
279
+
280
+ Also consider citing the original SmolLM2 work from Hugging Face.
281
+
282
+ ## Acknowledgments
283
+
284
+ - Based on the SmolLM2-135M architecture by Hugging Face
285
+ - Uses the official SmolLM2 tokenizer
286
+ - Inspired by modern transformer implementations
287
+
288
+ ## Repository Structure
289
+
290
+ - `model.py`: Custom model architecture implementation
291
+ - `train.py`: Training script with checkpointing and evaluation
292
+ - `app.py`: Gradio demo interface
293
+ - `strip_weights.py`: Utility for model weight management
294
+ - `upload_to_spaces.py`: Hugging Face Spaces deployment script
295
+ - `checkpoints/`: Model checkpoints saved during training
296
+ - `input.txt`: Training data file
297
+
298
+ ## Contact
299
+
300
+ For questions or issues, please open an issue on the GitHub repository.