Rahul2020 committed on
Commit 4db5b72 · verified · 1 Parent(s): 2f9108b

Update README.md

Files changed (1)
  1. README.md +43 -534

README.md CHANGED
@@ -1,534 +1,43 @@
- # SmolLM2-135M Training Project
-
- A complete training pipeline for SmolLM2-135M, a lightweight language model based on the LLaMA architecture with 135 million parameters. This model has been fine-tuned exclusively on Shakespeare's **Coriolanus**, and therefore writes in the style of a dramatic play. The project includes optimized training scripts with multiple speedup techniques and a ready-to-deploy Gradio demo.
-
- ## 📊 Model Information
-
- ### Parameter Count
- - **Total Parameters:** 134,515,008 (134.52M)
- - **Trainable Parameters:** 134,515,008 (134.52M)
- - **Non-trainable Parameters:** 0
- - **Model Size:** ~270MB (FP16) / ~540MB (FP32)
-
- ### Specialization
- This model has been fine-tuned exclusively on Shakespeare's **Coriolanus**. As a result, it generates text in the style of a dramatic play, including:
- - Character names and dialogue
- - Stage directions
- - Shakespearean language and structure
- - Dramatic formatting
-
- ### Model Architecture
-
- The model follows the LLaMA (Large Language Model Meta AI) architecture with the following specifications:
-
- | Component | Value |
- |-----------|-------|
- | **Model Type** | LLaMA (Decoder-only Transformer) |
- | **Hidden Size** | 576 |
- | **Intermediate Size** | 1,536 |
- | **Number of Layers** | 30 |
- | **Attention Heads** | 9 |
- | **Key-Value Heads** | 3 (Grouped Query Attention) |
- | **Vocabulary Size** | 49,152 |
- | **Max Position Embeddings** | 8,192 |
- | **RoPE Theta** | 100,000 |
- | **RMSNorm Epsilon** | 1e-5 |
-
- ### Architecture Features
-
- - **Grouped Query Attention (GQA)**: Uses 3 KV heads with 9 query heads for efficient attention computation
- - **Rotary Position Embeddings (RoPE)**: Positional encoding with theta=100,000
- - **RMSNorm**: Root Mean Square Layer Normalization
- - **Tied Word Embeddings**: Input and output embeddings share weights
- - **Flash Attention**: Uses PyTorch SDPA (Scaled Dot Product Attention) for faster training and inference
-
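- As a sanity check, the published total can be re-derived from the table above. This is a back-of-the-envelope sketch assuming standard bias-free LLaMA projection shapes, not code from this repo:
-
- ```python
- # Recompute the parameter count from the architecture table.
- hidden, inter, layers, vocab = 576, 1536, 30, 49152
- n_heads, n_kv_heads = 9, 3
-
- head_dim = hidden // n_heads                       # 64
- kv_dim = n_kv_heads * head_dim                     # 192 (GQA: 3 KV heads)
-
- embed = vocab * hidden                             # tied with the LM head, counted once
- attn = 2 * hidden * hidden + 2 * hidden * kv_dim   # q/o projections + k/v projections
- mlp = 3 * hidden * inter                           # gate, up, down projections
- norms = 2 * hidden                                 # two RMSNorms per layer
-
- total = embed + layers * (attn + mlp + norms) + hidden  # + final RMSNorm
- print(f"{total:,}")  # 134,515,008 -- matches the reported count
- ```
-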
- ## 🚀 Speedup Optimizations
-
- This project implements several performance optimizations (combined in the sketch after this list):
-
- 1. **✅ Flash Attention (SDPA)**
-    - Uses PyTorch's `scaled_dot_product_attention` with `attn_implementation="sdpa"`
-    - Automatically uses flash-attention kernels when available
-    - Significantly faster attention computation
-
- 2. **✅ Autocast (Mixed Precision)**
-    - Automatic Mixed Precision (AMP) training
-    - Uses bfloat16 on supported GPUs, falls back to float16
-    - Reduces memory usage and speeds up training
-
- 3. **✅ Float32 Matmul Precision**
-    - Sets `torch.set_float32_matmul_precision("high")`
-    - Enables TF32 on Ampere+ GPUs (A100, RTX 30xx, etc.)
-    - Faster matrix multiplications with minimal precision loss
-
- 4. **✅ Power of 2 Optimization**
-    - Sequence length set to 256 (power of 2)
-    - Ensures all chunks are exactly power-of-2 sized
-    - Optimizes GPU memory alignment and computation efficiency
-
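- A condensed sketch of how these four pieces fit together (assumed wiring; `accelerate_train.py` is the authoritative version):
-
- ```python
- import torch
- from transformers import LlamaConfig, LlamaForCausalLM
-
- torch.set_float32_matmul_precision("high")            # (3) TF32 matmuls on Ampere+
-
- cfg = LlamaConfig(vocab_size=49152, hidden_size=576, intermediate_size=1536,
-                   num_hidden_layers=30, num_attention_heads=9, num_key_value_heads=3,
-                   attn_implementation="sdpa")          # (1) SDPA / flash kernels
- model = LlamaForCausalLM(cfg).to("cuda")
-
- # (2) bfloat16 where supported, float16 otherwise
- amp_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
-
- batch = torch.randint(0, cfg.vocab_size, (1, 256), device="cuda")  # (4) power-of-2 length
- with torch.autocast(device_type="cuda", dtype=amp_dtype):
-     loss = model(input_ids=batch, labels=batch).loss
- ```
-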
- ## 📁 Project Structure
-
- ```
- .
- ├── accelerate_train.py              # Main training script with Accelerate
- ├── accelerate_resume.py             # Resume training from checkpoint
- ├── train_from_scratch.py            # Simple training script (no Accelerate)
- ├── smollm2_135m_reverse_engineer.py # Reverse engineer model architecture from HF
- ├── upload_to_hub.py                 # Upload checkpoint to HuggingFace Hub
- ├── convert_to_fp16.py               # Convert FP32 checkpoint to FP16 (reduce size by 50%)
- ├── prepare_for_hf_space.py          # Prepare minimal checkpoint (remove optimizer)
- ├── config.py                        # Model configuration class
- ├── generate_config.py               # Config generation utility
- ├── app.py                           # Gradio demo for Hugging Face Spaces
- ├── input.txt                        # Training dataset (Shakespeare's Coriolanus)
- ├── checkpoint_5000/                 # Saved model checkpoint
- │   ├── config.json
- │   ├── model.safetensors
- │   └── optim.pt
- ├── smollm2_135m_reverse_engineered/ # Output from reverse engineering script
- │   ├── hf_config.json
- │   ├── hf_config.yaml
- │   └── smollm2_135m_training_skeleton.yaml
- └── README.md                        # This file
- ```
-
- ## 🛠️ Installation
-
- ### Requirements
-
- ```bash
- pip install torch transformers accelerate gradio pyyaml
- ```
-
- **Note:** `pyyaml` is required for the reverse engineering script (`smollm2_135m_reverse_engineer.py`).
-
- ### Optional (for better performance)
-
- ```bash
- # Flash Attention (if available)
- pip install flash-attn --no-build-isolation
-
- # Quantization support (for smaller model size in HF Spaces)
- pip install bitsandbytes
- ```
-
- **Note:** `bitsandbytes` is required for 8-bit/4-bit quantization in the Gradio app. This is useful for Hugging Face Spaces where model size is limited.
-
- ## 📖 Usage
-
- ### 0. Reverse Engineering Model Architecture (Optional)
-
- Before training, you may want to inspect and reverse engineer the SmolLM2-135M architecture from HuggingFace. This is useful for:
- - Understanding the exact model configuration
- - Extracting architecture parameters
- - Generating config files for training frameworks
- - Verifying parameter counts
-
- #### Basic Usage
-
- ```bash
- # Download and inspect the model config (no model weights loaded)
- python smollm2_135m_reverse_engineer.py
- ```
-
- This will:
- - Download the HuggingFace config and tokenizer
- - Display an architecture summary
- - Export configs to `smollm2_135m_reverse_engineered/`:
-   - `hf_config.json` - Raw HuggingFace config
-   - `hf_config.yaml` - Raw HuggingFace config (YAML format)
-   - `smollm2_135m_training_skeleton.yaml` - Training-style config skeleton
-
- #### Advanced Usage
-
- ```bash
- # Load model weights and compute parameter statistics
- python smollm2_135m_reverse_engineer.py --load-model --dtype bf16
-
- # Use a different model variant
- python smollm2_135m_reverse_engineer.py --model-id HuggingFaceTB/SmolLM2-135M-Instruct
-
- # Custom output directory
- python smollm2_135m_reverse_engineer.py --output-dir my_configs
- ```
-
- #### Command Line Options
-
- - `--model-id`: HuggingFace model ID (default: `HuggingFaceTB/SmolLM2-135M`)
- - `--output-dir`: Directory for exported configs (default: `smollm2_135m_reverse_engineered`)
- - `--load-model`: Load model weights and compute parameter stats (requires GPU memory)
- - `--dtype`: Data type for model loading (`auto`, `fp16`, `bf16`)
- - `--device`: Device for model loading (`auto` uses HF accelerate mapping)
-
- #### Output Files
-
- The script generates several useful files:
-
- 1. **`hf_config.json` / `hf_config.yaml`**: Raw HuggingFace configuration in both formats
- 2. **`smollm2_135m_training_skeleton.yaml`**: Training-style YAML with:
-    - Model architecture parameters
-    - Token IDs (BOS, EOS, PAD)
-    - Training hyperparameter placeholders
-    - Dataset configuration template
- 3. **`smollm2_135m_param_stats.json`**: Parameter statistics (only if `--load-model` is used)
-
- **Note:** The `config.py` file in this project was generated from the reverse-engineered architecture. You can use this script to verify or regenerate the configuration.
-
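- For orientation, the core of what the script automates can be reproduced by hand. This is a minimal sketch using standard `transformers` and `pyyaml` calls, not the script itself:
-
- ```python
- import json
- import yaml
- from transformers import AutoConfig
-
- # Download only the config (no weights) and dump it in both formats
- cfg = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-135M").to_dict()
- print(cfg["hidden_size"], cfg["num_hidden_layers"], cfg["num_key_value_heads"])
-
- with open("hf_config.json", "w") as f:
-     json.dump(cfg, f, indent=2)
- with open("hf_config.yaml", "w") as f:
-     yaml.safe_dump(cfg, f)
- ```
-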
- ### 1. Training from Scratch
-
- #### Using Accelerate (Recommended)
-
- ```bash
- # Configure accelerate (first time only)
- accelerate config
-
- # Start training
- accelerate launch accelerate_train.py
- ```
-
- The training script will:
- - Build the model from scratch using `SmolLMConfig`
- - Load and chunk the dataset from `input.txt`
- - Train for 5000 steps with automatic checkpointing
- - Save the model to `checkpoint_5000/`
-
- #### Using Simple Training Script
-
- ```bash
- python train_from_scratch.py
- ```
-
- ### 2. Resuming Training
-
- ```bash
- accelerate launch accelerate_resume.py
- ```
-
- This will:
- - Load the model from `checkpoint_5000/`
- - Resume the optimizer state
- - Continue training from the saved step
-
- ### 3. Running the Gradio Demo
-
- #### Local Demo
-
- ```bash
- python app.py
- ```
-
- Then open your browser to `http://localhost:7860`.
-
- #### Hugging Face Spaces
-
- **Option 1: Use Pretrained Model (Recommended for CPU-only Spaces)**
-
- Since checkpoint files are large (~270MB), the easiest approach is to use the pretrained model:
-
- 1. Create a new Space on Hugging Face (CPU or GPU)
- 2. Upload `app.py` and `requirements.txt` (do NOT upload checkpoint files)
- 3. The app will automatically load from `HuggingFaceTB/SmolLM2-135M` if no checkpoint is found
- 4. The Space will automatically deploy
-
- **CPU Performance Note:**
- - The app automatically detects CPU and loads models in float32
- - Generation will be slower on CPU (~5-10 seconds per generation)
- - For faster inference, consider using a GPU-enabled Space
-
- **Option 2: Upload Checkpoint to HuggingFace Hub (Recommended)**
-
- If you want to use your fine-tuned checkpoint, upload it to HuggingFace Hub as a model repository (not in the Space):
-
- 1. Install dependencies and log in:
-    ```bash
-    pip install huggingface_hub
-    huggingface-cli login
-    ```
-
- 2. **Convert to FP16 first** (if your model is FP32):
-    ```bash
-    python convert_to_fp16.py --checkpoint-dir checkpoint_5000 --output-dir checkpoint_fp16
-    ```
-
- 3. Upload the checkpoint (optimizer state excluded by default):
-    ```bash
-    python upload_to_hub.py --repo-id your-username/smollm2-135m-coriolanus --checkpoint-dir checkpoint_fp16
-    ```
-
-    **Required files uploaded:**
-    - ✅ `config.json` (~1KB) - Model configuration
-    - ✅ `model.safetensors` (~257MB FP16 or ~513MB FP32) - Model weights
-    - ✅ `generation_config.json` (~1KB) - Generation settings
-    - ❌ **Excludes `optim.pt`** (~200MB) - Not needed for inference
-    - ❌ **Tokenizer files** - Not needed (app uses the `HuggingFaceTB/SmolLM2-135M` tokenizer)
-
- 4. Set the environment variable in the HF Space settings:
-    - `HF_MODEL_ID`: `your-username/smollm2-135m-coriolanus`
-
- 5. Upload only `app.py` and `requirements.txt` to the Space (no checkpoint files!)
-
- **Size Reduction:**
- - Original checkpoint: ~713MB (FP32 model 513MB + optimizer 200MB)
- - After removing optimizer: ~513MB (FP32 model only)
- - After converting to FP16: ~257MB (50% reduction)
- - Space files: <1MB (just `app.py` and `requirements.txt`)
-
- **💡 Important:** If your model is saved in FP32 (~513MB), convert it to FP16 first:
- ```bash
- python convert_to_fp16.py --checkpoint-dir checkpoint_5000 --output-dir checkpoint_fp16
- python upload_to_hub.py --repo-id your-username/smollm2-135m-coriolanus --checkpoint-dir checkpoint_fp16
- ```
-
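- The conversion itself boils down to a dtype cast and a re-save. A minimal sketch of what `convert_to_fp16.py` is assumed to do (minus argument parsing and the `optim.pt` handling):
-
- ```python
- import torch
- from transformers import LlamaForCausalLM
-
- model = LlamaForCausalLM.from_pretrained("checkpoint_5000", torch_dtype=torch.float32)
- model.half()                              # cast all weights FP32 -> FP16 (~50% smaller)
- model.save_pretrained("checkpoint_fp16")  # writes model.safetensors + config files
- ```
-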
- **Option 3: Use Quantization (GPU Only - Smaller Model Size)**
-
- To reduce model size for HF Spaces with GPU:
-
- 1. Set the environment variable in the HF Space settings:
-    - `USE_QUANTIZATION`: `8bit` (for 8-bit) or `4bit` (for 4-bit quantization)
-
- 2. This reduces model size:
-    - 8-bit: ~135MB (50% reduction)
-    - 4-bit: ~68MB (75% reduction)
-
- 3. Upload `app.py` and `requirements.txt` to the Space
- 4. Add `bitsandbytes>=0.41.0` to `requirements.txt` (for GPU quantization)
-
- **Note:**
- - **Quantization requires a GPU** - it will NOT work on CPU-only Spaces
- - **For CPU-only Spaces**: Use Option 1 (pretrained model) - the app automatically detects CPU and loads without quantization
- - The app will automatically use float32 on CPU for compatibility
-
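- Roughly how an app would honor `USE_QUANTIZATION` on a GPU Space. This is a hedged sketch of the assumed logic, not a copy of `app.py`:
-
- ```python
- import os
- from transformers import AutoModelForCausalLM, BitsAndBytesConfig
-
- mode = os.environ.get("USE_QUANTIZATION", "")
- quant = None
- if mode == "8bit":
-     quant = BitsAndBytesConfig(load_in_8bit=True)
- elif mode == "4bit":
-     quant = BitsAndBytesConfig(load_in_4bit=True)
-
- model = AutoModelForCausalLM.from_pretrained(
-     os.environ.get("HF_MODEL_ID", "HuggingFaceTB/SmolLM2-135M"),
-     quantization_config=quant,   # requires CUDA + bitsandbytes when set
-     device_map="auto",
- )
- ```
-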
- ### 4. Using the Model Programmatically
-
- ```python
- from transformers import LlamaForCausalLM, GPT2TokenizerFast
- import torch
-
- # Load the fine-tuned model and the base tokenizer
- model = LlamaForCausalLM.from_pretrained("checkpoint_5000")
- model.eval()
- tokenizer = GPT2TokenizerFast.from_pretrained("HuggingFaceTB/SmolLM2-135M")
- tokenizer.pad_token = tokenizer.eos_token
-
- # Generate text (model writes in dramatic play style)
- prompt = "CORIOLANUS:"
- inputs = tokenizer(prompt, return_tensors="pt")
-
- with torch.no_grad():
-     outputs = model.generate(
-         inputs.input_ids,
-         attention_mask=inputs.attention_mask,  # avoids the attention-mask warning
-         pad_token_id=tokenizer.eos_token_id,   # avoids the pad-token warning
-         max_new_tokens=100,
-         temperature=0.8,
-         do_sample=True,
-         top_p=0.9,
-         top_k=50,
-     )
-
- generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
- print(generated_text)
- ```
-
- ## ⚙️ Configuration
-
- ### Model Definition
-
- The model is built from a simple configuration class defined in `config.py`:
-
- ```python
- # Auto-generated config for SmolLM2-135M
- class SmolLMConfig:
-     def __init__(self):
-         self.model_type = 'llama'
-         self.vocab_size = 49152
-         self.hidden_size = 576
-         self.intermediate_size = 1536
-         self.num_hidden_layers = 30
-         self.num_attention_heads = 9
-         self.num_key_value_heads = 3
-         self.max_position_embeddings = 8192
-         self.rms_norm_eps = 1e-05
-         self.rope_theta = 100000
-         self.rope_scaling = None
-         self.bos_token_id = 0
-         self.eos_token_id = 0
-         self.pad_token_id = None
-         self.tie_word_embeddings = True
- ```
-
- The model is instantiated in the training script:
-
- ```python
- from transformers import LlamaForCausalLM, LlamaConfig
- from config import SmolLMConfig
-
- # Build config
- sm_cfg = SmolLMConfig()
- hf_config = LlamaConfig(
-     vocab_size=sm_cfg.vocab_size,
-     hidden_size=sm_cfg.hidden_size,
-     intermediate_size=sm_cfg.intermediate_size,
-     num_hidden_layers=sm_cfg.num_hidden_layers,
-     num_attention_heads=sm_cfg.num_attention_heads,
-     num_key_value_heads=sm_cfg.num_key_value_heads,
-     max_position_embeddings=sm_cfg.max_position_embeddings,
-     rms_norm_eps=sm_cfg.rms_norm_eps,
-     rope_theta=sm_cfg.rope_theta,
-     tie_word_embeddings=sm_cfg.tie_word_embeddings,
-     attn_implementation="sdpa",  # request SDPA (Flash Attention) via the config
- )
-
- # Create model with SDPA (Flash Attention)
- model = LlamaForCausalLM(hf_config)
- ```
-
- Modify the values in `SmolLMConfig` to change the model architecture.
-
- ## 🎯 Training Configuration
-
- Default training hyperparameters (combined into a loop sketch below):
-
- - **Optimizer:** AdamW
- - **Learning Rate:** 2e-4
- - **Sequence Length:** 256
- - **Max Steps:** 5000
- - **Batch Size:** 1 (per device, scales with Accelerate)
- - **Mixed Precision:** Enabled (bfloat16/float16 on CUDA)
-
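- Put together, these defaults imply a loop like the following. A self-contained sketch with random stand-in data (`accelerate_train.py` loads the real chunks):
-
- ```python
- import torch
- from accelerate import Accelerator
- from torch.utils.data import DataLoader
- from transformers import LlamaConfig, LlamaForCausalLM
-
- cfg = LlamaConfig(vocab_size=49152, hidden_size=576, intermediate_size=1536,
-                   num_hidden_layers=30, num_attention_heads=9, num_key_value_heads=3)
- model = LlamaForCausalLM(cfg)
- chunks = torch.randint(0, cfg.vocab_size, (1332, 256))  # stand-in for input.txt chunks
- loader = DataLoader(chunks, batch_size=1, shuffle=True)
-
- accelerator = Accelerator()  # mixed precision comes from `accelerate config`
- optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
- model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
-
- step, max_steps = 0, 5000
- while step < max_steps:
-     for batch in loader:
-         loss = model(input_ids=batch, labels=batch).loss  # causal LM loss
-         accelerator.backward(loss)
-         optimizer.step()
-         optimizer.zero_grad()
-         if step % 500 == 0:
-             accelerator.print(f"Step {step} | Loss {loss.item():.4f}")
-         step += 1
-         if step >= max_steps:
-             break
- ```
-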
- ### Expected Training Output
-
- When running `accelerate_train.py`, you should see output similar to:
-
- ```
- Using device: cuda
-
- ==================================================
- Model Parameters:
-   Total: 134,515,008 (134.52M)
-   Trainable: 134,515,008 (134.52M)
-   Non-trainable: 0 (0)
- ==================================================
-
- Loaded 1332 training chunks.
-
- Step 0 | Loss 10.8755
- Step 500 | Loss 4.8903
- Step 1000 | Loss 5.9240
- Step 1500 | Loss 5.1603
- Step 2000 | Loss 4.4749
- Step 2500 | Loss 4.6673
- Step 3000 | Loss 4.0769
- Step 3500 | Loss 4.8665
- Step 4000 | Loss 4.4102
- Step 4500 | Loss 3.5130
-
- Training complete. Checkpoint saved.
- ```
-
- **Training Notes:**
- - Initial loss starts around 10-11, as expected for a randomly initialized model: a uniform guess over the vocabulary gives ln(49,152) ≈ 10.8
- - Loss decreases over training steps, reaching ~3.5-4.5 after 5000 steps
- - The model is trained on 1,332 chunks of 256 tokens each from Coriolanus
- - The checkpoint is saved to the `checkpoint_5000/` directory
-
- ### Resuming Training
-
- To resume training from a checkpoint:
-
- ```bash
- accelerate launch accelerate_resume.py
- ```
-
- Expected output:
-
- ```
- Resuming from step 5000
- Extra step 0 | Loss: 18.6302
- Extra step 10 | Loss: 11.6444
- Extra step 20 | Loss: 10.9953
- Extra step 30 | Loss: 11.1277
- Extra step 40 | Loss: 11.2197
- Resume training complete.
- ```
-
- **Note:** When resuming, the loss may initially spike (as shown above) because the resume script uses synthetic data for demonstration. In production, you would load your actual training dataset.
-
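- The gist of a resume is reloading both the weights and the optimizer moments. A minimal sketch (the real script adds the Accelerate wiring and the step counter):
-
- ```python
- import torch
- from transformers import LlamaForCausalLM
-
- model = LlamaForCausalLM.from_pretrained("checkpoint_5000")
- optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
- optimizer.load_state_dict(torch.load("checkpoint_5000/optim.pt"))  # AdamW moments
- # ...continue the training loop from here, ideally on the real dataset.
- ```
-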
- ## 📝 Dataset Format
-
- The training script expects `input.txt` in the project root. The file should contain plain text, which is tokenized and chunked into sequences of length 256 for training (see the sketch below).
-
- **Current Dataset:** This model is fine-tuned exclusively on Shakespeare's **Coriolanus** (contained in `input.txt`). As a result, the model generates text in the style of a dramatic play, with character names, stage directions, and Shakespearean dialogue.
-
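- In outline, the load-and-chunk step is assumed to look like this (names are illustrative; see `accelerate_train.py` for the real function):
-
- ```python
- import torch
- from transformers import GPT2TokenizerFast
-
- tokenizer = GPT2TokenizerFast.from_pretrained("HuggingFaceTB/SmolLM2-135M")
- tokenizer.model_max_length = int(1e9)  # silence the max-length warning while loading
-
- text = open("input.txt", encoding="utf-8").read()
- ids = tokenizer(text, return_tensors="pt").input_ids[0]
-
- seq_len = 256
- n = (ids.numel() // seq_len) * seq_len  # drop the ragged tail
- chunks = ids[:n].view(-1, seq_len)      # e.g. 1332 chunks of 256 tokens
- print(f"Loaded {len(chunks)} training chunks.")
- ```
-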
- ## 🔧 Advanced Usage
-
- ### Multi-GPU Training
-
- Accelerate automatically handles multi-GPU training:
-
- ```bash
- # Use all available GPUs
- accelerate launch --multi_gpu accelerate_train.py
-
- # Use specific GPUs
- CUDA_VISIBLE_DEVICES=0,1 accelerate launch accelerate_train.py
- ```
-
- ### Custom Dataset
-
- Modify the `load_dataset` function in `accelerate_train.py` to load your custom dataset format.
-
- ### Checkpoint Management
-
- Checkpoints are saved in the format:
- ```
- checkpoint_5000/
- ├── config.json             # Model configuration
- ├── model.safetensors       # Model weights
- ├── generation_config.json  # Generation settings
- └── optim.pt                # Optimizer state
- ```
-
- ## 🐛 Troubleshooting
-
- ### Windows Autocast Issues
- Autocast is disabled on Windows by default. Training will still work but may be slower.
-
- ### Out of Memory
- - Reduce `seq_len` in the training script
- - Use gradient accumulation (see the sketch below)
- - Enable CPU offloading with Accelerate
-
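- Gradient accumulation with Accelerate, as a sketch (reusing the model, optimizer, and loader from the training setup; 8 steps is an illustrative value):
-
- ```python
- from accelerate import Accelerator
-
- accelerator = Accelerator(gradient_accumulation_steps=8)
- model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
-
- for batch in loader:
-     with accelerator.accumulate(model):  # syncs and steps once per 8 batches
-         loss = model(input_ids=batch, labels=batch).loss
-         accelerator.backward(loss)
-         optimizer.step()
-         optimizer.zero_grad()
- ```
-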
- ### Flash Attention Not Working
- If flash attention isn't available, the model will fall back to standard attention. This is handled automatically.
-
- ### Tokenizer Warnings (Fixed)
- The training script has been updated to handle common warnings:
-
- - **Sequence Length Warning**: If you see "sequence length is longer than max_position_embeddings", this is now automatically suppressed. The script temporarily increases the tokenizer's `model_max_length` during data loading, then chunks the data into 256-token sequences for training.
-
- - **Attention Mask Warnings**: Generation warnings about attention masks and pad tokens have been fixed by explicitly providing attention masks and pad token IDs during generation.
-
- These warnings were harmless but have been resolved for a cleaner training experience.
-
- ## 📚 References
-
- - [LLaMA Paper](https://arxiv.org/abs/2302.13971)
- - [SmolLM2 Model Card](https://huggingface.co/HuggingFaceTB/SmolLM2-135M)
- - [Hugging Face Transformers](https://huggingface.co/docs/transformers)
- - [Accelerate Documentation](https://huggingface.co/docs/accelerate)
-
- ## 📄 License
-
- This project uses the SmolLM2-135M model, which follows the original model's license. Please check the Hugging Face model card for licensing details.
-
- ## 🤝 Contributing
-
- Contributions are welcome! Please feel free to submit a Pull Request.
-
- ## 📧 Contact
-
- For questions or issues, please open an issue on the repository.
-
- ---
-
- **Note:** This is a training and inference framework. The model weights are trained from scratch or loaded from checkpoints. Make sure you have appropriate data and compute resources for training.
-
 
+ ---
+ title: SmolLM2-135M Coriolanus
+ emoji: 🎭
+ colorFrom: blue
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 4.0.0
+ app_file: app.py
+ pinned: false
+ ---
+
+ # SmolLM2-135M Coriolanus Text Generation
+
+ A lightweight language model (135M parameters) fine-tuned exclusively on Shakespeare's **Coriolanus**. The model writes in the style of a dramatic play, complete with character names, stage directions, and Shakespearean dialogue.
+
+ ## Features
+
+ - **135M Parameters** - Efficient and fast inference
+ - **Grouped Query Attention (GQA)** - Optimized attention mechanism
+ - **Flash Attention** - Fast attention computation
+ - **Interactive UI** - Easy-to-use Gradio interface
+
+ ## Model Architecture
+
+ - **Hidden Size:** 576
+ - **Layers:** 30
+ - **Attention Heads:** 9 (3 KV heads)
+ - **Vocabulary:** 49,152 tokens
+ - **Max Context:** 8,192 tokens
+
+ ## Usage
+
+ 1. Enter your prompt in the text box (try prompts like "CORIOLANUS:" or "Enter CORIOLANUS and MENENIUS")
+ 2. Adjust the generation parameters (temperature, top-p, etc.)
+ 3. Click "Generate" to create text in the style of a dramatic play
+
+ ## Parameters
+
+ - **Temperature:** Controls randomness (lower = more focused)
+ - **Top-p:** Nucleus sampling threshold
+ - **Top-k:** Limits sampling to the top k tokens
+ - **Repetition Penalty:** Reduces repetition (higher = less repetition)
+
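+ The same knobs can be exercised programmatically. An illustrative snippet (loads the base model here; swap in your fine-tuned repo id):
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
+ model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
+
+ inputs = tok("CORIOLANUS:", return_tensors="pt")
+ out = model.generate(**inputs, max_new_tokens=120, do_sample=True,
+                      temperature=0.8, top_p=0.9, top_k=50,
+                      repetition_penalty=1.2)
+ print(tok.decode(out[0], skip_special_tokens=True))
+ ```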