---
language:
- en
license: apache-2.0
tags:
- text-generation
- transformers
- pytorch
- custom-implementation
- language-model
- educational
library_name: transformers
pipeline_tag: text-generation
---

# SmolLM2-135M-Dissecting

A custom implementation of the SmolLM2-135M language model architecture, trained from scratch for educational purposes. This project demonstrates building a transformer-based language model with 147.8M parameters.

## Model Description

This is a **custom implementation** that mimics the SmolLM2-135M architecture. It was built from scratch to understand the inner workings of small language models and includes:

- Custom transformer blocks with multi-head attention
- Rotary Position Embeddings (RoPE)
- SwiGLU activation functions (see the sketch below)
- Layer normalization and residual connections
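
A minimal, hypothetical sketch of the SwiGLU feed-forward block named above, using the sizes from the Architecture Specifications further down (hidden size 576, intermediate size 1,536); the actual block lives in `model.py` and may differ in detail:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: down(silu(gate(x)) * up(x))."""

    def __init__(self, hidden_size: int = 576, intermediate_size: int = 1536):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: a SiLU-gated linear unit, as used in Llama-style blocks
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```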

**Note**: This is an educational implementation trained on a small dataset. For production use, consider the official [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M) model.

## Model Details

- **Model Type**: Causal Language Model (Decoder-only Transformer)
- **Architecture**: Custom SmolLM2-135M implementation
- **Total Parameters**: 147,821,184
- **Training Dataset**: Custom text dataset (1,115,394 characters)
- **Training Steps**: 5,000
- **Language**: English
- **License**: Apache 2.0

### Architecture Specifications

- **Vocabulary Size**: 49,152
- **Hidden Size**: 576
- **Number of Layers**: 30
- **Attention Heads**: 9
- **Intermediate Size**: 1,536
- **Max Position Embeddings**: 2,048
- **Head Dimension**: 64
- **Activation Function**: SwiGLU
- **Position Embedding**: Rotary Position Embedding (RoPE)
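
In code, these specifications map onto a configuration object roughly like the hypothetical sketch below; the real `ModelConfig` in `model.py` is the source of truth, and its field names may differ:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    vocab_size: int = 49152
    hidden_size: int = 576
    num_layers: int = 30
    num_heads: int = 9
    intermediate_size: int = 1536
    max_position_embeddings: int = 2048
    head_dim: int = 64  # hidden_size // num_heads = 576 // 9
```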

## Training Process

### Initialization

Training started with model initialization on CPU:

```
Using device: cpu
Initializing custom model...
Total parameters: 147,821,184
```

### Dataset Preparation

The official SmolLM2 tokenizer was downloaded and used to tokenize the input text:

```
Loading tokenizer...
tokenizer_config.json: 3.66kB [00:00, 2.50MB/s]
vocab.json: 801kB [00:00, 5.63MB/s]
merges.txt: 466kB [00:00, 5.45MB/s]
tokenizer.json: 2.10MB [00:00, 7.78MB/s]
```

The training dataset consisted of:

- 666 chunks of 512 tokens each (see the sketch below)
- Batch size: 4
- Steps per epoch: 167
- Total training steps: 5,000
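
The chunking can be reproduced roughly as follows (a minimal sketch; the actual logic in `train.py` may differ, e.g. in how the partial trailing chunk is handled):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

with open("input.txt") as f:
    text = f.read()  # 1,115,394 characters of training text

token_ids = tokenizer.encode(text)
seq_len = 512

# Split into non-overlapping 512-token chunks, dropping the partial tail
chunks = [token_ids[i:i + seq_len]
          for i in range(0, len(token_ids) - seq_len + 1, seq_len)]
print(f"{len(chunks)} chunks of {seq_len} tokens each")
```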

## Training Progress

### Loss Reduction Over Time

The model showed consistent improvement throughout training (the reduction is measured against the step-500 baseline; the initial loss is approximate, taken from the loss-curve summary below):

| Step | Loss | Change vs. step 500 |
|------|------|---------------------|
| 0 (initial) | ~9.6 | - |
| 500 | 4.6897 | Baseline |
| 1000 | 4.0074 | -14.5% |
| 1500 | 3.4715 | -26.0% |
| 2000 | 2.8648 | -38.9% |
| 2500 | 2.2658 | -51.7% |
| 3000 | 1.5617 | -66.7% |
| 3500 | 1.0885 | -76.8% |
| 4000 | 0.8004 | -82.9% |
| 4500 | 0.5178 | -89.0% |
| 5000 (final) | 0.3271 | -93.0% |

### Model Generation Quality Improvement

The model's text generation ability improved significantly:

**Step 0 (Before Training)**:
```
Generated: What is English Muscle Kelly flossing towardsimatingćBind outrageroutine dreTClywood loudly brightness hardships
```

**Step 500**:
```
Generated: What is Englishour.
HOLANIO:
My name you
To the king, I'll tell this in theREM;
```

**Step 1000**:
```
Generated: What is English's They knows no their place?
ISABELLA:
Speak me:
I am a grave to the maid and sh son.
```

**Step 2000**:
```
Generated: What is English'd to say theAnd I will come.
KING EDWARD IV:
Go, Warwick, in all my friends, my lords.
```

**Step 5000 (Final)**:
```
Generated: What is English quarter
To frame of the people to himself.
CAMILLO:
God and your noble lord,
She does do much need on't.
```

### Loss Convergence

The loss curve showed gradual but steady improvement:

- **Epochs 1-3**: Rapid initial decrease from ~9.6 to ~4.7
- **Epochs 4-10**: Continued improvement to ~3.9
- **Epochs 11-20**: Moderate improvement to ~2.0
- **Epochs 21-30**: Final optimization to ~0.3

## Model Architecture Verification

After training, the custom model's named parameters were compared against those of the official SmolLM2-135M:

```
Custom model parameters: 364
Official model parameters: 273
Matching parameters: 1
Only in custom: 363
Only in official: 272
```

With only one matching name, the two models use essentially disjoint parameter naming schemes: the custom implementation follows the SmolLM2-135M architecture but names its tensors differently, so its checkpoints cannot be loaded into the official model without remapping the keys.
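
A comparison along these lines can be reproduced with the sketch below, assuming `CustomSmolLM` and `ModelConfig` from this repo's `model.py` (the actual verification script may differ):

```python
from transformers import AutoModelForCausalLM
from model import CustomSmolLM, ModelConfig

# Collect the parameter names (state_dict keys) of both models
custom_keys = set(CustomSmolLM(ModelConfig()).state_dict().keys())
official = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
official_keys = set(official.state_dict().keys())

print(f"Custom model parameters: {len(custom_keys)}")
print(f"Official model parameters: {len(official_keys)}")
print(f"Matching parameters: {len(custom_keys & official_keys)}")
print(f"Only in custom: {len(custom_keys - official_keys)}")
print(f"Only in official: {len(official_keys - custom_keys)}")
```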

## Usage

### Loading the Model

```python
import torch
from model import CustomSmolLM, ModelConfig
from transformers import AutoTokenizer

# Initialize model configuration
config = ModelConfig()

# Load the model (map_location lets the checkpoint load on CPU-only machines)
model = CustomSmolLM(config)
state = torch.load('checkpoints/final_model.pt', map_location='cpu')
model.load_state_dict(state['model_state_dict'])
model.eval()

# Load tokenizer (uses the official SmolLM2 tokenizer)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
```

### Text Generation

```python
import torch
import torch.nn.functional as F

def generate_text(model, tokenizer, prompt, max_length=50, temperature=0.8):
    model.eval()
    device = next(model.parameters()).device

    # Tokenize prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

    with torch.no_grad():
        for _ in range(max_length):
            outputs = model(input_ids)
            logits = outputs['logits']

            # Sample the next token from the temperature-scaled distribution
            next_token_logits = logits[:, -1, :] / temperature
            probs = F.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            input_ids = torch.cat([input_ids, next_token], dim=1)

            if next_token.item() == tokenizer.eos_token_id:
                break

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Generate text
generated = generate_text(model, tokenizer, "Once upon a time", max_length=50)
print(generated)
```

### Resuming Training

```python
from train import load_checkpoint

# Resume from a checkpoint
model, checkpoint = load_checkpoint(model, 'checkpoints/checkpoint_step_500.pt')
print(f"Resumed from step {checkpoint['step']}")
```

## Training Configuration

- **Learning Rate**: 1e-4
- **Optimizer**: AdamW with betas (0.9, 0.95) (see the sketch below)
- **Weight Decay**: 0.1
- **Gradient Clipping**: 1.0
- **Batch Size**: 4
- **Sequence Length**: 512 tokens
- **Checkpoint Frequency**: Every 500 steps
- **Device**: CPU (GPU recommended for faster training)
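
Put together, the optimizer setup these values imply looks roughly like the sketch below; `train.py` is the source of truth, and `model` is the model instance from the Usage section:

```python
import torch

# AdamW with the hyperparameters listed above
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

# Inside the training loop, gradients are clipped before each optimizer step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```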

## Intended Uses

This model is designed for:

- Educational purposes and understanding transformer architectures
- Experimenting with small-scale language model training
- Learning about PyTorch implementations of modern LLM components
- Demonstrating custom model architecture development

## Limitations

- Trained on a small dataset (1.1M characters), limiting generalization
- Only 5,000 training steps, far fewer than production models receive
- No evaluation on standardized benchmarks
- Parameter naming diverges from the official SmolLM2-135M
- Not suitable for production use cases
- May produce inconsistent or incorrect text

## Ethical Considerations

This is an educational model trained on a small dataset. Users should:

- Not rely on it for factual information
- Be aware it may generate biased or inappropriate content
- Use it only for learning and experimentation
- Consider the official SmolLM2-135M for any serious applications

## Citation

If you use this implementation in your research or projects, please cite:

```bibtex
@misc{smollm2-135m-dissecting,
  title={SmolLM2-135M-Dissecting: A Custom Implementation for Educational Purposes},
  author={agileabhi},
  year={2025},
  howpublished={\url{https://huggingface.co/spaces/agileabhi/SmolLM2-135M-Model}}
}
```

Also consider citing the original SmolLM2 work from Hugging Face.

## Acknowledgments

- Based on the SmolLM2-135M architecture by Hugging Face
- Uses the official SmolLM2 tokenizer
- Inspired by modern transformer implementations

## Repository Structure

- `model.py`: Custom model architecture implementation
- `train.py`: Training script with checkpointing and evaluation
- `app.py`: Gradio demo interface
- `strip_weights.py`: Utility for model weight management
- `upload_to_spaces.py`: Hugging Face Spaces deployment script
- `checkpoints/`: Model checkpoints saved during training
- `input.txt`: Training data file

## Contact

For questions or issues, please open an issue on the GitHub repository.