Sualeh Qureshi committed on
Commit
c175ce3
·
0 Parent(s):

Committed the training code and model file

.gitignore ADDED
@@ -0,0 +1,14 @@
+ # Python-generated files
+ __pycache__/
+ *.py[oc]
+ build/
+ dist/
+ wheels/
+ *.egg-info
+
+ # Virtual environments
+ .venv
+
+ # Checkpoints
+ checkpoints/
+
.python-version ADDED
@@ -0,0 +1 @@
+ 3.12
README.md ADDED
File without changes
README_TRAINING.md ADDED
@@ -0,0 +1,110 @@
+ # SmolLM2-135M Training Guide
+
+ This directory contains the training code for the SmolLM2-135M model.
+
+ ## Files
+
+ - `model.py`: Model definition with KV cache support for inference
+ - `train.py`: Main training script (trains for 5000 steps)
+   - Run with a checkpoint path to resume training for 50 additional steps
+
+ ## Setup
+
+ Install the required packages:
+
+ ```bash
+ pip install torch lightning transformers tensorboard
+ ```
+
+ ## Training
+
+ ### Phase 1: Initial Training (5000 steps)
+
+ Run the main training script:
+
+ ```bash
+ python train.py
+ ```
+
+ This will:
+ - Train the model for 5000 steps
+ - Generate text predictions every 500 steps
+ - Save checkpoints every 500 steps
+ - Log training metrics to TensorBoard and a text file
+ - Save the final checkpoint at step 5000
+
+ ### Phase 2: Resume Training (50 additional steps)
+
+ After Phase 1 completes, run:
+
+ ```bash
+ python train.py
+ ```
+
+ This time, set the checkpoint path and set the step count so that training resumes for 50 additional steps; this showcases that training picks up exactly where it stopped. The sketch below shows the relevant settings.
+
+ This will:
+ - Load the checkpoint from Phase 1
+ - Train for 50 additional steps
+ - Save the final checkpoint
+
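+ The resume settings live at the top of `main()` in `train.py`. A minimal sketch of what to change (the checkpoint filename below is taken from the included logs; use one that actually exists in `./checkpoints/`):
+
+ ```python
+ # In train.py, main():
+ # Resuming at step 1500 with max_steps = 1550 trains exactly 50 more steps,
+ # because Lightning's max_steps caps the global step count.
+ max_steps = 1550
+ resume_from_checkpoint = "checkpoints/smollm2-step=01500-train_loss=3.6240.ckpt"
+ # Set resume_from_checkpoint = None for fresh training.
+ ```
+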
+ ## Training Configuration
+
+ The training uses the following hyperparameters (adapted from the SmolLM2 paper):
+
+ - **Optimizer**: AdamW with (β₁, β₂) = (0.9, 0.95)
+ - **Learning Rate Schedule**: Warmup Stable Decay (WSD), sketched below
+   - Warmup: 2000 steps in the paper; this run uses 1000 steps (20% of the 5000 total)
+   - Peak LR: 5.0 × 10⁻⁴
+   - Stable phase: maintains peak LR
+   - Decay: reduces to zero over the final 10% of total steps
+ - **Block size**: 512 tokens
+ - **Batch size**: 4
+ - **Precision**: bfloat16 mixed (if a CUDA GPU is available), float32 otherwise
+
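+ The WSD schedule is a simple piecewise-linear function of the step count. A minimal sketch mirroring the `WarmupStableDecayLR` callback in `train.py`:
+
+ ```python
+ def wsd_lr(step: int, warmup: int = 1000, peak: float = 5e-4, total: int = 5000) -> float:
+     """Warmup -> stable -> linear decay to zero over the last 10% of steps."""
+     decay = int(0.1 * total)        # length of the decay phase
+     stable_end = total - decay      # step at which decay begins
+     if step < warmup:
+         return peak * step / warmup                     # linear warmup
+     if step < stable_end:
+         return peak                                     # stable phase at peak LR
+     return peak * (1.0 - (step - stable_end) / decay)   # linear decay to zero
+ ```
+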
+ ## Outputs
+
+ - **Checkpoints**: Saved in `./checkpoints/`
+ - **TensorBoard logs**: Saved in `./logs/tensorboard/`
+ - **Text logs**: Saved in `./logs/training_*.log`
+
+ ## Model Features
+
+ The model includes:
+ - **KV Cache**: Efficient inference using key-value caching (see the sketch below)
+ - **Generation**: Text generation with top-k and top-p sampling
+ - **Checkpointing**: Full state saving for resuming training
+
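+ During generation, the prompt is processed once to fill the cache, and every later step feeds only the newest token together with the cached keys/values. A minimal greedy-decoding sketch against the `forward` interface in `model.py` (`generate` itself additionally applies temperature and top-k/top-p sampling):
+
+ ```python
+ # Prefill: run the full prompt once and keep the per-layer KV cache.
+ logits, past = model(input_ids, use_cache=True)
+ next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
+
+ # Decode: each step only processes the single newest token.
+ for _ in range(50):
+     logits, past = model(next_token, past_key_values=past, use_cache=True)
+     next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
+ ```
+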
+ ## Usage Example
+
+ ```python
+ import torch
+ from model import SmolLM2, SmolConfig
+ from transformers import AutoTokenizer, AutoConfig
+
+ # Load config
+ hf_config = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-135M")
+ config = SmolConfig.from_hf(hf_config)
+
+ # Create model
+ model = SmolLM2(config)
+
+ # Load a checkpoint (use a real filename from ./checkpoints/; torch.load does
+ # not expand glob patterns). weights_only=False because the checkpoint stores
+ # a SmolConfig object in its hyperparameters.
+ checkpoint = torch.load(
+     "checkpoints/smollm2-step=01500-train_loss=3.6240.ckpt",
+     map_location="cpu",
+     weights_only=False,
+ )
+ # The Lightning module stores the network under the "model." prefix; strip it
+ state_dict = {k.removeprefix("model."): v for k, v in checkpoint["state_dict"].items()}
+ model.load_state_dict(state_dict)
+
+ # Generate text
+ tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
+ prompt = "First Citizen:"
+ input_ids = tokenizer.encode(prompt, return_tensors='pt')
+
+ generated_ids = model.generate(
+     input_ids,
+     max_new_tokens=100,
+     temperature=0.8,
+     top_k=50,
+ )
+
+ generated_text = tokenizer.decode(generated_ids[0])
+ print(generated_text)
+ ```
logs/tensorboard/version_0/events.out.tfevents.1765268407.MAC-QNYQPC2R2T.88043.0 ADDED
Binary file (5.59 kB)
 
logs/tensorboard/version_0/hparams.yaml ADDED
@@ -0,0 +1,5 @@
+ block_size: 512
+ peak_lr: 0.0005
+ predict_every: 500
+ total_steps: 5000
+ warmup_steps: 1000
logs/tensorboard/version_1/events.out.tfevents.1765274926.MAC-QNYQPC2R2T.7268.0 ADDED
Binary file (88 Bytes)
 
logs/tensorboard/version_2/events.out.tfevents.1765275552.MAC-QNYQPC2R2T.7768.0 ADDED
Binary file (2.8 kB)
 
logs/tensorboard/version_2/hparams.yaml ADDED
@@ -0,0 +1,5 @@
+ block_size: 512
+ peak_lr: 0.0005
+ predict_every: 500
+ total_steps: 3500
+ warmup_steps: 1000
logs/training_20251209_135005.log ADDED
@@ -0,0 +1,54 @@
+ 2025-12-09 13:50:05,106 - INFO - Logging to: logs/training_20251209_135005.log
+ 2025-12-09 13:50:05,106 - INFO - Loading tokenizer...
+ 2025-12-09 13:50:05,965 - INFO - Loading model config...
+ 2025-12-09 13:50:06,205 - INFO - Loading dataset from: /Users/qureshsu/Learning/TSAI/ERAV4/session13/data/input.txt
+ 2025-12-09 13:50:06,657 - INFO - Initializing model...
+ 2025-12-09 13:50:07,391 - INFO - Starting training...
+ 2025-12-09 13:50:24,556 - INFO -
+ ================================================================================
+ 2025-12-09 13:50:24,557 - INFO - MODEL SUMMARY
+ 2025-12-09 13:50:24,557 - INFO - ================================================================================
+ 2025-12-09 13:50:24,557 - INFO - Model: SmolLM2-135M
+ 2025-12-09 13:50:24,557 - INFO - Total parameters: 134,515,008
+ 2025-12-09 13:50:24,557 - INFO - Trainable parameters: 134,515,008
+ 2025-12-09 13:50:24,557 - INFO - Block size: 512
+ 2025-12-09 13:50:24,557 - INFO - Warmup steps: 1000
+ 2025-12-09 13:50:24,557 - INFO - Peak learning rate: 0.0005
+ 2025-12-09 13:50:24,557 - INFO - Total training steps: 5000
+ 2025-12-09 13:50:24,557 - INFO - Predict every: 500 steps
+ 2025-12-09 13:50:24,557 - INFO - ================================================================================
+
+ 2025-12-09 14:05:59,075 - INFO -
+ ================================================================================
+ 2025-12-09 14:05:59,081 - INFO - Step 500 - Generated text:
+ 2025-12-09 14:05:59,081 - INFO - First Citizen:
+ WhatONEONE:
+ DUKE VINCENTIO:
+ DUKE VINCENTIO:
+ Nay, thou art thou pow pow pow pow pow pow pow pow pow pow pow pow pow pow pow pow pow pow
+ 2025-12-09 14:05:59,081 - INFO - ================================================================================
+
+ 2025-12-09 14:21:21,767 - INFO -
+ ================================================================================
+ 2025-12-09 14:21:21,771 - INFO - Step 1000 - Generated text:
+ 2025-12-09 14:21:21,771 - INFO - First Citizen:
+ And then, like thee: thou hast thou dost in thy husband'st:
+ And in thy soldiers, not in thy master's name,
+ Which then in thy shame: I did thy shame,
+ Which thou doth know her
+ 2025-12-09 14:21:21,771 - INFO - ================================================================================
+
+ 2025-12-09 14:37:17,744 - INFO -
+ ================================================================================
+ 2025-12-09 14:37:17,748 - INFO - Step 1500 - Generated text:
+ 2025-12-09 14:37:17,748 - INFO - First Citizen:
+ I have done a'rt too that, if the king had title to the
+ Where it shall be the is born to be in the tongue.
+
+ Second Citizen:
+ And so shall I.
+
+ ANTONIO:
+ I
+ 2025-12-09 14:37:17,748 - INFO - ================================================================================
+
logs/training_20251209_154910.log ADDED
@@ -0,0 +1,35 @@
+ 2025-12-09 15:49:10,023 - INFO - Logging to: logs/training_20251209_154910.log
+ 2025-12-09 15:49:10,023 - INFO - Loading tokenizer...
+ 2025-12-09 15:49:10,936 - INFO - Loading model config...
+ 2025-12-09 15:49:11,184 - INFO - Loading dataset from: /Users/qureshsu/Learning/TSAI/ERAV4/session13/data/input.txt
+ 2025-12-09 15:49:11,623 - INFO - Initializing model...
+ 2025-12-09 15:49:12,354 - INFO - Starting training...
+ 2025-12-09 15:49:12,357 - INFO - Resuming from checkpoint: checkpoints/smollm2-step=01500-train_loss=3.6240.ckpt
+ 2025-12-09 15:49:30,901 - INFO -
+ ================================================================================
+ 2025-12-09 15:49:30,901 - INFO - MODEL SUMMARY
+ 2025-12-09 15:49:30,901 - INFO - ================================================================================
+ 2025-12-09 15:49:30,901 - INFO - Model: SmolLM2-135M
+ 2025-12-09 15:49:30,901 - INFO - Total parameters: 134,515,008
+ 2025-12-09 15:49:30,901 - INFO - Trainable parameters: 134,515,008
+ 2025-12-09 15:49:30,901 - INFO - Block size: 512
+ 2025-12-09 15:49:30,901 - INFO - Warmup steps: 1000
+ 2025-12-09 15:49:30,901 - INFO - Peak learning rate: 0.0005
+ 2025-12-09 15:49:30,901 - INFO - Total training steps: 3500
+ 2025-12-09 15:49:30,901 - INFO - Predict every: 500 steps
+ 2025-12-09 15:49:30,901 - INFO - ================================================================================
+
+ 2025-12-09 15:59:45,441 - INFO - Step 2000 | train_loss=0.9070
+ 2025-12-09 15:59:47,487 - INFO -
+ ================================================================================
+ 2025-12-09 15:59:47,487 - INFO - Step 2000 - Generated text:
+ 2025-12-09 15:59:47,488 - INFO - First Citizen:
+ Why, no; but the Hortenspur, and
+ To perricks. Thou art said so when a king
+ Hadst thouable to be ruled, and not to forget
+ At any man.
+
+ First Citizen:
+ None,
+ 2025-12-09 15:59:47,488 - INFO - ================================================================================
+
main.py ADDED
@@ -0,0 +1,6 @@
+ def main():
+     print("Hello from smollm-135!")
+
+
+ if __name__ == "__main__":
+     main()
model.py ADDED
@@ -0,0 +1,589 @@
+
+ # Minimal SmolLM2-135M style model implemented in PyTorch.
+ # Architecture: LLaMA-style decoder-only Transformer with:
+ #   - RMSNorm
+ #   - RoPE positional encoding
+ #   - SwiGLU MLP
+ #   - Grouped (GQA/MQA) attention: num_attention_heads != num_key_value_heads
+ #
+ # This file is self-contained (except PyTorch) and can be used as:
+ #
+ #   from transformers import AutoConfig
+ #   from model import SmolConfig, SmolLM2
+ #
+ #   cfg = SmolConfig.from_hf(AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-135M"))
+ #   model = SmolLM2(cfg)
+
+ from dataclasses import dataclass
+ from typing import Optional, Tuple, List
+
+ import math
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ # =========================
+ # 1. Config
+
+ # Got the config from HuggingFace using: transformers.AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-135M")
+
+ # Config: SmolLM2-135M
+
+ # LlamaConfig {
+ #   "architectures": [
+ #     "LlamaForCausalLM"
+ #   ],
+ #   "attention_bias": false,
+ #   "attention_dropout": 0.0,
+ #   "bos_token_id": 0,
+ #   "dtype": "bfloat16",
+ #   "eos_token_id": 0,
+ #   "head_dim": 64,
+ #   "hidden_act": "silu",
+ #   "hidden_size": 576,
+ #   "initializer_range": 0.041666666666666664,
+ #   "intermediate_size": 1536,
+ #   "is_llama_config": true,
+ #   "max_position_embeddings": 8192,
+ #   "mlp_bias": false,
+ #   "model_type": "llama",
+ #   "num_attention_heads": 9,
+ #   "num_hidden_layers": 30,
+ #   "num_key_value_heads": 3,
+ #   "pretraining_tp": 1,
+ #   "rms_norm_eps": 1e-05,
+ #   "rope_interleaved": false,
+ #   "rope_scaling": null,
+ #   "rope_theta": 100000,
+ #   "tie_word_embeddings": true,
+ #   "transformers_version": "4.57.3",
+ #   "use_cache": true,
+ #   "vocab_size": 49152
+ # }
+ # =========================
+
+ @dataclass
+ class SmolConfig:
+     # Core dimensions
+     vocab_size: int = 49152               # "vocab_size"
+     hidden_size: int = 576                # "hidden_size"
+     intermediate_size: int = 1536         # "intermediate_size"
+     num_hidden_layers: int = 30           # "num_hidden_layers"
+     num_attention_heads: int = 9          # "num_attention_heads"
+     num_key_value_heads: int = 3          # "num_key_value_heads"
+     max_position_embeddings: int = 8192   # "max_position_embeddings"
+
+     # Positional / RoPE
+     rope_theta: float = 100000.0          # "rope_theta"
+
+     # Norm / numerical
+     rms_norm_eps: float = 1e-5            # "rms_norm_eps"
+
+     # Biases
+     attention_bias: bool = False          # "attention_bias"
+     mlp_bias: bool = False                # "mlp_bias"
+
+     # Misc
+     dtype: torch.dtype = torch.bfloat16
+
+     @property
+     def head_dim(self) -> int:
+         # Should be 64 for SmolLM2-135M (576 / 9).
+         return self.hidden_size // self.num_attention_heads  # 576 / 9 = 64
+
+     @classmethod
+     def from_hf(cls, hf_config) -> "SmolConfig":
+         """
+         Helper to build this config from a transformers LlamaConfig (which is the
+         config class of the HuggingFace SmolLM2-135M model).
+         Example:
+             from transformers import AutoConfig
+             hf = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-135M")
+             cfg = SmolConfig.from_hf(hf)
+         Then pass the resulting config to the SmolLM2 constructor.
+         """
+         return cls(
+             vocab_size=hf_config.vocab_size,
+             hidden_size=hf_config.hidden_size,
+             intermediate_size=hf_config.intermediate_size,
+             num_hidden_layers=hf_config.num_hidden_layers,
+             num_attention_heads=hf_config.num_attention_heads,
+             num_key_value_heads=getattr(hf_config, "num_key_value_heads",
+                                         hf_config.num_attention_heads),
+             max_position_embeddings=hf_config.max_position_embeddings,
+             rope_theta=getattr(hf_config, "rope_theta", 10000.0),
+             rms_norm_eps=hf_config.rms_norm_eps,
+             attention_bias=getattr(hf_config, "attention_bias", False),
+             mlp_bias=getattr(hf_config, "mlp_bias", False),
+             dtype=torch.bfloat16,  # SmolLM2 uses bfloat16
+         )
+
+ # =========================
+ # 2. RMSNorm
+ # =========================
+
+ class RMSNorm(nn.Module):
+     """
+     Root Mean Square Layer Normalization (RMSNorm).
+     Used in LLaMA / SmolLM2 instead of LayerNorm.
+     """
+     def __init__(self, dim: int, eps: float = 1e-5):
+         super().__init__()
+         self.eps = eps
+         self.weight = nn.Parameter(torch.ones(dim))
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         # x: (..., dim)
+         # rms = sqrt(mean(x^2)); we use rsqrt for stability
+         norm = x.pow(2).mean(dim=-1, keepdim=True)
+         x = x * torch.rsqrt(norm + self.eps)
+         return self.weight * x
+
+ # =========================
+ # 3. RoPE (Rotary Positional Embeddings)
+ # =========================
+
+ def rope_freqs(head_dim: int, base: float, device, dtype):
+     """
+     Compute inverse frequencies for RoPE.
+     """
+     half_dim = head_dim // 2
+     # Equivalent to: base^{ -2i / d }
+     freq_seq = torch.arange(half_dim, device=device, dtype=dtype)
+     inv_freq = 1.0 / (base ** (freq_seq / half_dim))
+     return inv_freq  # shape: (half_dim,)
+
+ def build_rope_cache(
+     seq_len: int,
+     head_dim: int,
+     base: float,
+     device,
+     dtype,
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
+     """
+     Build cosine and sine caches for RoPE.
+     Returns:
+         cos: (1, 1, seq_len, head_dim/2)
+         sin: (1, 1, seq_len, head_dim/2)
+     """
+     inv_freq = rope_freqs(head_dim, base, device, dtype)   # (half_dim,)
+     # Positions
+     t = torch.arange(seq_len, device=device, dtype=dtype)  # (seq_len,)
+     freqs = torch.outer(t, inv_freq)                       # (seq_len, half_dim)
+     cos = freqs.cos()[None, None, :, :]                    # (1,1,seq_len,half_dim)
+     sin = freqs.sin()[None, None, :, :]                    # (1,1,seq_len,half_dim)
+     return cos, sin
+
+ def apply_rope(
+     x: torch.Tensor,    # (B, n_head, T, head_dim)
+     cos: torch.Tensor,
+     sin: torch.Tensor,
+ ) -> torch.Tensor:
+     """
+     Apply RoPE to the last dimension of x.
+     cos, sin are broadcast to match (..., head_dim/2).
+     """
+     b, h, t, d = x.shape
+     half = d // 2
+
+     x1 = x[..., :half]  # (B, n_head, T, head_dim/2)
+     x2 = x[..., half:]  # (B, n_head, T, head_dim/2)
+
+     # cos/sin: (1,1,T,half) -> broadcast over B,h
+     cos_t = cos[..., :t, :]
+     sin_t = sin[..., :t, :]
+
+     x1_rot = x1 * cos_t - x2 * sin_t
+     x2_rot = x1 * sin_t + x2 * cos_t
+
+     return torch.cat([x1_rot, x2_rot], dim=-1)  # (B, n_head, T, head_dim)
+
+ # =========================
+ # 4. Attention
+ # =========================
+
+ class MultiHeadSelfAttention(nn.Module):
+     """
+     LLaMA / SmolLM2-style attention with:
+       - Q heads = num_attention_heads
+       - K/V heads = num_key_value_heads (GQA/MQA)
+       - RoPE on Q and K
+       - Causal masking
+     """
+     def __init__(self, config: SmolConfig):
+         super().__init__()
+
+         self.config = config
+         self.n_heads = config.num_attention_heads      # 9
+         self.n_kv_heads = config.num_key_value_heads   # 3
+         self.head_dim = config.head_dim                # 64
+         self.hidden_size = config.hidden_size          # 576
+
+         assert self.hidden_size == self.n_heads * self.head_dim
+
+         # Projections
+         self.q_proj = nn.Linear(
+             self.hidden_size,
+             self.n_heads * self.head_dim,
+             bias=config.attention_bias,
+         )
+         self.k_proj = nn.Linear(
+             self.hidden_size,
+             self.n_kv_heads * self.head_dim,
+             bias=config.attention_bias,
+         )
+         self.v_proj = nn.Linear(
+             self.hidden_size,
+             self.n_kv_heads * self.head_dim,
+             bias=config.attention_bias,
+         )
+
+         self.o_proj = nn.Linear(
+             self.n_heads * self.head_dim,
+             self.hidden_size,
+             bias=config.attention_bias,
+         )
+
+     def forward(
+         self,
+         x: torch.Tensor,    # (B, T, C) or (B, 1, C) for inference
+         cos: torch.Tensor,  # (1,1,T,head_dim/2) or (1,1,1,head_dim/2) for inference
+         sin: torch.Tensor,  # (1,1,T,head_dim/2) or (1,1,1,head_dim/2) for inference
+         attention_mask: Optional[torch.Tensor] = None,  # (B, T) or (B,1,1,T)
+         past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # (k_cache, v_cache)
+         use_cache: bool = False,
+     ) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor, torch.Tensor]]]:
+         B, T, C = x.shape
+
+         # Projections: (B,T,C) -> (B,T,h*d) -> (B,T,h,d) -> (B,h,T,d)
+         q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
+         k = self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
+         v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
+
+         # Apply RoPE to Q and K
+         q = apply_rope(q, cos, sin)  # (B, n_heads, T, d)
+         k = apply_rope(k, cos, sin)  # (B, n_kv_heads, T, d)
+         # v doesn't need RoPE
+
+         # If using the KV cache, concatenate with past keys/values
+         if past_key_value is not None:
+             past_k, past_v = past_key_value
+             # past_k, past_v: (B, n_kv_heads, past_len, head_dim)
+             k = torch.cat([past_k, k], dim=2)  # (B, n_kv_heads, past_len + T, head_dim)
+             v = torch.cat([past_v, v], dim=2)  # (B, n_kv_heads, past_len + T, head_dim)
+             seq_len = k.shape[2]
+         else:
+             seq_len = T
+
+         # Store k, v for the cache (before GQA expansion)
+         k_cache = k  # (B, n_kv_heads, seq_len, head_dim)
+         v_cache = v  # (B, n_kv_heads, seq_len, head_dim)
+
+         # GQA: expand K/V if num_kv_heads < num_heads
+         if self.n_kv_heads != self.n_heads:
+             repeat_factor = self.n_heads // self.n_kv_heads
+             k = k.repeat_interleave(repeat_factor, dim=1)  # (B, n_kv_heads, seq_len, d) -> (B, n_heads, seq_len, d)
+             v = v.repeat_interleave(repeat_factor, dim=1)  # (B, n_kv_heads, seq_len, d) -> (B, n_heads, seq_len, d)
+
+         # Attention scores: (B,h,T,d) @ (B,h,d,seq_len) -> (B,h,T,seq_len)
+         scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
+
+         # Causal mask: prevent attending to future tokens.
+         # For inference with a KV cache we generate one token at a time (T=1), so the
+         # current token may attend to all past positions and no extra mask is needed.
+         if past_key_value is None:
+             # Full sequence: mask all future positions
+             causal_mask = torch.full(
+                 (T, T), float("-inf"), device=x.device, dtype=x.dtype
+             ).triu(1)  # upper triangle (i < j)
+             scores = scores + causal_mask.unsqueeze(0).unsqueeze(0)  # (B,h,T,T) + (1,1,T,T)
+
+         # Optional attention mask (e.g., padding). Should be additive (0 or -inf).
+         if attention_mask is not None:
+             # Expect attention_mask as (B, 1, 1, seq_len) or (B, seq_len)
+             if attention_mask.dim() == 2:
+                 # (B, seq_len) -> (B,1,1,seq_len)
+                 attention_mask = attention_mask[:, None, None, :]
+             # Adjust mask shape if needed
+             if attention_mask.shape[-1] != seq_len:
+                 # For inference, we might need to extend the mask
+                 if past_key_value is not None:
+                     # Extend mask to include past positions (all 0s for past, current mask for new token)
+                     past_len = past_k.shape[2]
+                     extended_mask = torch.zeros(B, 1, 1, seq_len, device=attention_mask.device, dtype=attention_mask.dtype)
+                     extended_mask[..., past_len:] = attention_mask[..., -T:]
+                     attention_mask = extended_mask
+             scores = scores + attention_mask
+
+         # Softmax over the last dim (seq_len)
+         probs = F.softmax(scores, dim=-1)  # (B,h,T,seq_len)
+
+         # Weighted sum of values
+         out = torch.matmul(probs, v)  # (B,h,T,seq_len) @ (B,h,seq_len,d) -> (B,h,T,d)
+
+         # Reshape back: (B,h,T,d) -> (B,T,h,d) -> (B,T,C)
+         out = out.transpose(1, 2).contiguous().view(B, T, C)
+         out = self.o_proj(out)  # (B,T,C) -> (B,T,C)
+
+         # Return output and optionally the new KV cache
+         present_key_value = None
+         if use_cache:
+             # Return k_cache, v_cache (before GQA expansion, after RoPE)
+             present_key_value = (k_cache, v_cache)
+
+         return out, present_key_value
+
+ # =========================
+ # 5. MLP (SwiGLU)
+ # =========================
+ class SmolMLP(nn.Module):
+     """
+     SwiGLU MLP:
+         z = W1(x) -> split -> (x1, x2)
+         out = W2( SiLU(x1) * x2 )
+     """
+     def __init__(self, config: SmolConfig):
+         super().__init__()
+
+         self.fc1 = nn.Linear(
+             config.hidden_size,
+             2 * config.intermediate_size,  # for the SwiGLU split (2 x 1536 = 3072)
+             bias=config.mlp_bias,
+         )
+
+         self.fc2 = nn.Linear(
+             config.intermediate_size,  # 1536
+             config.hidden_size,        # 576
+             bias=config.mlp_bias,
+         )
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         x = self.fc1(x)                   # (B,T,576) -> (B,T,3072)
+         x1, x2 = x.chunk(2, dim=-1)       # (B,T,3072) -> (B,T,1536), (B,T,1536)
+         return self.fc2(F.silu(x1) * x2)  # (B,T,1536) -> (B,T,576)
+
+
+ # =========================
+ # 6. Transformer Block
+ # =========================
+ class SmolBlock(nn.Module):
+     def __init__(self, config: SmolConfig):
+         super().__init__()
+         self.attn_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.attn = MultiHeadSelfAttention(config)
+         self.mlp_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.mlp = SmolMLP(config)
+
+     def forward(
+         self,
+         x: torch.Tensor,
+         cos: torch.Tensor,
+         sin: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+         use_cache: bool = False,
+     ) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor, torch.Tensor]]]:
+         # Pre-norm + residual for attention
+         attn_out, present_key_value = self.attn(
+             self.attn_norm(x), cos, sin, attention_mask, past_key_value, use_cache
+         )
+         x = x + attn_out
+         # Pre-norm + residual for MLP
+         x = x + self.mlp(self.mlp_norm(x))
+         return x, present_key_value
+
+ # =============================================
+ # 7. Top-level SmolLM2-135M Model Architecture
+ # SmolLM2 follows the LLaMA-style decoder-only Transformer architecture.
+ # =============================================
+ class SmolLM2(nn.Module):
+     """
+     SmolLM2-135M-style LLaMA decoder-only language model.
+
+     Usage:
+         cfg = SmolConfig()
+         model = SmolLM2(cfg)
+
+         input_ids: LongTensor (B, T)
+         logits = model(input_ids)
+     """
+     def __init__(self, config: SmolConfig):
+         super().__init__()
+         self.config = config
+
+         self.embed_tokens = nn.Embedding(
+             config.vocab_size,
+             config.hidden_size,
+         )  # (vocab_size, hidden_size) = (49152, 576)
+
+         self.layers = nn.ModuleList(
+             [SmolBlock(config) for _ in range(config.num_hidden_layers)]
+         )
+
+         self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+         self.lm_head = nn.Linear(
+             config.hidden_size,
+             config.vocab_size,
+             bias=False,
+         )  # (hidden_size, vocab_size) = (576, 49152)
+
+         # tie weights
+         self.lm_head.weight = self.embed_tokens.weight
+
+     def forward(
+         self,
+         input_ids: torch.Tensor,  # (B, T)
+         attention_mask: Optional[torch.Tensor] = None,
+         past_key_values: Optional[List[Tuple[torch.Tensor, torch.Tensor]]] = None,
+         use_cache: bool = False,
+     ) -> Tuple[torch.Tensor, Optional[List[Tuple[torch.Tensor, torch.Tensor]]]]:
+         B, T = input_ids.shape
+
+         # For inference with a KV cache, we might have T=1
+         if past_key_values is None:
+             assert T <= self.config.max_position_embeddings, (
+                 f"Sequence length {T} exceeds max_position_embeddings "
+                 f"{self.config.max_position_embeddings}"
+             )
+             seq_len = T
+         else:
+             # With a KV cache, the current sequence length is past_len + T
+             past_len = past_key_values[0][0].shape[2] if past_key_values[0] is not None else 0
+             seq_len = past_len + T
+             assert seq_len <= self.config.max_position_embeddings, (
+                 f"Total sequence length {seq_len} exceeds max_position_embeddings "
+                 f"{self.config.max_position_embeddings}"
+             )
+
+         # Embedding
+         x = self.embed_tokens(input_ids)  # (B,T) -> (B,T,C)
+
+         # RoPE cache - build for the full sequence length (past + current)
+         cos, sin = build_rope_cache(
+             seq_len=seq_len,
+             head_dim=self.config.head_dim,
+             base=self.config.rope_theta,
+             device=x.device,
+             dtype=x.dtype,
+         )
+
+         # If using the KV cache, we only need cos/sin for the current positions
+         if past_key_values is not None:
+             past_len = past_key_values[0][0].shape[2] if past_key_values[0] is not None else 0
+             # Slice to get only the current positions for RoPE
+             cos = cos[..., past_len:, :]
+             sin = sin[..., past_len:, :]
+
+         # Layers
+         present_key_values = [] if use_cache else None
+         for i, layer in enumerate(self.layers):
+             past_kv = past_key_values[i] if past_key_values is not None else None
+             x, present_kv = layer(x, cos, sin, attention_mask, past_kv, use_cache)
+             if use_cache:
+                 present_key_values.append(present_kv)
+
+         # Final norm + LM head
+         x = self.norm(x)
+         logits = self.lm_head(x)  # (B,T,C) -> (B,T,vocab_size)
+         return logits, present_key_values
+
+     @torch.no_grad()
+     def generate(
+         self,
+         input_ids: torch.Tensor,
+         max_new_tokens: int = 100,
+         temperature: float = 1.0,
+         top_k: Optional[int] = None,
+         top_p: Optional[float] = None,
+         eos_token_id: Optional[int] = None,
+     ) -> torch.Tensor:
+         """
+         Generate text using the KV cache for efficient inference.
+
+         Args:
+             input_ids: (B, T) input token ids
+             max_new_tokens: maximum number of new tokens to generate
+             temperature: sampling temperature
+             top_k: top-k sampling (keep the top k tokens)
+             top_p: nucleus sampling (keep tokens with cumulative probability <= top_p)
+             eos_token_id: end-of-sequence token id (stop generation when encountered)
+
+         Returns:
+             generated_ids: (B, T + max_new_tokens) generated token ids
+         """
+         self.eval()
+         device = input_ids.device
+         B, T = input_ids.shape
+
+         # Start with input_ids
+         generated_ids = input_ids.clone()
+         past_key_values = None
+
+         for step in range(max_new_tokens):
+             # Forward pass with the KV cache. On the first iteration, use the full
+             # input_ids; on subsequent iterations, use only the last token.
+             if past_key_values is None:
+                 # First iteration: process the full sequence
+                 current_input = generated_ids
+             else:
+                 # Subsequent iterations: only process the last generated token
+                 current_input = generated_ids[:, -1:]
+
+             logits, past_key_values = self.forward(
+                 input_ids=current_input,
+                 past_key_values=past_key_values,
+                 use_cache=True,
+             )
+
+             # Get logits for the last token (always the last position in logits)
+             next_token_logits = logits[:, -1, :] / temperature
+
+             # Apply top-k filtering
+             if top_k is not None:
+                 indices_to_remove = next_token_logits < torch.topk(next_token_logits, top_k)[0][..., -1, None]
+                 next_token_logits[indices_to_remove] = float('-inf')
+
+             # Apply top-p (nucleus) filtering
+             if top_p is not None:
+                 sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
+                 cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+
+                 # Remove tokens with cumulative probability above the threshold
+                 sorted_indices_to_remove = cumulative_probs > top_p
+                 sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+                 sorted_indices_to_remove[..., 0] = 0
+
+                 indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
+                 next_token_logits[indices_to_remove] = float('-inf')
+
+             # Sample the next token
+             probs = F.softmax(next_token_logits, dim=-1)
+             next_token = torch.multinomial(probs, num_samples=1)  # (B, 1)
+
+             # Append to the generated sequence
+             generated_ids = torch.cat([generated_ids, next_token], dim=1)
+
+             # Check for the EOS token
+             if eos_token_id is not None and (next_token == eos_token_id).all():
+                 break
+
+         return generated_ids
+
+ # =========================
+ # 8. Quick self-test
+ # =========================
+ if __name__ == "__main__":
+     # Tiny sanity check: runs a forward pass on random input
+     cfg = SmolConfig()
+     model = SmolLM2(cfg)
+
+     B, T = 2, 16
+     x = torch.randint(0, cfg.vocab_size, (B, T))
+
+     with torch.no_grad():
+         logits, _ = model(x)
+
+     print("Input shape :", x.shape)
+     print("Logits shape:", logits.shape)  # should be (2, 16, vocab_size)
pyproject.toml ADDED
@@ -0,0 +1,17 @@
+ [project]
+ name = "smollm-135"
+ version = "0.1.0"
+ description = "Add your description here"
+ readme = "README.md"
+ requires-python = ">=3.12"
+ dependencies = [
+     "lightning>=2.6.0",
+     "tensorboard>=2.20.0",
+     "torch>=2.9.1",
+     "torchinfo>=1.8.0",
+     "torchmetrics>=1.8.2",
+     "torchsummary>=1.5.1",
+     "torchvision>=0.24.1",
+     "tqdm>=4.67.1",
+     "transformers>=4.57.3",
+ ]
test_model_implementation.py ADDED
@@ -0,0 +1,187 @@
+ import sys
+ import torch
+ from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
+
+ from model import SmolLM2, SmolConfig  # our implementation
+
+
+ PRETRAINED_NAME = "HuggingFaceTB/SmolLM2-135M"
+
+
+ def build_custom_model():
+     """Create our SmolLM2 using the HF config to ensure identical hyperparams."""
+     hf_cfg = AutoConfig.from_pretrained(PRETRAINED_NAME)
+     cfg = SmolConfig.from_hf(hf_cfg)
+     model = SmolLM2(cfg)
+     return model, cfg
+
+
+ def build_hf_model():
+     """Load the reference HF model."""
+     hf_model = AutoModelForCausalLM.from_pretrained(
+         PRETRAINED_NAME,
+         torch_dtype=torch.float32,  # use float32 for easier comparison
+     )
+     hf_model.eval()
+     return hf_model
+
+
+ def load_weights_from_hf(custom_model: SmolLM2, hf_model: AutoModelForCausalLM):
+     """
+     Map HF LlamaForCausalLM weights into our SmolLM2 model.
+
+     - HF model structure: hf_model.model (LlamaModel) + hf_model.lm_head
+     - Our model: embed_tokens, layers, norm, lm_head
+     """
+     hf_state = hf_model.state_dict()
+     custom_state = custom_model.state_dict()
+
+     # 1. Embeddings
+     custom_state["embed_tokens.weight"] = hf_state["model.embed_tokens.weight"]
+
+     # 2. Per-layer mappings
+     num_layers = custom_model.config.num_hidden_layers
+
+     for i in range(num_layers):
+         # Norms
+         custom_state[f"layers.{i}.attn_norm.weight"] = hf_state[
+             f"model.layers.{i}.input_layernorm.weight"
+         ]
+         custom_state[f"layers.{i}.mlp_norm.weight"] = hf_state[
+             f"model.layers.{i}.post_attention_layernorm.weight"
+         ]
+
+         # Attention projections
+         custom_state[f"layers.{i}.attn.q_proj.weight"] = hf_state[
+             f"model.layers.{i}.self_attn.q_proj.weight"
+         ]
+         custom_state[f"layers.{i}.attn.k_proj.weight"] = hf_state[
+             f"model.layers.{i}.self_attn.k_proj.weight"
+         ]
+         custom_state[f"layers.{i}.attn.v_proj.weight"] = hf_state[
+             f"model.layers.{i}.self_attn.v_proj.weight"
+         ]
+         custom_state[f"layers.{i}.attn.o_proj.weight"] = hf_state[
+             f"model.layers.{i}.self_attn.o_proj.weight"
+         ]
+
+         # MLP: HF has gate_proj, up_proj, down_proj
+         gate = hf_state[f"model.layers.{i}.mlp.gate_proj.weight"]
+         up = hf_state[f"model.layers.{i}.mlp.up_proj.weight"]
+         down = hf_state[f"model.layers.{i}.mlp.down_proj.weight"]
+
+         # Our fc1 is [gate; up] concatenated along the output dim (dim=0)
+         custom_state[f"layers.{i}.mlp.fc1.weight"] = torch.cat([gate, up], dim=0)
+         # Our fc2 is down_proj
+         custom_state[f"layers.{i}.mlp.fc2.weight"] = down
+
+     # 3. Final norm
+     custom_state["norm.weight"] = hf_state["model.norm.weight"]
+
+     # 4. LM head (tied with embeddings, but we still load it)
+     custom_state["lm_head.weight"] = hf_state["lm_head.weight"]
+
+     # Now load into the model
+     missing, unexpected = custom_model.load_state_dict(custom_state, strict=False)
+     return missing, unexpected
+
+
+ def test_weight_loading():
+     """
+     1. Build the custom SmolLM2 model (our implementation).
+     2. Build the HF reference model.
+     3. Load HF weights into our model via the mapping.
+     4. Run a small test prompt and compare logits.
+     """
+     device = "cuda" if torch.cuda.is_available() else "cpu"
+     print(f"Using device: {device}")
+
+     print("🟦 Building custom model...")
+     custom_model, cfg = build_custom_model()
+     custom_model.to(device)
+     custom_model.eval()
+
+     print("🟦 Building HF reference model...")
+     hf_model = build_hf_model()
+     hf_model.to(device)
+
+     print("🟦 Mapping HF weights into custom model...")
+     missing, unexpected = load_weights_from_hf(custom_model, hf_model)
+
+     print(f"Missing keys    : {len(missing)}")
+     print(f"Unexpected keys : {len(unexpected)}")
+     if missing:
+         print("  Missing examples:", missing[:5])
+     if unexpected:
+         print("  Unexpected examples:", unexpected[:5])
+
+     if len(missing) > 0:
+         print("⚠️ There are missing keys; the mapping may be incomplete.")
+     else:
+         print("✅ All expected parameters were assigned from HF weights.")
+
+     # 5. Test with a dummy input
+     tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_NAME)
+     prompt = "Hello, how are you?"
+     inputs = tokenizer(prompt, return_tensors="pt").to(device)
+
+     print("🟦 Running HF model forward...")
+     with torch.no_grad():
+         hf_logits = hf_model(**inputs).logits  # (B, T, V)
+
+     print("🟦 Running custom model forward...")
+     with torch.no_grad():
+         custom_logits, _ = custom_model(inputs["input_ids"])
+
+     # 6. Compare logits
+     # align dtypes
+     hf_logits = hf_logits.to(torch.float32)
+     custom_logits = custom_logits.to(torch.float32)
+
+     diff = torch.abs(hf_logits - custom_logits).max().item()
+     print(f"🔍 Max absolute difference between logits: {diff:.6f}")
+
+     if diff < 1e-4:
+         print("✅ SUCCESS: Outputs match very closely. Implementation is correct.")
+     elif diff < 1e-2:
+         print("🟡 Outputs are close but not identical; check for small implementation differences (e.g., RoPE details).")
+     else:
+         print("❌ Outputs differ significantly. Some part of the implementation is likely off.")
+
+     # 7. Print predictions from both models
+     print("\n📝 Predictions:")
+     print(f"Prompt: '{prompt}'")
+
+     # Get predicted token IDs (argmax on the vocabulary dimension)
+     hf_predicted_ids = hf_logits.argmax(dim=-1)          # (B, T)
+     custom_predicted_ids = custom_logits.argmax(dim=-1)  # (B, T)
+
+     # Get the next-token prediction (last position)
+     hf_next_token_id = hf_predicted_ids[0, -1].item()
+     custom_next_token_id = custom_predicted_ids[0, -1].item()
+
+     # Decode the next token
+     hf_next_token = tokenizer.decode([hf_next_token_id])
+     custom_next_token = tokenizer.decode([custom_next_token_id])
+
+     print(f"HF Model prediction (next token): '{hf_next_token}' (token_id: {hf_next_token_id})")
+     print(f"Custom Model prediction (next token): '{custom_next_token}' (token_id: {custom_next_token_id})")
+
+     # Also show full sequence predictions for comparison
+     hf_full_prediction = tokenizer.decode(hf_predicted_ids[0])
+     custom_full_prediction = tokenizer.decode(custom_predicted_ids[0])
+     print(f"\nHF Model full sequence prediction: '{hf_full_prediction}'")
+     print(f"Custom Model full sequence prediction: '{custom_full_prediction}'")
+
+
+ if __name__ == "__main__":
+     if len(sys.argv) < 2:
+         print("Usage: python test_model_implementation.py test_weight_loading")
+         sys.exit(1)
+
+     mode = sys.argv[1]
+
+     if mode == "test_weight_loading":
+         test_weight_loading()
+     else:
+         print(f"Unknown mode: {mode}")
train.py ADDED
@@ -0,0 +1,360 @@
+ """
+ Training script for SmolLM2-135M using PyTorch Lightning.
+
+ Training strategy from the paper:
+ - AdamW optimizer with (β1, β2) = (0.9, 0.95)
+ - Warmup Stable Decay (WSD) learning rate schedule:
+   - 2,000-step warmup phase
+   - Peak learning rate: 5.0 × 10^-4 (stable phase)
+   - Decay phase: reduce LR to zero over 10% of total training steps
+ """
+
+ import sys
+ import logging
+ from pathlib import Path
+ from datetime import datetime
+
+ import torch
+ import torch.nn as nn
+ from torch.utils.data import Dataset, DataLoader
+ import lightning as L
+ from lightning.pytorch.callbacks import ModelCheckpoint, LearningRateMonitor
+ from lightning.pytorch.loggers import TensorBoardLogger
+ from transformers import AutoTokenizer, AutoConfig
+
+ from model import SmolLM2, SmolConfig
+
+ # Module-level logger; re-bound in main() once file logging is configured.
+ logger = logging.getLogger(__name__)
+
+ # Setup logging
+ def setup_logging(log_dir: Path):
+     """Set up text file logging."""
+     log_dir.mkdir(parents=True, exist_ok=True)
+     log_file = log_dir / f"training_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
+
+     logging.basicConfig(
+         level=logging.INFO,
+         format='%(asctime)s - %(levelname)s - %(message)s',
+         handlers=[
+             logging.FileHandler(log_file),
+             logging.StreamHandler(sys.stdout)
+         ]
+     )
+     return logging.getLogger(__name__), log_file
+
+
+ class TextDataset(Dataset):
+     """Dataset for text data."""
+     def __init__(self, text_file: str, tokenizer, block_size: int = 512):
+         self.tokenizer = tokenizer
+         self.block_size = block_size
+
+         # Read and tokenize text
+         with open(text_file, 'r', encoding='utf-8') as f:
+             text = f.read()
+
+         # Tokenize
+         tokens = tokenizer.encode(text, add_special_tokens=False)
+         self.data = torch.tensor(tokens, dtype=torch.long)
+
+     def __len__(self):
+         return len(self.data) - self.block_size
+
+     def __getitem__(self, idx):
+         chunk = self.data[idx:idx + self.block_size + 1]
+         x = chunk[:-1]
+         y = chunk[1:]
+         return x, y
+
+
+ class WarmupStableDecayLR(L.Callback):
+     """
+     Warmup Stable Decay (WSD) learning rate schedule.
+     - Warmup: 2,000 steps in the paper; since we only train for 5,000 steps,
+       we use 20% of total steps as warmup (1,000 steps)
+     - Stable: maintain peak LR
+     - Decay: reduce to zero over 10% of total steps
+     """
+     def __init__(self, warmup_steps: int = 2000, peak_lr: float = 5e-4, total_steps: int = 5000):
+         super().__init__()
+         self.warmup_steps = warmup_steps
+         self.peak_lr = peak_lr
+         self.total_steps = total_steps
+         self.decay_steps = int(0.1 * total_steps)  # 10% of total steps
+         self.stable_steps = total_steps - warmup_steps - self.decay_steps
+
+     def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
+         current_step = trainer.global_step
+
+         if current_step < self.warmup_steps:
+             # Warmup phase: linear increase
+             lr = self.peak_lr * (current_step / self.warmup_steps)
+         elif current_step < self.warmup_steps + self.stable_steps:
+             # Stable phase: maintain peak LR
+             lr = self.peak_lr
+         else:
+             # Decay phase: linear decrease to zero
+             decay_start = self.warmup_steps + self.stable_steps
+             decay_progress = (current_step - decay_start) / self.decay_steps
+             lr = self.peak_lr * (1.0 - decay_progress)
+
+         # Update the learning rate. pl_module.optimizers() returns a single
+         # (Lightning-wrapped) optimizer or a list of them; normalize to a list.
+         optimizers = pl_module.optimizers()
+         if not isinstance(optimizers, (list, tuple)):
+             optimizers = [optimizers]
+         for opt in optimizers:
+             for param_group in opt.param_groups:
+                 param_group['lr'] = lr
+
+
+ class SmolLM2Module(L.LightningModule):
+     """PyTorch Lightning module for SmolLM2 training."""
+
+     def __init__(
+         self,
+         config: SmolConfig,
+         tokenizer,
+         block_size: int = 512,
+         warmup_steps: int = 2000,
+         peak_lr: float = 5e-4,
+         total_steps: int = 5000,
+         predict_every: int = 500,
+     ):
+         super().__init__()
+         self.save_hyperparameters(ignore=['tokenizer'])
+         self.config = config
+         self.tokenizer = tokenizer
+         self.block_size = block_size
+         self.warmup_steps = warmup_steps
+         self.peak_lr = peak_lr
+         self.total_steps = total_steps
+         self.predict_every = predict_every
+
+         # Initialize model
+         self.model = SmolLM2(config)
+
+         # Loss function
+         self.criterion = nn.CrossEntropyLoss()
+
+         # For generation
+         self.example_prompt = "First Citizen:"
+
+     def forward(self, input_ids, attention_mask=None):
+         logits, present_key_values = self.model(input_ids, attention_mask=attention_mask, use_cache=False)
+         return logits
+
+     def training_step(self, batch, batch_idx):
+         x, y = batch
+         logits = self.forward(x)
+
+         # Reshape for loss calculation
+         loss = self.criterion(logits.view(-1, logits.size(-1)), y.view(-1))
+
+         # Logging
+         self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True)
+
+         # Generate text every predict_every steps
+         if (self.global_step + 1) % self.predict_every == 0:
+             # Log the scalar loss to the text log so it shows up with generations
+             logger.info(f"Step {self.global_step + 1} | train_loss={loss.item():.4f}")
+             self.generate_and_log()
+
+         return loss
+
+     def generate_and_log(self):
+         """Generate text and log it."""
+         self.model.eval()
+         with torch.no_grad():
+             # Tokenize prompt
+             prompt_ids = self.tokenizer.encode(
+                 self.example_prompt,
+                 return_tensors='pt',
+                 add_special_tokens=False
+             ).to(self.device)
+
+             # Generate
+             generated_ids = self.model.generate(
+                 prompt_ids,
+                 max_new_tokens=50,
+                 temperature=0.8,
+                 top_k=50,
+             )
+
+             # Decode
+             generated_text = self.tokenizer.decode(
+                 generated_ids[0].cpu().tolist(),
+                 skip_special_tokens=True
+             )
+
+             # Log to console and file
+             logger.info(f"\n{'='*80}")
+             logger.info(f"Step {self.global_step + 1} - Generated text:")
+             logger.info(f"{generated_text}")
+             logger.info(f"{'='*80}\n")
+
+         self.model.train()
+
+     def configure_optimizers(self):
+         """Configure the AdamW optimizer."""
+         optimizer = torch.optim.AdamW(
+             self.parameters(),
+             lr=self.peak_lr,  # will be adjusted by the scheduler
+             betas=(0.9, 0.95),
+             weight_decay=0.01,
+         )
+
+         # WSD scheduler (implemented as a callback)
+         return optimizer
+
+     def on_train_start(self):
+         """Log a model summary at training start."""
+         # Count parameters
+         total_params = sum(p.numel() for p in self.model.parameters())
+         trainable_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
+
+         logger.info("\n" + "="*80)
+         logger.info("MODEL SUMMARY")
+         logger.info("="*80)
+         logger.info("Model: SmolLM2-135M")
+         logger.info(f"Total parameters: {total_params:,}")
+         logger.info(f"Trainable parameters: {trainable_params:,}")
+         logger.info(f"Block size: {self.block_size}")
+         logger.info(f"Warmup steps: {self.warmup_steps}")
+         logger.info(f"Peak learning rate: {self.peak_lr}")
+         logger.info(f"Total training steps: {self.total_steps}")
+         logger.info(f"Predict every: {self.predict_every} steps")
+         logger.info("="*80 + "\n")
+
+
+ def main():
+     # Configuration
+     data_file = Path("../data/input.txt").resolve()
+     output_dir = Path("./checkpoints")
+     log_dir = Path("./logs")
+     block_size = 512
+     batch_size = 4
+     num_workers = 8
+     max_steps = 3500
+     predict_every = 500
+     resume_from_checkpoint = "checkpoints/smollm2-step=01500-train_loss=3.6240.ckpt"  # set to a checkpoint path to resume, or None for fresh training
+
+     # Training hyperparameters from the paper
+     warmup_steps = 1000
+     peak_lr = 5e-4
+     total_steps = max_steps
+
+     # Setup logging
+     global logger
+     logger, log_file = setup_logging(log_dir)
+     logger.info(f"Logging to: {log_file}")
+
+     # Load tokenizer
+     logger.info("Loading tokenizer...")
+     tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+
+     # Allow SmolConfig to be deserialized from Lightning checkpoints when torch.load
+     # uses the weights_only=True default (torch>=2.6). This is safe because the class
+     # is defined locally in this project.
+     try:
+         torch.serialization.add_safe_globals([SmolConfig])  # type: ignore[attr-defined]
+     except Exception:
+         # Fallback for torch versions without add_safe_globals; Lightning will still
+         # load normally when weights_only=False.
+         pass
+
+     # Load the HF config and create the model config
+     logger.info("Loading model config...")
+     hf_config = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-135M")
+     config = SmolConfig.from_hf(hf_config)
+
+     # Create dataset
+     logger.info(f"Loading dataset from: {data_file}")
+     dataset = TextDataset(data_file, tokenizer, block_size=block_size)
+     dataloader = DataLoader(
+         dataset,
+         batch_size=batch_size,
+         shuffle=True,
+         num_workers=num_workers,
+         pin_memory=True,
+     )
+
+     # Create Lightning module
+     logger.info("Initializing model...")
+     model = SmolLM2Module(
+         config=config,
+         tokenizer=tokenizer,
+         block_size=block_size,
+         warmup_steps=warmup_steps,
+         peak_lr=peak_lr,
+         total_steps=total_steps,
+         predict_every=predict_every,
+     )
+
+     # Additional callback to ensure a checkpoint at the final step
+     class FinalCheckpointCallback(L.Callback):
+         def on_train_end(self, trainer, pl_module):
+             # Save final checkpoint
+             final_checkpoint_path = output_dir / f"smollm2-final-step-{trainer.global_step:05d}.ckpt"
+             trainer.save_checkpoint(str(final_checkpoint_path))
+             logger.info(f"Final checkpoint saved: {final_checkpoint_path}")
+
+     final_checkpoint_callback = FinalCheckpointCallback()
+
+     # Setup callbacks
+     checkpoint_callback = ModelCheckpoint(
+         dirpath=output_dir,
+         filename='smollm2-{step:05d}-{train_loss:.4f}',
+         monitor='train_loss',
+         save_top_k=3,
+         mode='min',
+         every_n_train_steps=predict_every,
+         save_last=True,
+         save_on_train_epoch_end=False,  # save based on steps, not epochs
+     )
+
+     lr_monitor = LearningRateMonitor(logging_interval='step')
+
+     wsd_scheduler = WarmupStableDecayLR(
+         warmup_steps=warmup_steps,
+         peak_lr=peak_lr,
+         total_steps=total_steps,
+     )
+
+     # Setup TensorBoard logger
+     tb_logger = TensorBoardLogger(
+         save_dir=log_dir,
+         name='tensorboard',
+     )
+
+     # Create trainer
+     trainer = L.Trainer(
+         max_steps=max_steps,
+         callbacks=[checkpoint_callback, lr_monitor, wsd_scheduler, final_checkpoint_callback],
+         logger=tb_logger,
+         accelerator='auto',
+         devices='auto',
+         # Set precision depending on device capabilities.
+         # bf16-mixed: CUDA; 32-true: others; MPS supports only 32-true.
+         precision='bf16-mixed' if torch.cuda.is_available() else '32-true',
+         gradient_clip_val=1.0,
+         log_every_n_steps=50,
+         enable_checkpointing=True,
+     )
+
+     # Train
+     logger.info("Starting training...")
+     if resume_from_checkpoint and Path(resume_from_checkpoint).exists():
+         logger.info(f"Resuming from checkpoint: {resume_from_checkpoint}")
+         trainer.fit(model, dataloader, ckpt_path=resume_from_checkpoint)
+     else:
+         trainer.fit(model, dataloader)
+
+     logger.info("Training completed!")
+     logger.info(f"Best checkpoint: {checkpoint_callback.best_model_path}")
+     logger.info(f"Last checkpoint: {checkpoint_callback.last_model_path}")
+
+
+ if __name__ == "__main__":
+     main()
uv.lock ADDED
The diff for this file is too large to render.