|
|
--- |
|
|
title: SmolLM2-135M Text Generator |
|
|
emoji: 🐨 |
|
|
colorFrom: yellow |
|
|
colorTo: blue |
|
|
sdk: gradio |
|
|
sdk_version: 6.0.1 |
|
|
app_file: app.py |
|
|
pinned: false |
|
|
short_description: A LLaMA-based SmolLM2-135M transformer (decoder only)
|
|
--- |
|
|
|
|
|
Hugging Face Space with the inference demo: https://huggingface.co/spaces/Sualeh77/smollm2-135m-trained-on-tinyShakespear-forfun
|
|
|
|
|
# SmolLM2-135M Implementation |
|
|
|
|
|
A from-scratch PyTorch implementation of the SmolLM2-135M language model, following the LLaMA architecture with modern optimizations. |
|
|
|
|
|
## Overview |
|
|
|
|
|
This repository contains a complete implementation of SmolLM2-135M, a 135-million-parameter decoder-only transformer model. The implementation includes:
|
|
|
|
|
- **Model Architecture** (`model.py`): Complete model definition with KV cache support |
|
|
- **Training Script** (`train.py`): PyTorch Lightning training with WSD scheduler |
|
|
- **Gradio App** (`app.py`): Interactive web interface for text generation |
|
|
|
|
|
## Model Architecture (`model.py`) |
|
|
|
|
|
### Architecture Components |
|
|
|
|
|
The model follows the LLaMA-style decoder-only transformer architecture with the following key components: |
|
|
|
|
|
#### 1. **SmolConfig** (Configuration Class) |
|
|
|
|
|
A dataclass that stores all model hyperparameters: |
|
|
|
|
|
```python
# Shared imports for the snippets below
import math
from dataclasses import dataclass

import torch
import torch.nn.functional as F
from torch import nn


@dataclass
class SmolConfig:
    vocab_size: int = 49152              # Vocabulary size
    hidden_size: int = 576               # Hidden dimension
    intermediate_size: int = 1536        # MLP intermediate dimension
    num_hidden_layers: int = 30          # Number of transformer layers
    num_attention_heads: int = 9         # Number of query heads
    num_key_value_heads: int = 3         # Number of key/value heads (GQA)
    max_position_embeddings: int = 8192  # Maximum sequence length
    rope_theta: float = 100000.0         # RoPE base frequency
    rms_norm_eps: float = 1e-5           # RMSNorm epsilon
    attention_bias: bool = False         # Whether to use bias in attention
    mlp_bias: bool = False               # Whether to use bias in MLP
    dtype: torch.dtype = torch.bfloat16
```
|
|
|
|
|
**Key Features:** |
|
|
- `head_dim` property: Automatically computes head dimension (hidden_size // num_attention_heads = 64) |
|
|
- `from_hf()` class method: Loads configuration from HuggingFace model config |
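A minimal sketch of those two members, assuming the HF config exposes the standard `LlamaConfig` field names (the exact body lives in `model.py`):

```python
@property
def head_dim(self) -> int:
    # 576 // 9 = 64 for the default configuration
    return self.hidden_size // self.num_attention_heads

@classmethod
def from_hf(cls, hf_config) -> "SmolConfig":
    return cls(
        vocab_size=hf_config.vocab_size,
        hidden_size=hf_config.hidden_size,
        intermediate_size=hf_config.intermediate_size,
        num_hidden_layers=hf_config.num_hidden_layers,
        num_attention_heads=hf_config.num_attention_heads,
        num_key_value_heads=hf_config.num_key_value_heads,
        max_position_embeddings=hf_config.max_position_embeddings,
        rope_theta=hf_config.rope_theta,
        rms_norm_eps=hf_config.rms_norm_eps,
    )
```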
|
|
|
|
|
#### 2. **RMSNorm** (Root Mean Square Normalization) |
|
|
|
|
|
Replaces LayerNorm with a more efficient normalization: |
|
|
|
|
|
```python
class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain
        self.eps = eps

    def forward(self, x):
        norm = x.pow(2).mean(dim=-1, keepdim=True)
        x = x * torch.rsqrt(norm + self.eps)
        return self.weight * x
```
|
|
|
|
|
**Benefits:** |
|
|
- More efficient than LayerNorm (no mean subtraction) |
|
|
- Used throughout the model for pre-norm architecture |
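A quick sanity check of the definition above (with the gain at its initial value of ones, every output vector should have unit RMS):

```python
import torch

norm = RMSNorm(dim=576)
x = torch.randn(2, 8, 576) * 5.0           # arbitrary input scale
rms = norm(x).pow(2).mean(dim=-1).sqrt()   # per-position RMS of the output
print(rms.min().item(), rms.max().item())  # both ~1.0
```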
|
|
|
|
|
#### 3. **RoPE** (Rotary Positional Embeddings) |
|
|
|
|
|
Rotary Position Embeddings applied to query and key tensors: |
|
|
|
|
|
```python
def build_rope_cache(seq_len, head_dim, base, device, dtype):
    # Compute cosine and sine caches for RoPE
    half_dim = head_dim // 2
    freq_seq = torch.arange(half_dim, device=device, dtype=torch.float32)
    inv_freq = 1.0 / (base ** (freq_seq / half_dim))
    t = torch.arange(seq_len, device=device, dtype=torch.float32)
    freqs = torch.outer(t, inv_freq)               # (seq_len, half_dim)
    cos = freqs.cos()[None, None, :, :].to(dtype)  # (1, 1, seq_len, half_dim)
    sin = freqs.sin()[None, None, :, :].to(dtype)
    return cos, sin


def apply_rope(x, cos, sin):
    # Rotate the two halves of the head dimension against each other
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    x1_rot = x1 * cos - x2 * sin
    x2_rot = x1 * sin + x2 * cos
    return torch.cat([x1_rot, x2_rot], dim=-1)
```
|
|
|
|
|
**Key Features:** |
|
|
- Relative positional encoding (no absolute position embeddings) |
|
|
- Applied only to Q and K (not V) |
|
|
- Supports efficient caching for inference |
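One property worth verifying under the definitions above: the rotation is norm-preserving, so RoPE changes the direction of Q and K without disturbing the scale of the attention logits. A small check:

```python
import torch

cos, sin = build_rope_cache(seq_len=16, head_dim=64, base=100000.0,
                            device="cpu", dtype=torch.float32)
q = torch.randn(1, 9, 16, 64)  # (batch, heads, seq, head_dim)
q_rot = apply_rope(q, cos, sin)

# cos^2 + sin^2 = 1, so each rotated pair keeps its length
torch.testing.assert_close(q_rot.norm(dim=-1), q.norm(dim=-1))
```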
|
|
|
|
|
#### 4. **MultiHeadSelfAttention** (Grouped Query Attention) |
|
|
|
|
|
Implements GQA (Grouped Query Attention) where: |
|
|
- **Query heads**: 9 (full attention) |
|
|
- **Key/Value heads**: 3 (shared across query heads) |
|
|
|
|
|
```python
class MultiHeadSelfAttention(nn.Module):
    def forward(self, x, cos, sin, past_key_value=None, use_cache=False):
        B, T, _ = x.shape

        # 1. Project to Q, K, V and split into heads: (B, heads, T, head_dim)
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)

        # 2. Apply RoPE to Q and K (V carries no positional information)
        q = apply_rope(q, cos, sin)
        k = apply_rope(k, cos, sin)

        # 3. KV cache: prepend cached keys/values along the sequence axis
        if past_key_value is not None:
            past_k, past_v = past_key_value
            k = torch.cat([past_k, k], dim=2)
            v = torch.cat([past_v, v], dim=2)
        present_key_value = (k, v) if use_cache else None

        # 4. GQA: repeat each K/V head to serve n_heads // n_kv_heads query heads
        if self.n_kv_heads < self.n_heads:
            repeat_factor = self.n_heads // self.n_kv_heads
            k = k.repeat_interleave(repeat_factor, dim=1)
            v = v.repeat_interleave(repeat_factor, dim=1)

        # 5. Scaled dot-product attention scores with causal masking
        T_k = k.size(2)  # total key length, including any cache
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        causal_mask = torch.triu(
            torch.full((T, T_k), float("-inf"), device=x.device), diagonal=1 + T_k - T
        )
        scores = scores + causal_mask

        # 6. Softmax, weighted sum, and merge heads back to (B, T, hidden)
        probs = F.softmax(scores, dim=-1)
        out = (probs @ v).transpose(1, 2).reshape(B, T, -1)
        out = self.o_proj(out)

        return out, present_key_value
```
|
|
|
|
|
**Key Features:** |
|
|
- **KV Cache**: Efficient inference by caching past key-value pairs |
|
|
- **GQA**: Reduces memory by sharing K/V heads (3:1 ratio) |
|
|
- **Causal Masking**: Prevents attending to future tokens |
|
|
- **RoPE Integration**: Positional encoding via rotary embeddings |
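The K/V cache savings from GQA are easy to estimate from the numbers above. A back-of-envelope sketch (assuming bfloat16, 2 bytes per element, at the full 8,192-token context):

```python
n_layers, n_kv_heads, head_dim, seq_len, bytes_per_el = 30, 3, 64, 8192, 2

# K and V tensors per layer, summed across all layers
kv_cache = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el
print(f"GQA (3 KV heads): {kv_cache / 2**20:.0f} MiB")      # 180 MiB
print(f"MHA (9 KV heads): {kv_cache * 3 / 2**20:.0f} MiB")  # 540 MiB
```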
|
|
|
|
|
#### 5. **SmolMLP** (SwiGLU Activation) |
|
|
|
|
|
Implements the SwiGLU (Swish-Gated Linear Unit) MLP: |
|
|
|
|
|
```python
class SmolMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        # fc1 produces the gate and the up projection in a single matmul
        self.fc1 = nn.Linear(config.hidden_size, 2 * config.intermediate_size,
                             bias=config.mlp_bias)
        self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size,
                             bias=config.mlp_bias)

    def forward(self, x):
        x = self.fc1(x)              # (B, T, 2 * 1536) = (B, T, 3072)
        x1, x2 = x.chunk(2, dim=-1)  # split into gate and value halves
        return self.fc2(F.silu(x1) * x2)  # SwiGLU: SiLU(x1) * x2
```
|
|
|
|
|
**Key Features:** |
|
|
- **SwiGLU**: `SiLU(x1) * x2` activation (better than ReLU/GELU) |
|
|
- **No bias**: Following LLaMA architecture |
|
|
- **Efficient**: Single matrix multiplication with split |
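The fused `fc1` is just the gate and up projections of the usual LLaMA MLP stacked into one weight matrix; a quick equivalence check (the split point and ordering are assumptions about this implementation):

```python
import torch
import torch.nn.functional as F
from torch import nn

fused = nn.Linear(576, 2 * 1536, bias=False)
gate = nn.Linear(576, 1536, bias=False)
up = nn.Linear(576, 1536, bias=False)
gate.weight.data = fused.weight[:1536].clone()  # first half = gate
up.weight.data = fused.weight[1536:].clone()    # second half = up

x = torch.randn(2, 8, 576)
x1, x2 = fused(x).chunk(2, dim=-1)
torch.testing.assert_close(F.silu(x1) * x2, F.silu(gate(x)) * up(x))
```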
|
|
|
|
|
#### 6. **SmolBlock** (Transformer Block) |
|
|
|
|
|
Combines attention and MLP with pre-norm and residual connections: |
|
|
|
|
|
```python
class SmolBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attn_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.attn = MultiHeadSelfAttention(config)
        self.mlp_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.mlp = SmolMLP(config)

    def forward(self, x, cos, sin, past_key_value=None, use_cache=False):
        # Pre-norm attention with residual
        attn_out, present_kv = self.attn(
            self.attn_norm(x), cos, sin,
            past_key_value=past_key_value, use_cache=use_cache
        )
        x = x + attn_out

        # Pre-norm MLP with residual
        x = x + self.mlp(self.mlp_norm(x))

        return x, present_kv
```
|
|
|
|
|
**Architecture:** |
|
|
- **Pre-norm**: Normalization before attention/MLP (not after) |
|
|
- **Residual connections**: Skip connections for gradient flow |
|
|
- **KV Cache passthrough**: Supports efficient inference |
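To make the pre-norm point concrete, here is a toy contrast with the post-norm arrangement of the original Transformer (illustration only, using a linear layer as a stand-in sublayer):

```python
import torch
from torch import nn

f = nn.Linear(576, 576)     # stand-in for attention or the MLP
norm = nn.LayerNorm(576)
x = torch.randn(2, 8, 576)

pre_norm = x + f(norm(x))   # SmolBlock: the residual stream is never normalized
post_norm = norm(x + f(x))  # original Transformer: norm sits on the residual path
```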
|
|
|
|
|
#### 7. **SmolLM2** (Main Model) |
|
|
|
|
|
Top-level model that combines all components: |
|
|
|
|
|
```python
class SmolLM2(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleList(
            [SmolBlock(config) for _ in range(config.num_hidden_layers)]  # 30 layers
        )
        self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        # Weight tying: share embedding and output weights
        self.lm_head.weight = self.embed_tokens.weight

    def forward(self, input_ids, past_key_values=None, use_cache=False):
        # 1. Token embeddings
        x = self.embed_tokens(input_ids)

        # 2. Build RoPE cache (offset by the cached length during generation)
        cos, sin = build_rope_cache(...)

        # 3. Pass through transformer layers
        present_key_values = []
        for i, layer in enumerate(self.layers):
            past_kv = past_key_values[i] if past_key_values is not None else None
            x, present_kv = layer(x, cos, sin,
                                  past_key_value=past_kv, use_cache=use_cache)
            if use_cache:
                present_key_values.append(present_kv)

        # 4. Final norm and language modeling head
        x = self.norm(x)
        logits = self.lm_head(x)

        return logits, present_key_values
```
|
|
|
|
|
**Key Features:** |
|
|
- **Weight Tying**: Embeddings and output weights are shared (reduces parameters) |
|
|
- **KV Cache Support**: Full support for efficient autoregressive generation |
|
|
- **30 Layers**: Deep transformer stack for capacity |
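With weight tying, the embedding matrix (49,152 × 576 ≈ 28.3M weights) is counted only once; a quick check against the advertised size (`parameters()` deduplicates shared tensors):

```python
model = SmolLM2(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ~134.5M
```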
|
|
|
|
|
#### 8. **Generate Method** (Text Generation) |
|
|
|
|
|
Autoregressive text generation with KV cache: |
|
|
|
|
|
```python
@torch.no_grad()
def generate(self, input_ids, max_new_tokens=100, temperature=1.0,
             top_k=None, top_p=None, eos_token_id=None):
    generated = input_ids
    past_key_values = None

    for _ in range(max_new_tokens):
        # Forward pass: only the newest token once a KV cache exists
        logits, past_key_values = self.forward(
            generated[:, -1:] if past_key_values is not None else generated,
            past_key_values=past_key_values,
            use_cache=True
        )

        # Sample next token with temperature, top-k, top-p
        next_token_logits = logits[:, -1, :] / temperature
        # Apply top-k / top-p filtering here (see the sketch below)
        probs = F.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)

        generated = torch.cat([generated, next_token], dim=1)

        if eos_token_id is not None and (next_token == eos_token_id).all():
            break

    return generated
```
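The filtering step elided above can be implemented as a standalone helper. A minimal sketch (the `filter_logits` name and its exact placement in `generate()` are assumptions, not the repository's actual code):

```python
import torch
import torch.nn.functional as F


def filter_logits(logits, top_k=None, top_p=None):
    """Set logits outside the top-k / nucleus set to -inf. logits: (B, vocab)."""
    if top_k is not None:
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        probs = F.softmax(sorted_logits, dim=-1)
        # Drop tokens whose *exclusive* cumulative probability exceeds top_p,
        # so the first token crossing the threshold is always kept
        mask = probs.cumsum(dim=-1) - probs > top_p
        sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(
            -1, sorted_idx, sorted_logits
        )
    return logits
```

In `generate()`, this would be applied to `next_token_logits` just before the softmax.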
|
|
|
|
|
**Key Features:** |
|
|
- **KV Cache**: Only processes new tokens (not entire sequence) |
|
|
- **Sampling**: Supports temperature, top-k, and top-p (nucleus) sampling |
|
|
- **Efficient**: After the prompt is processed, each step runs the network on a single token (attention still reads the cached K/V, so per-token cost grows with context length)
|
|
|
|
|
### Model Specifications |
|
|
|
|
|
| Parameter | Value | |
|
|
|-----------|-------| |
|
|
| **Total Parameters** | ~135M | |
|
|
| **Hidden Size** | 576 | |
|
|
| **Layers** | 30 | |
|
|
| **Attention Heads** | 9 (Q), 3 (K/V) | |
|
|
| **Head Dimension** | 64 | |
|
|
| **Intermediate Size** | 1536 | |
|
|
| **Vocabulary Size** | 49,152 | |
|
|
| **Max Sequence Length** | 8,192 | |
|
|
| **RoPE Theta** | 100,000 | |
|
|
| **Activation** | SwiGLU (SiLU-gated) | |
|
|
| **Normalization** | RMSNorm | |
|
|
| **Weight Tying** | Yes (embeddings = output) | |
|
|
|
|
|
### Key Design Choices |
|
|
|
|
|
1. **GQA (Grouped Query Attention)**: 3:1 head ratio cuts the K/V cache to one third (~67% memory savings)
|
|
2. **Pre-norm Architecture**: More stable training than post-norm |
|
|
3. **RMSNorm**: Faster and simpler than LayerNorm |
|
|
4. **RoPE**: Relative positional encoding, no learned embeddings |
|
|
5. **SwiGLU**: Better activation than ReLU/GELU |
|
|
6. **Weight Tying**: Reduces parameters and improves generalization |
|
|
7. **No Biases**: Following LLaMA, reduces parameters slightly |
|
|
|
|
|
### Usage Example |
|
|
|
|
|
```python
import torch
from transformers import AutoConfig, AutoTokenizer

from model import SmolConfig, SmolLM2

# Load config from HuggingFace
hf_config = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM2-135M")
config = SmolConfig.from_hf(hf_config)

# Create model
model = SmolLM2(config)

# Forward pass (training)
input_ids = torch.randint(0, config.vocab_size, (2, 512))
logits, _ = model(input_ids, use_cache=False)

# Text generation (inference with KV cache)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
prompt_ids = tokenizer("Hello, how are you?", return_tensors="pt").input_ids
generated = model.generate(
    prompt_ids,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
)
print(tokenizer.decode(generated[0]))
```
|
|
|
|
|
## Training |
|
|
|
|
|
See `README_TRAINING.md` for detailed training instructions. |
|
|
|
|
|
## Inference |
|
|
|
|
|
See `app.py` for the Gradio web interface or use the `generate()` method directly. |
|
|
|
|
|
## References |
|
|
|
|
|
- [SmolLM2 Paper](https://arxiv.org/abs/2406.02528) |
|
|
- [LLaMA Architecture](https://arxiv.org/abs/2302.13971) |
|
|
- [RoPE: Rotary Position Embedding](https://arxiv.org/abs/2104.09864) |
|
|
- [SwiGLU Activation](https://arxiv.org/abs/2002.05202) |
|
|
|