pranikz committed · Commit 8c8e16e · verified · 1 Parent(s): e0c1c1e

Update README.md

Files changed (1): README.md +254 -3

README.md CHANGED
@@ -1,3 +1,254 @@
- ---
- license: mit
- ---

---
license: mit
---
# 🚀 Small Language Model (SLM) from Scratch — Explained

This notebook builds, trains, and runs a **small Transformer-based language model (mini GPT)** on a movie scripts dataset.
It is written for someone who knows **basic ML/DL** but is new to **LLMs**.

---

## 1. Dataset & Preprocessing

```python
from datasets import load_dataset
import tiktoken
import numpy as np

# Load dataset
ds = load_dataset("IsmaelMousa/movies")

# Split into train/val
ds = ds['train'].train_test_split(test_size=0.1, seed=42)

# Tokenizer (GPT-2 BPE)
enc = tiktoken.get_encoding("gpt2")

def process(example):
    ids = enc.encode_ordinary(example['Script'])  # encode without special tokens
    return {'ids': ids, 'len': len(ids)}

# Tokenize every script, dropping the raw text columns
tokenized = ds.map(process, remove_columns=['Name', 'Script'])
```

🔹 Dataset = movie scripts → tokenized into IDs → saved in `.bin` files for fast training.
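
The export to `.bin` itself isn't shown in the snippet above. A minimal sketch of how the tokenized IDs are typically flushed to disk (nanoGPT-style `np.memmap` buffers; the file names, the `uint16` dtype, and the `write_bin` helper are illustrative assumptions, not the notebook's exact code):

```python
import numpy as np

def write_bin(dset, out_path):
    # Total number of tokens in this split
    total_len = int(np.sum(dset['len'], dtype=np.uint64))
    # uint16 suffices because the GPT-2 vocabulary (50257 tokens) fits in 16 bits
    arr = np.memmap(out_path, dtype=np.uint16, mode='w+', shape=(total_len,))
    idx = 0
    for example in dset:
        ids = np.array(example['ids'], dtype=np.uint16)
        arr[idx:idx + len(ids)] = ids   # append this script's tokens
        idx += len(ids)
    arr.flush()                         # make sure everything hits disk

write_bin(tokenized['train'], 'train.bin')
write_bin(tokenized['test'], 'val.bin')   # train_test_split names the held-out split 'test'
```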

---

## 2. Create Input-Output Batches

The model trains on fixed-length chunks (`block_size`) of tokens.
Each batch contains an input sequence `X` and a target sequence `Y`, where `Y` is `X` shifted by one position (the next-token labels).

```python
def get_batch(split):
    data = train_data if split == 'train' else val_data
    # Random starting offsets for each sequence in the batch
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i+block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i+1:i+block_size+1].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)
```

🔹 This is how we feed training data: **chunks of movie script → the model learns to predict the next token**.
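
`train_data`, `val_data`, `block_size`, `batch_size`, and `device` are notebook-level globals that `get_batch` relies on. A minimal sketch of how they are typically set up from the `.bin` files (the concrete values here are illustrative assumptions, not the notebook's exact hyperparameters):

```python
import numpy as np
import torch

block_size = 128   # context length in tokens (assumed value)
batch_size = 32    # sequences per batch (assumed value)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Memory-map the token files so the whole corpus never has to sit in RAM
train_data = np.memmap('train.bin', dtype=np.uint16, mode='r')
val_data = np.memmap('val.bin', dtype=np.uint16, mode='r')

xb, yb = get_batch('train')
print(xb.shape, yb.shape)   # torch.Size([32, 128]) torch.Size([32, 128])
```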

---

## 3. Model Architecture

The model is a **stack of Transformer blocks**, similar to GPT-2.
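
The classes below read their hyperparameters from a `config` object and assume the usual PyTorch imports. The notebook's exact config is not reproduced in this README; a minimal sketch of what it plausibly looks like (field names follow GPT-2/nanoGPT conventions, values are assumptions):

```python
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F

@dataclass
class GPTConfig:
    vocab_size: int = 50257   # GPT-2 BPE vocabulary size
    block_size: int = 128     # maximum context length
    n_layer: int = 6          # number of Transformer blocks
    n_head: int = 6           # attention heads per block
    n_embd: int = 384         # embedding / hidden dimension
    bias: bool = True         # use bias terms in Linear/LayerNorm

config = GPTConfig()
```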

### (a) LayerNorm
```python
class LayerNorm(nn.Module):
    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
    def forward(self, x):
        return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)
```
- Normalizes features → stabilizes training.
- Like BatchNorm, but per token, not per batch.
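
A quick way to see the "per token, not per batch" point is to check statistics along the last (feature) dimension. This small check is not from the notebook; it should print a mean near 0 and a standard deviation near 1 for every token position:

```python
ln = LayerNorm(ndim=config.n_embd, bias=config.bias)
x = torch.randn(2, 8, config.n_embd)   # (batch, tokens, features)
out = ln(x)
print(out.mean(dim=-1).abs().max())    # ~0: each token's features are centered
print(out.std(dim=-1).mean())          # ~1: and rescaled to unit variance
```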

---

### (b) Causal Self-Attention
```python
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)  # QKV
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # Reshape into multi-heads: (B, n_head, T, head_dim)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        # Masked self-attention (causal: no peeking forward)
        att = (q @ k.transpose(-2, -1)) / (C // self.n_head)**0.5
        mask = torch.tril(torch.ones(T, T, device=x.device))
        att = att.masked_fill(mask == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v

        # Recombine heads
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
```
- Lets each token "attend" to previous tokens.
- Causal masking ensures left-to-right generation.
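
To make the mask concrete, here is a tiny standalone check (not from the notebook) for a sequence of 4 tokens; row *i* of the softmaxed attention can only put weight on positions up to *i*:

```python
T = 4
mask = torch.tril(torch.ones(T, T))
print(mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])

attn = CausalSelfAttention(config)
out = attn(torch.randn(1, T, config.n_embd))
print(out.shape)   # torch.Size([1, 4, 384]): same shape in, same shape out
```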

---

### (c) MLP
```python
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
```
- Expands hidden dim by 4x, then projects back.
- Adds non-linear transformation.


---

### (d) Transformer Block
```python
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = LayerNorm(config.n_embd, config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln2 = LayerNorm(config.n_embd, config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # Residual
        x = x + self.mlp(self.ln2(x))   # Residual
        return x
```
- Core Transformer block = `[Norm → Attention → Residual → Norm → MLP → Residual]`.


---

### (e) GPT Model
```python
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config  # kept so generate() can read block_size
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),  # token embedding
            wpe = nn.Embedding(config.block_size, config.n_embd),  # position embedding
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = LayerNorm(config.n_embd, config.bias),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.transformer.wte.weight = self.lm_head.weight  # weight tying

    def forward(self, idx, targets=None):
        b, t = idx.size()
        tok_emb = self.transformer.wte(idx)
        pos_emb = self.transformer.wpe(torch.arange(0, t, device=idx.device))
        x = tok_emb + pos_emb
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)

        if targets is None:
            return logits, None
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        return logits, loss
```
- Input tokens → embeddings + positional encoding → Transformer blocks → logits over vocab.
- If `targets` are provided → compute cross-entropy loss.
- Otherwise → just output logits for generation.
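
Putting the pieces together, here is a quick smoke test (not from the notebook; it uses the illustrative `GPTConfig` values above):

```python
model = GPT(config).to(device)

# Count trainable parameters; weight tying means wte and lm_head share one matrix
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params/1e6:.1f}M parameters")

# One forward pass on a batch: logits are (batch, tokens, vocab_size)
xb, yb = get_batch('train')
logits, loss = model(xb, yb)
print(logits.shape, loss.item())
```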

---

### (f) Generation
```python
@torch.no_grad()
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
    # (method of the GPT class above)
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -self.config.block_size:]           # crop context to block_size
        logits, _ = self(idx_cond)
        logits = logits[:, -1, :] / temperature               # last position only
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float('Inf')       # keep only the top-k logits
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)    # sample the next token
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
```
- Autoregressively generates tokens.
- Uses `temperature` (randomness) and `top_k` (restricts sampling to the k most likely tokens).


---

## 4. Training

- **Loss**: Cross-Entropy (predict the next token).
- **Optimizer**: AdamW (with tuned betas and weight decay).
- **Scheduler**: Warmup + Cosine Decay.
- **Mixed Precision + Gradient Accumulation** for efficiency.

A condensed sketch of such a loop is shown below.
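
This is not the notebook's exact loop; it is a minimal sketch of the ingredients listed above (AdamW, warmup + cosine schedule, autocast mixed precision, gradient accumulation), with all step counts and hyperparameters assumed for illustration:

```python
import math

import torch

max_iters = 5000          # assumed
warmup_iters = 200        # assumed
grad_accum_steps = 4      # assumed
base_lr, min_lr = 3e-4, 3e-5

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=(device == 'cuda'))

def get_lr(it):
    # Linear warmup, then cosine decay down to min_lr
    if it < warmup_iters:
        return base_lr * (it + 1) / warmup_iters
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

for it in range(max_iters):
    for group in optimizer.param_groups:
        group['lr'] = get_lr(it)

    # Gradient accumulation: several small batches per optimizer step
    for _ in range(grad_accum_steps):
        xb, yb = get_batch('train')
        with torch.autocast(device_type='cuda' if device == 'cuda' else 'cpu',
                            dtype=torch.float16 if device == 'cuda' else torch.bfloat16):
            _, loss = model(xb, yb)
        scaler.scale(loss / grad_accum_steps).backward()

    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
```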

---

## 5. Monitoring

```python
import matplotlib.pyplot as plt

plt.plot(train_loss_list, 'g', label='train_loss')
plt.plot(validation_loss_list, 'r', label='validation_loss')
plt.xlabel("Steps - Every 100 epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()
```
- Green = training loss, Red = validation loss.
- Watch for overfitting / underfitting.
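
`train_loss_list` and `validation_loss_list` are filled in during training. The exact bookkeeping isn't reproduced in this README; a typical pattern (the `estimate_loss` helper, the `eval_iters` value, and the 100-step interval are assumptions) looks like this:

```python
@torch.no_grad()
def estimate_loss(split, eval_iters=20):
    # Average the loss over a handful of random batches from one split
    model.eval()
    losses = torch.zeros(eval_iters)
    for i in range(eval_iters):
        xb, yb = get_batch(split)
        _, loss = model(xb, yb)
        losses[i] = loss.item()
    model.train()
    return losses.mean().item()

train_loss_list, validation_loss_list = [], []

# Called periodically from the training loop (e.g. every 100 iterations):
train_loss_list.append(estimate_loss('train'))
validation_loss_list.append(estimate_loss('val'))
```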

---

## 6. Inference

```python
# Load the best checkpoint
model = GPT(config)
model.load_state_dict(torch.load("best_model_params.pt", map_location=device))
model.to(device)
model.eval()

# Prompt
sentence = "Write a Tarantino-style diner scene with two strangers..."
context = torch.tensor(enc.encode_ordinary(sentence)).unsqueeze(0).to(device)

# Generate (a shorter length is recommended)
y = model.generate(context, max_new_tokens=300, temperature=0.8, top_k=50)
print(enc.decode(y[0].tolist()))
```

⚠️ Note: in the notebook, `max_new_tokens=5000` was used, which may be excessive.
For practical testing, use **200–500 tokens**.


---

## ✅ Summary

- **Architecture**: GPT-like Transformer (attention + MLP blocks).
- **Training**: Next-token prediction with AdamW + LR scheduling.
- **Evaluation**: Loss curves (train vs val).
- **Inference**: Autoregressive generation with temperature & top-k control.

This is essentially a **mini GPT-2 clone**, scaled down for small datasets like movie scripts.