Commit ad4cb5e by quanghuynt14 (verified)
1 Parent(s): f1df881

📚 Complete 20-week study plan: Learn LLMs from First Principles

  1. README.md +980 -0
README.md ADDED
@@ -0,0 +1,980 @@
1
+ ---
2
+ title: LLM from First Principles
3
+ emoji: 🧠
4
+ colorFrom: blue
5
+ colorTo: purple
6
+ sdk: static
7
+ pinned: true
8
+ tags:
9
+ - llm
10
+ - tutorial
11
+ - learning-path
12
+ - transformers
13
+ - from-scratch
14
+ ---
15
+
+ # 🧠 Learn LLMs from First Principles: A Complete Study Plan
17
+
18
+ > **Goal:** Go from zero to confidently understanding, building, fine-tuning, aligning, and deploying Large Language Models.
19
+ >
+ > **Time commitment:** ~20 weeks (10–15 hrs/week), flexible and self-paced.
21
+ >
22
+ > **Prerequisites:** Python proficiency. Basic calculus & linear algebra (or willingness to learn in Week 1).
23
+
24
+ ---
25
+
26
+ ## Table of Contents
27
+
28
+ - [Overview & Philosophy](#overview--philosophy)
29
+ - [Phase 1: Mathematical & Neural Network Foundations (Weeks 1–3)](#phase-1-mathematical--neural-network-foundations-weeks-13)
30
+ - [Phase 2: The Transformer Architecture from Scratch (Weeks 4–6)](#phase-2-the-transformer-architecture-from-scratch-weeks-46)
31
+ - [Phase 3: Language Modeling & Pretraining (Weeks 7–9)](#phase-3-language-modeling--pretraining-weeks-79)
32
+ - [Phase 4: The Hugging Face Ecosystem (Weeks 10–12)](#phase-4-the-hugging-face-ecosystem-weeks-1012)
33
+ - [Phase 5: Fine-Tuning & Alignment (Weeks 13–16)](#phase-5-fine-tuning--alignment-weeks-1316)
34
+ - [Phase 6: Advanced Topics & Capstone (Weeks 17–20)](#phase-6-advanced-topics--capstone-weeks-1720)
35
+ - [Reading List: Landmark Papers](#reading-list-landmark-papers)
36
+ - [All Resources & Links](#all-resources--links)
37
+ - [Appendix: Glossary of Key Terms](#appendix-glossary-of-key-terms)
38
+
39
+ ---
40
+
41
+ ## Overview & Philosophy
42
+
43
+ This plan is built on one principle: **understand by building**. Rather than memorizing APIs, you will:
44
+
45
+ 1. **Derive** the math (attention, backpropagation, loss functions)
46
+ 2. **Implement** each concept in raw PyTorch before using libraries
47
+ 3. **Train** real models on real data
48
+ 4. **Read** the original papers that introduced each idea
49
+
50
+ The plan follows a bottom-up progression:
51
+
+ ```
+ Math Foundations ──► Neural Networks ──► Transformers ──► Language Models
+     │                    │                   │                │
+     ▼                    ▼                   ▼                ▼
+ Linear Algebra       Backprop &          Self-Attention   GPT-2 from
+ Calculus             Gradient Descent    Multi-Head Attn  Scratch
+ Probability          MLP, RNN, LSTM      LayerNorm, FFN   Tokenization
+                                          Pos. Encoding    Pretraining
+
+ ──► Fine-Tuning ──► Alignment ──► Reasoning ──► Agents & Deployment
+         │               │             │                 │
+         ▼               ▼             ▼                 ▼
+     SFT, LoRA       RLHF, DPO     GRPO, R1          Tool-use, RAG
+     Chat Templates  Reward Models Chain-of-Thought  Deployment
+     PEFT            PPO           Open R1           Quantization
+ ```
68
+
69
+ ---
70
+
71
+ ## Phase 1: Mathematical & Neural Network Foundations (Weeks 1–3)
72
+
73
+ ### 🎯 Goal
74
+ Build intuition for the math that powers every neural network, then implement a neural network from scratch.
75
+
76
+ ---
77
+
78
+ ### Week 1: Linear Algebra, Calculus & Probability
79
+
80
+ > *"If you don't understand the math, you're just calling functions."*
81
+
82
+ #### Study Material
83
+
84
+ | Resource | Topic | Time | Link |
85
+ |----------|-------|------|------|
86
+ | 3Blue1Brown: Essence of Linear Algebra | Vectors, matrices, transformations, eigenvectors | ~3.5 hrs (16 videos) | [YouTube Playlist](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab) |
87
+ | 3Blue1Brown: Essence of Calculus | Derivatives, integrals, chain rule | ~3 hrs (12 videos) | [YouTube Playlist](https://www.youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr) |
88
+ | StatQuest: Probability Fundamentals | Bayes theorem, distributions, MLE | ~2 hrs | [YouTube Channel](https://www.youtube.com/@statquest) |
89
+
90
+ #### Key Concepts to Master
91
+
+ - [ ] Matrix multiplication: what it *means* geometrically (not just how to compute it)
+ - [ ] Dot product: as a projection and a similarity measure
+ - [ ] Softmax function: converting raw scores to probabilities
+ - [ ] Chain rule: the foundation of backpropagation
+ - [ ] Cross-entropy loss: measuring how far a predicted probability distribution is from the target
+ - [ ] Maximum likelihood estimation: why we minimize negative log-likelihood
+
+ #### 📝 Exercise
100
+ Implement these in Python (no NumPy):
+ ```python
+ def matmul(A, B): ...                 # Matrix multiplication from scratch
+ def softmax(x): ...                   # Softmax function
+ def cross_entropy(pred, target): ...  # Cross-entropy loss
+ ```
106
+
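One possible way to fill in the three stubs above is sketched here, using only the standard library. It assumes matrices are lists of rows and that `target` is the index of the correct class; both conventions are choices made for this sketch, not something the exercise prescribes. Try the exercise yourself before comparing.

```python
import math


def matmul(A, B):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    if len(A[0]) != len(B):
        raise ValueError("inner dimensions must match")
    # zip(*B) iterates over the columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(x):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(pred, target):
    """Negative log-probability that `pred` assigns to class index `target` (assumed convention)."""
    return -math.log(pred[target])
```

Subtracting the maximum inside `softmax` does not change the result, because the common factor cancels in the ratio, but it keeps `math.exp` from overflowing on large scores.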
107
+ ---
108
+
109
+ ### Week 2: Neural Networks from Scratch
110
+
111
+ #### Study Material
112
+
113
+ | Resource | Topic | Time | Link |
114
+ |----------|-------|------|------|
115
+ | 3Blue1Brown: Neural Networks | What is a neural network, gradient descent, backprop | ~1 hr (4 videos) | [YouTube Playlist](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) |
116
+ | 3Blue1Brown: Attention in Transformers | Visual explanation of attention mechanism | ~45 min | [YouTube](https://www.youtube.com/watch?v=eMlx5fFNoYc) |
+ | Karpathy: The spelled-out intro to neural networks and backpropagation | Build micrograd, a tiny autograd engine | ~2.5 hrs | [YouTube](https://www.youtube.com/watch?v=VMj-3S1tku0) |
118
+
119
+ #### Key Concepts to Master
120
+
+ - [ ] Forward pass: computing outputs layer by layer
+ - [ ] Loss function: quantifying prediction error
+ - [ ] Backward pass (backpropagation): computing gradients via the chain rule
+ - [ ] Gradient descent: updating weights to minimize loss
+ - [ ] Learning rate: step size for weight updates
+ - [ ] Computational graph: tracking operations for automatic differentiation
+
+ #### 🔨 Project: Build micrograd
129
+ Follow Karpathy's video and build a complete autograd engine from scratch:
130
+ - `Value` class with `__add__`, `__mul__`, `__pow__`
131
+ - Automatic gradient computation via `.backward()`
132
+ - Train a simple MLP on a toy dataset
133
+ - **Repo:** [github.com/karpathy/micrograd](https://github.com/karpathy/micrograd)
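As a rough orientation for the project, here is a stripped-down sketch of a `Value` class with the three operators listed above and a `.backward()` method. It is written from the general idea of reverse-mode autodiff and is not a copy of the micrograd source; the private attribute names (`_prev`, `_backward`) are choices made for this sketch.

```python
class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)       # nodes this value was computed from
        self._backward = lambda: None     # local chain-rule step, set by each op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def __pow__(self, k):
        # k is assumed to be a plain int or float, not a Value
        out = Value(self.data ** k, (self,))

        def _backward():
            self.grad += k * self.data ** (k - 1) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse order.
        topo, visited = [], set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)

        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


a, b = Value(2.0), Value(3.0)
c = a * b + a ** 2   # c = ab + a^2
c.backward()         # dc/da = b + 2a, dc/db = a
```

Gradients are accumulated with `+=` so that a value used more than once (like `a` here) receives the sum of the contributions from every path.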
134
+
135
+ ---
136
+
137
+ ### Week 3: PyTorch Fundamentals & MLPs
138
+
139
+ #### Study Material
140
+
141
+ | Resource | Topic | Time | Link |
142
+ |----------|-------|------|------|
143
+ | PyTorch Official: 60-min Blitz | Tensors, autograd, nn.Module | ~2 hrs | [pytorch.org/tutorials](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) |
144
+ | fast.ai Lesson 1 | Practical deep learning, training loops | ~2 hrs | [course.fast.ai](https://course.fast.ai/) |
145
+ | Karpathy: makemore Part 1 (Bigrams) | Character-level language model, counting | ~1.5 hrs | [YouTube](https://www.youtube.com/watch?v=PaCmpygFfXo) |
146
+ | Karpathy: makemore Part 2 (MLP) | MLP language model (Bengio et al., 2003) | ~1.5 hrs | [YouTube](https://www.youtube.com/watch?v=TCH_1BHY58I) |
147
+
148
+ #### Key Concepts to Master
149
+
+ - [ ] `torch.Tensor`: creation, indexing, broadcasting, GPU transfer
+ - [ ] `torch.nn.Module`: defining layers, forward method
+ - [ ] `torch.optim`: SGD, Adam
+ - [ ] `loss.backward()`: automatic differentiation
+ - [ ] Training loop: forward → loss → backward → step → zero_grad
+ - [ ] Bigram model: the simplest language model (predicting the next character from the previous one)
+ - [ ] MLP language model: predicting the next character from a window of previous characters
+
+ #### 🔨 Project: Character-level Language Model
159
+ Follow makemore Parts 1 & 2:
160
+ 1. Build a bigram model (lookup table / counting approach)
161
+ 2. Build an MLP model that takes N characters and predicts the next
162
+ 3. Train on a names dataset and generate new names
163
+ - **Repo:** [github.com/karpathy/makemore](https://github.com/karpathy/makemore)
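The counting approach in step 1 can be sketched without PyTorch at all. The version below is a minimal illustration, assuming `.` is used as a combined start/end marker and a tiny hard-coded word list stands in for the names dataset; the function names are made up for this sketch.

```python
import random
from collections import defaultdict


def train_bigram(words):
    """Count how often each character follows each other character."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        chars = ["."] + list(w) + ["."]   # "." marks both start and end (assumption)
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    return counts


def sample_name(counts, rng):
    """Generate one name by repeatedly sampling the next character from the counts."""
    out, ch = [], "."
    while True:
        followers = counts[ch]
        options = list(followers)
        ch = rng.choices(options, weights=[followers[c] for c in options])[0]
        if ch == ".":
            return "".join(out)
        out.append(ch)


counts = train_bigram(["emma", "ava"])
name = sample_name(counts, random.Random(0))
```

Sampling in proportion to raw counts is equivalent to normalizing each row into a probability distribution first, which is the step the tensor-based version in the video makes explicit.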
164
+
165
+ ---
166
+
167
+ ## Phase 2: The Transformer Architecture from Scratch (Weeks 4–6)
168
+
169
+ ### 🎯 Goal
170
+ Understand every component of the Transformer and implement it from scratch in PyTorch.
171
+
172
+ ---
173
+
174
+ ### Week 4: Deep Dive into Training Dynamics
175
+
176
+ #### Study Material
177
+
178
+ | Resource | Topic | Time | Link |
179
+ |----------|-------|------|------|
180
+ | Karpathy: makemore Part 3 | Batch normalization, weight initialization, activation statistics | ~1.5 hrs | [YouTube](https://www.youtube.com/watch?v=P6sfmUTpUmc) |
+ | Karpathy: makemore Part 4 | Becoming a backprop ninja: manual backpropagation | ~1.5 hrs | [YouTube](https://www.youtube.com/watch?v=q8SA3rM6ckI) |
182
+ | Karpathy: makemore Part 5 | WaveNet-style deep networks, dilated causal convolutions | ~1 hr | [YouTube](https://www.youtube.com/watch?v=t3YJ5hKiMQ0) |
183
+
184
+ #### Key Concepts to Master
185
+
+ - [ ] Batch normalization: why it stabilizes training, running mean/var
+ - [ ] Weight initialization: Kaiming/He init, why it matters
+ - [ ] Activation statistics: dead neurons, saturation, vanishing gradients
+ - [ ] Manual backpropagation: computing gradients by hand through every operation
+ - [ ] Residual connections: why `x + f(x)` helps deep networks
+ - [ ] Hierarchical models: building deeper architectures
+
+ #### 📝 Exercise
194
+ Take your makemore MLP and manually compute the gradient of every parameter by hand (no `.backward()`). Verify against PyTorch autograd.
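The same verify-your-gradients habit can be practised on a toy problem first. The sketch below swaps PyTorch autograd for a central finite-difference check and uses a single `tanh` neuron with squared error rather than the makemore MLP; both substitutions are assumptions made to keep the example dependency-free.

```python
import math


def forward(w, b, x, y):
    """Squared error of a single tanh neuron: (tanh(w*x + b) - y)^2."""
    h = math.tanh(w * x + b)
    return (h - y) ** 2


def manual_grads(w, b, x, y):
    """Gradients derived by hand with the chain rule."""
    h = math.tanh(w * x + b)
    dloss_dh = 2 * (h - y)      # derivative of (h - y)^2 with respect to h
    dh_dz = 1 - h ** 2          # derivative of tanh(z) with respect to z
    dz = dloss_dh * dh_dz
    return dz * x, dz           # dz/dw = x, dz/db = 1


def numeric_grads(w, b, x, y, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    dw = (forward(w + eps, b, x, y) - forward(w - eps, b, x, y)) / (2 * eps)
    db = (forward(w, b + eps, x, y) - forward(w, b - eps, x, y)) / (2 * eps)
    return dw, db


manual = manual_grads(0.5, -0.1, 2.0, 1.0)
numeric = numeric_grads(0.5, -0.1, 2.0, 1.0)
```

If the two results disagree by more than a small tolerance, the hand derivation has a bug; the same comparison applies when the reference comes from autograd instead.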
195
+
196
+ ---
197
+
+ ### Week 5: The Transformer - Theory & Paper Reading
199
+
200
+ #### Study Material
201
+
202
+ | Resource | Topic | Time | Link |
203
+ |----------|-------|------|------|
204
+ | **Paper: "Attention Is All You Need"** | The original Transformer paper | ~3 hrs (multiple readings) | [arxiv:1706.03762](https://arxiv.org/abs/1706.03762) |
205
+ | 3Blue1Brown: Attention in Transformers | Visual explanation of QKV, multi-head attention | ~45 min | [YouTube](https://www.youtube.com/watch?v=eMlx5fFNoYc) |
206
+ | Jay Alammar: The Illustrated Transformer | Step-by-step visual walkthrough | ~1.5 hrs | [jalammar.github.io](https://jalammar.github.io/illustrated-transformer/) |
207
+ | Lilian Weng: "Attention? Attention!" | Comprehensive attention mechanisms survey | ~2 hrs | [lilianweng.github.io](https://lilianweng.github.io/posts/2018-06-24-attention/) |
208
+
209
+ #### Key Sections of "Attention Is All You Need" to Study
210
+
+ | Section | What to Learn | Core Formula |
+ |---------|---------------|--------------|
+ | §3.1 Encoder & Decoder Stacks | N=6 layers, residual + LayerNorm | `LayerNorm(x + Sublayer(x))` |
+ | §3.2.1 Scaled Dot-Product Attention | The heart of everything | `Attention(Q,K,V) = softmax(QK^T / √d_k) V` |
+ | §3.2.2 Multi-Head Attention | Parallel attention in subspaces | `MultiHead = Concat(head_1..h) W^O` |
+ | §3.3 Position-wise FFN | Two-layer MLP per position | `FFN(x) = max(0, xW₁+b₁)W₂+b₂` |
+ | §3.5 Positional Encoding | Sinusoidal position signals | `PE(pos,2i) = sin(pos/10000^(2i/d))` |
+ | §4 Why Self-Attention | O(1) sequential ops vs O(n) for RNN | Complexity comparison table |
+ | §5.3 Optimizer | Learning rate warmup schedule | `lr = d^(-0.5) · min(step^(-0.5), step·warmup^(-1.5))` |
220
+
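To make the §3.2.1 formula concrete, here is a small pure-Python sketch of `softmax(QK^T / √d_k) V` for a single head, with an optional causal mask. Matrices are lists of rows, and the `causal` flag and helper names are assumptions of this sketch; a real implementation would use batched tensor operations.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Scaled dot-product attention for one head; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Similarity of query i with every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Position i may only attend to positions j <= i
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Output i is the attention-weighted average of the value vectors
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out


Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
masked = attention(Q, K, V, causal=True)
unmasked = attention(Q, K, V)
```

With the causal mask, the first position can only see itself, so its output is exactly the first value vector; without the mask it becomes a blend of both.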
221
+ #### Key Concepts to Master
222
+
+ - [ ] Query, Key, Value: what each represents intuitively
+ - [ ] Scaled dot-product attention: why we scale by `√d_k`
+ - [ ] Multi-head attention: why multiple heads are better than one
+ - [ ] Causal masking: preventing the model from "seeing the future"
+ - [ ] Positional encoding: how the model knows word order
+ - [ ] Layer normalization: stabilizing training in transformers
+ - [ ] Encoder vs. Decoder vs. Encoder-Decoder architectures
+ - [ ] Residual connections: enabling gradient flow in deep networks
+
+ #### 📝 Exercise
233
+ Draw the complete Transformer architecture from memory, labeling every component with its dimensions (for d_model=512, h=8, d_k=64, d_ff=2048, N=6).
234
+
235
+ ---
236
+
237
+ ### Week 6: Build GPT from Scratch
238
+
239
+ > **This is the most important week in the entire plan.**
240
+
241
+ #### Study Material
242
+
243
+ | Resource | Topic | Time | Link |
244
+ |----------|-------|------|------|
245
+ | **Karpathy: "Let's Build GPT: from scratch, in code, spelled out"** | Complete GPT implementation in PyTorch | ~2 hrs | [YouTube](https://youtu.be/kCc8FmEb1nY) |
246
+ | nanoGPT `model.py` | Production-quality GPT implementation (~300 lines) | ~3 hrs (read line by line) | [GitHub](https://github.com/karpathy/nanoGPT/blob/master/model.py) |
247
+ | nanoGPT `train.py` | Training loop with DDP, AMP, gradient accumulation | ~2 hrs | [GitHub](https://github.com/karpathy/nanoGPT/blob/master/train.py) |
248
+
249
+ #### What You'll Build (following the video)
250
+
+ ```
+ Input Text
+     ↓
+ [Token Embedding] + [Position Embedding]
+     ↓
+ ┌─────────────────────────────────┐
+ │  Transformer Block (×N)         │
+ │  ├── LayerNorm                  │
+ │  ├── Multi-Head Self-Attention  │
+ │  │   ├── Q = x @ W_q            │
+ │  │   ├── K = x @ W_k            │
+ │  │   ├── V = x @ W_v            │
+ │  │   ├── Causal Mask            │
+ │  │   └── softmax(QK^T/√d) @ V   │
+ │  ├── Residual Connection        │
+ │  ├── LayerNorm                  │
+ │  ├── Feed-Forward (MLP)         │
+ │  │   └── Linear → GELU → Linear │
+ │  └── Residual Connection        │
+ └─────────────────────────────────┘
+     ↓
+ [LayerNorm]
+     ↓
+ [Linear → Logits]
+     ↓
+ [Softmax → Next Token Probabilities]
+ ```
278
+
279
+ #### Key Concepts to Master
280
+
+ - [ ] Token embedding table: mapping integers to vectors
+ - [ ] Position embedding: learned (GPT-2 style) vs sinusoidal
+ - [ ] Self-attention with causal mask: the `tril` trick
+ - [ ] Multi-head attention: splitting d_model into h heads
+ - [ ] Feed-forward network: expanding and contracting dimensions
+ - [ ] Residual stream: the "highway" through the network
+ - [ ] Pre-norm vs post-norm: GPT-2 uses pre-norm
+ - [ ] Weight tying: sharing embedding and output projection weights
+ - [ ] Temperature and top-k sampling: controlling text generation
290
+
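The last concept in the list, temperature and top-k sampling, fits in a few lines. The sketch below works on a plain list of logits; the function name and the `rng` parameter are made up for this illustration, and ties at the top-k cutoff are kept rather than broken.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token index from logits after temperature scaling and optional top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Lower temperature sharpens the distribution, higher temperature flattens it
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # Keep only the k largest logits (ties at the cutoff are kept too)
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs)[0]


greedy_like = sample_next_token([0.1, 2.0, -1.0], top_k=1, rng=random.Random(0))
```

Setting `top_k=1` reduces sampling to picking the most likely token, which is a handy sanity check before experimenting with larger values.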
+ #### 🔨 Project: Your Own nanoGPT
292
+ 1. Watch the video, pause, and implement each component yourself
293
+ 2. Train a character-level GPT on Shakespeare (~1M characters)
294
+ 3. Generate text and observe how quality improves with training
295
+ 4. Experiment: change number of heads, layers, embedding dimension
296
+ 5. **Deliverable:** A working GPT that generates coherent Shakespeare text
297
+
298
+ ---
299
+
300
+ ## Phase 3: Language Modeling & Pretraining (Weeks 7–9)
301
+
302
+ ### 🎯 Goal
303
+ Understand tokenization, scaling laws, and the full pretraining pipeline.
304
+
305
+ ---
306
+
+ ### Week 7: Tokenization - The Unsung Hero
308
+
309
+ #### Study Material
310
+
311
+ | Resource | Topic | Time | Link |
312
+ |----------|-------|------|------|
313
+ | Karpathy: "Let's build the GPT Tokenizer" | BPE from scratch, tiktoken, sentencepiece | ~2 hrs | [YouTube](https://www.youtube.com/watch?v=zduSFxRajkE) |
+ | HF NLP Course: Chapter 6 - Tokenizers | BPE, WordPiece, Unigram, `🤗 Tokenizers` library | ~3 hrs | [HF Course Ch.6](https://huggingface.co/learn/nlp-course/chapter6/1) |
315
+ | **Paper: Sennrich et al. (2016)** | Original BPE for NMT | ~1 hr | [arxiv:1508.07909](https://arxiv.org/abs/1508.07909) |
316
+
317
+ #### Key Concepts to Master
318
+
+ - [ ] Why tokenization matters: it defines what the model "sees"
+ - [ ] Character-level vs word-level vs subword tokenization
+ - [ ] Byte Pair Encoding (BPE): the greedy merge algorithm
+ - [ ] WordPiece: BPE variant used by BERT
+ - [ ] Unigram: probabilistic tokenizer used by T5
+ - [ ] SentencePiece: language-agnostic tokenization
+ - [ ] Special tokens: `<bos>`, `<eos>`, `<pad>`, `<unk>`
+ - [ ] Vocabulary size tradeoffs: a larger vocab gives shorter sequences but a bigger embedding table
+ - [ ] The `🤗 Tokenizers` library: training custom tokenizers at Rust speed
+
+ #### 🔨 Project: Train Your Own BPE Tokenizer
+ 1. Implement BPE from scratch in Python (follow Karpathy)
+ 2. Train a `🤗 Tokenizers` BPE tokenizer on a custom corpus
+ 3. Compare vocabulary sizes (1K, 10K, 50K) and observe the resulting token counts
+ 4. Decode some tokens and study what subwords the tokenizer learned
334
+
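For step 1 of the project, the greedy merge loop at the heart of BPE can be sketched as follows. This is a byte-level variant written for illustration (starting from the 256 byte values and assigning new ids from 256 upward); the function names are chosen for this sketch and it leaves out everything a production tokenizer adds, such as pre-tokenization and special tokens.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count every adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`, scanning left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    ids = list(text.encode("utf-8"))   # start from raw bytes (ids 0..255)
    merges = {}
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


ids, merges = train_bpe("aaabdaaabac", 3)
```

On this 11-byte string, three merges shrink the sequence to 5 tokens, which shows the tradeoff from the list above in miniature: each merge adds one vocabulary entry and shortens the encoded text.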
335
+ ---
336
+
337
+ ### Week 8: Scaling Laws & Pretraining Concepts
338
+
339
+ #### Study Material
340
+
341
+ | Resource | Topic | Time | Link |
342
+ |----------|-------|------|------|
343
+ | **Paper: "Scaling Laws for Neural Language Models" (Kaplan et al.)** | How loss scales with model size, data, compute | ~2 hrs | [arxiv:2001.08361](https://arxiv.org/abs/2001.08361) |
344
+ | **Paper: "Training Compute-Optimal LLMs" (Chinchilla)** | Optimal ratio of model size to training tokens | ~2 hrs | [arxiv:2203.15556](https://arxiv.org/abs/2203.15556) |
345
+ | **Paper: "Language Models are Few-Shot Learners" (GPT-3)** | In-context learning, few-shot prompting | ~3 hrs | [arxiv:2005.14165](https://arxiv.org/abs/2005.14165) |
+ | Karpathy: "Let's reproduce GPT-2 (124M)" | Full pretraining pipeline on FineWeb-Edu | ~4 hrs | [YouTube](https://www.youtube.com/watch?v=l8pRSuU81PU) |
347
+
348
+ #### Key Concepts to Master
349
+
+ - [ ] Scaling laws: `L(N) ∝ N^(-α)`, i.e. loss decreases as a power law of model size
+ - [ ] Chinchilla-optimal: train N parameters on ~20N tokens
+ - [ ] Compute-optimal training: how to budget FLOPs between model size and data
+ - [ ] In-context learning: how large models learn from examples in the prompt
+ - [ ] Emergent abilities: capabilities that appear only at scale
+ - [ ] Data quality vs quantity: why curated data beats raw web scrapes
+ - [ ] Mixed-precision training (fp16/bf16): training faster with lower precision
+ - [ ] Gradient accumulation: simulating large batches on small GPUs
+ - [ ] Distributed Data Parallel (DDP): training across multiple GPUs
+
+ #### 📝 Exercise
+ Calculate:
+ 1. How many FLOPs to train a 1B parameter model on 20B tokens?
+    - Rule of thumb: `FLOPs ≈ 6 × N × D` (N=params, D=tokens)
+ 2. If you have 8× A100 GPUs at 312 TFLOPS (peak) each, how long would training take?
+ 3. What's the Chinchilla-optimal model size for a 1T token dataset?
366
+
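Once you have worked the exercise by hand, a short script like the one below can be used to check the arithmetic. It assumes the `6 × N × D` rule of thumb and the ~20 tokens-per-parameter heuristic from this section, and it treats the 312 TFLOPS figure as fully achieved unless you pass a lower `utilization`; real runs typically reach only a fraction of peak.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


def training_seconds(flops, n_gpus, peak_flops_per_gpu, utilization=1.0):
    """Wall-clock time if the GPUs sustain `utilization` of their peak throughput."""
    return flops / (n_gpus * peak_flops_per_gpu * utilization)


def chinchilla_params(n_tokens, tokens_per_param=20):
    """Model size for which `n_tokens` is roughly compute-optimal under the ~20x heuristic."""
    return n_tokens / tokens_per_param


flops = training_flops(1e9, 20e9)                    # 1.2e20 FLOPs
hours = training_seconds(flops, 8, 312e12) / 3600    # a little over 13 hours at 100% utilization
optimal_n = chinchilla_params(1e12)                  # 5e10, i.e. about 50B parameters
```

Passing something like `utilization=0.4` gives a more realistic estimate, since the ideal figure ignores communication, data loading, and kernels that do not hit peak throughput.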
367
+ ---
368
+
369
+ ### Week 9: Pretraining a Small Model End-to-End
370
+
371
+ #### Study Material
372
+
373
+ | Resource | Topic | Time | Link |
374
+ |----------|-------|------|------|
+ | HF NLP Course: Chapter 7 - Main NLP Tasks | Causal LM pretraining with HF Transformers | ~4 hrs | [HF Course Ch.7](https://huggingface.co/learn/nlp-course/chapter7/1) |
376
+ | nanoGPT README & `data/openwebtext/prepare.py` | Data preparation for pretraining | ~1 hr | [GitHub](https://github.com/karpathy/nanoGPT) |
377
+ | **Paper: SmolLM2 (Allal et al.)** | How to train a good small LM | ~2 hrs | [arxiv:2502.02737](https://arxiv.org/abs/2502.02737) |
378
+
379
+ #### Key Concepts to Master
380
+
+ - [ ] Dataset preparation: downloading, cleaning, tokenizing, chunking
+ - [ ] Data loading: efficient batching, shuffling, packing sequences
+ - [ ] Learning rate schedule: warmup + cosine decay
+ - [ ] Loss curves: what a healthy training run looks like
+ - [ ] Evaluation: perplexity, held-out validation loss
+ - [ ] Checkpointing: saving model state during training
+ - [ ] The `Trainer` API: HF's high-level training abstraction
388
+
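Two of the concepts above reduce to short formulas: the warmup + cosine decay schedule and perplexity as the exponential of the mean negative log-likelihood. The sketch below shows one common form of each; the parameter names (`max_lr`, `min_lr`, `warmup_steps`, `total_steps`) are chosen for this illustration, and the loss is assumed to be measured in nats per token.

```python
import math


def lr_at(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))   # goes from 1 down to 0
    return min_lr + cosine * (max_lr - min_lr)


def perplexity(mean_nll):
    """Perplexity is exp of the average per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


schedule = [lr_at(s, 1e-3, 1e-4, 100, 1000) for s in range(1000)]
```

A validation loss of about 3.0 nats per token corresponds to a perplexity of roughly 20, meaning the model is on average about as uncertain as a uniform choice among 20 tokens.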
+ #### 🔨 Project: Pretrain a 10M–50M Parameter GPT
+ Using nanoGPT or HF Transformers:
+ 1. Prepare a dataset (TinyStories, OpenWebText subset, or custom)
+ 2. Configure a small GPT (6–12 layers, 256–512 dim, 4–8 heads)
+ 3. Train for ~10K steps on a single GPU
+ 4. Plot the loss curve, compute perplexity
+ 5. Generate text samples at different checkpoints and watch quality improve
396
+ 6. Push the model to [Hugging Face Hub](https://huggingface.co)
397
+
398
+ ---
399
+
400
+ ## Phase 4: The Hugging Face Ecosystem (Weeks 10–12)
401
+
402
+ ### 🎯 Goal
403
+ Master the tools you'll use daily: Transformers, Datasets, Hub, and Gradio.
404
+
405
+ ---
406
+
407
+ ### Week 10: Transformers Library & the Hub
408
+
409
+ #### Study Material
410
+
+ | Resource | Topic | Time | Link |
+ |----------|-------|------|------|
+ | HF NLP Course: Chapter 1 - Transformer Models | `pipeline()`, model architectures, use cases | ~2 hrs | [HF Course Ch.1](https://huggingface.co/learn/nlp-course/chapter1/1) |
+ | HF NLP Course: Chapter 2 - Using Transformers | `AutoModel`, `AutoTokenizer`, batching | ~3 hrs | [HF Course Ch.2](https://huggingface.co/learn/nlp-course/chapter2/1) |
+ | HF NLP Course: Chapter 4 - Sharing on the Hub | Model cards, `push_to_hub()`, versioning | ~1.5 hrs | [HF Course Ch.4](https://huggingface.co/learn/nlp-course/chapter4/1) |
416
+
417
+ #### Key Concepts to Master
418
+
+ - [ ] `pipeline()`: high-level inference in one line
+ - [ ] `AutoModel.from_pretrained()`: loading any model architecture
+ - [ ] `AutoTokenizer.from_pretrained()`: loading the matching tokenizer
+ - [ ] Encoder models (BERT) vs Decoder models (GPT) vs Encoder-Decoder (T5)
+ - [ ] Model cards: documenting your models for others
+ - [ ] The Hub: versioning, repos, community features
+
+ #### 🔨 Mini-Project
427
+ Load 5 different models from the Hub and run inference:
428
+ 1. Text generation (GPT-2 / SmolLM2)
429
+ 2. Text classification (BERT / DistilBERT)
430
+ 3. Summarization (T5 / BART)
431
+ 4. Translation (MarianMT)
432
+ 5. Fill-mask (BERT)
433
+
434
+ ---
435
+
436
+ ### Week 11: Datasets, Fine-Tuning & Debugging
437
+
438
+ #### Study Material
439
+
+ | Resource | Topic | Time | Link |
+ |----------|-------|------|------|
+ | HF NLP Course: Chapter 3 - Fine-Tuning | `Trainer` API, Accelerate, custom training loops | ~4 hrs | [HF Course Ch.3](https://huggingface.co/learn/nlp-course/chapter3/1) |
+ | HF NLP Course: Chapter 5 - Datasets Library | Loading, streaming, Arrow format, processing | ~3 hrs | [HF Course Ch.5](https://huggingface.co/learn/nlp-course/chapter5/1) |
+ | HF NLP Course: Chapter 8 - Debugging | Common training failures, community help | ~1.5 hrs | [HF Course Ch.8](https://huggingface.co/learn/nlp-course/chapter8/1) |
445
+
446
+ #### Key Concepts to Master
447
+
+ - [ ] `datasets.load_dataset()`: loading from Hub, local files, streaming
+ - [ ] `dataset.map()`: applying transformations efficiently
+ - [ ] Apache Arrow format: why HF datasets are fast
+ - [ ] `Trainer` API: `TrainingArguments`, callbacks, logging
+ - [ ] Custom training loop with `Accelerate`
+ - [ ] Common errors: shape mismatches, tokenizer/model mismatches, OOM
+ - [ ] Gradient checkpointing: trading compute for memory
+
+ #### 🔨 Project: Fine-Tune a Text Classifier
457
+ 1. Load a sentiment dataset (e.g., `stanfordnlp/imdb`)
458
+ 2. Tokenize and prepare data with `datasets`
459
+ 3. Fine-tune `distilbert-base-uncased` with `Trainer`
460
+ 4. Evaluate accuracy on test set
461
+ 5. Push the fine-tuned model to the Hub
462
+ 6. Build a Gradio demo (preview of next week)
463
+
464
+ ---
465
+
466
+ ### Week 12: Demos, Data Annotation & the Full Workflow
467
+
468
+ #### Study Material
469
+
470
+ | Resource | Topic | Time | Link |
471
+ |----------|-------|------|------|
+ | HF NLP Course: Chapter 9 - Gradio | Building interactive ML demos | ~2 hrs | [HF Course Ch.9](https://huggingface.co/learn/nlp-course/chapter9/1) |
+ | HF NLP Course: Chapter 10 - Data Annotation | Argilla for dataset curation | ~2 hrs | [HF Course Ch.10](https://huggingface.co/learn/nlp-course/chapter10/1) |
474
+ | HF Spaces Documentation | Deploying models as web apps | ~1 hr | [HF Spaces](https://huggingface.co/docs/hub/spaces) |
475
+
476
+ #### Key Concepts to Master
477
+
+ - [ ] `gr.Interface()`: creating demos with Gradio
+ - [ ] `gr.Blocks()`: advanced layouts and interactivity
+ - [ ] Hugging Face Spaces: deploying demos for free
+ - [ ] Data annotation with Argilla: creating high-quality datasets
+ - [ ] Human-in-the-loop workflows: iterating on data quality
+
+ #### 🔨 Project: Deploy a Model on HF Spaces
+ 1. Take your fine-tuned classifier from Week 11
+ 2. Build a Gradio app with text input → sentiment prediction
487
+ 3. Deploy on Hugging Face Spaces
488
+ 4. Share the link and get feedback
489
+
490
+ ---
491
+
492
+ ## Phase 5: Fine-Tuning & Alignment (Weeks 13–16)
493
+
494
+ ### 🎯 Goal
495
+ Learn the modern LLM training stack: SFT, LoRA, DPO, RLHF, and GRPO.
496
+
497
+ ---
498
+
499
+ ### Week 13: Supervised Fine-Tuning (SFT) & Chat Models
500
+
501
+ #### Study Material
502
+
503
+ | Resource | Topic | Time | Link |
504
+ |----------|-------|------|------|
+ | HF NLP Course: Chapter 11 - SFT | Chat templates, `SFTTrainer`, evaluation | ~4 hrs | [HF Course Ch.11](https://huggingface.co/learn/nlp-course/chapter11/1) |
+ | smol-course: Module 1 - Instruction Tuning | Hands-on SFT with SmolLM2 | ~3 hrs | [GitHub](https://github.com/huggingface/smol-course) |
507
+ | **Paper: "Training language models to follow instructions" (InstructGPT)** | The paper that started it all (OpenAI, 2022) | ~2 hrs | [arxiv:2203.02155](https://arxiv.org/abs/2203.02155) |
508
+
509
+ #### Key Concepts to Master
510
+
+ - [ ] Chat templates (ChatML format): `system`, `user`, `assistant` roles
+ - [ ] The `messages` format: structured conversation data
+ - [ ] `SFTTrainer` from TRL: the standard fine-tuning trainer
+ - [ ] Dataset preparation: converting raw text to ChatML format
+ - [ ] Packing: fitting multiple short examples in one sequence
+ - [ ] Evaluation: loss curves, manual quality checks, benchmarks
+ - [ ] Base model → Instruct model transformation
518
+
519
+ #### Dataset Format for SFT
+ ```json
+ {
+   "messages": [
+     {"role": "system", "content": "You are a helpful assistant."},
+     {"role": "user", "content": "What is the capital of France?"},
+     {"role": "assistant", "content": "The capital of France is Paris."}
+   ]
+ }
+ ```
529
+
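To see what applying a chat template does to a `messages` record like the one above, here is a hand-rolled sketch of the ChatML layout. The function name and the `add_generation_prompt` flag are assumptions of this sketch; in practice each model ships its own template and you would call the tokenizer's `apply_chat_template` method rather than format strings by hand.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts in the ChatML text layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}
text = to_chatml(example["messages"])
prompt = to_chatml(example["messages"][:2], add_generation_prompt=True)
```

During SFT the full rendered conversation is the training text, while at inference time the prompt ends with an open assistant turn so the model generates the reply.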
+ #### 🔨 Project: Fine-Tune SmolLM2 into a Chat Model
531
+ 1. Load [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M)
532
+ 2. Prepare data from [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk)
533
+ 3. Apply chat template and train with `SFTTrainer`
534
+ 4. Chat with your model and evaluate quality
535
+ 5. Push to Hub as `your-username/SmolLM2-135M-Chat`
536
+
537
+ ---
538
+
539
+ ### Week 14: Parameter-Efficient Fine-Tuning (LoRA / QLoRA)
540
+
541
+ #### Study Material
542
+
543
+ | Resource | Topic | Time | Link |
544
+ |----------|-------|------|------|
+ | smol-course: Module 3 - PEFT | LoRA, QLoRA, adapter merging | ~3 hrs | [GitHub](https://github.com/huggingface/smol-course) |
+ | **Paper: "LoRA: Low-Rank Adaptation of Large Language Models"** | The LoRA method | ~2 hrs | [arxiv:2106.09685](https://arxiv.org/abs/2106.09685), [HF Paper](https://hf.co/papers/2106.09685) |
547
+ | HF PEFT Documentation | `LoraConfig`, `get_peft_model`, merging | ~2 hrs | [HF PEFT Docs](https://huggingface.co/docs/peft) |
548
+
549
+ #### Key Concepts to Master
550
+
+ - [ ] Why full fine-tuning is expensive: every parameter needs gradients + optimizer states
+ - [ ] Low-rank decomposition: `W + ΔW = W + BA` where B is (d×r) and A is (r×d)
+ - [ ] Rank `r`: the bottleneck dimension (typical: 8, 16, 32, 64)
+ - [ ] `lora_alpha`: scaling factor for LoRA updates
+ - [ ] Target modules: which layers to add LoRA to (`q_proj`, `v_proj`, `k_proj`, etc.)
+ - [ ] QLoRA: LoRA on 4-bit quantized models (fits 7B models on consumer GPUs!)
+ - [ ] Adapter merging: combining LoRA weights back into the base model
+ - [ ] Memory savings: LoRA trains ~0.1–1% of parameters
559
+
560
+ #### The Math of LoRA
+ ```
+ Original: h = Wx        (d × d matrix, d² parameters)
+ LoRA:     h = Wx + BAx  (B: d×r, A: r×d, only 2dr parameters)
+ With r=16, d=4096: 131K vs 16.7M params (128× reduction)
+ ```
566
+
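The parameter arithmetic and the forward pass above can be checked with a few lines of plain Python. This sketch includes the `alpha / r` scaling that the LoRA paper applies to the low-rank update, which the simplified formula in this section leaves out; the helper names are made up for the illustration, and matrices are lists of rows.

```python
def lora_param_counts(d, r):
    """Trainable parameters for a full d x d update versus a rank-r LoRA update."""
    return d * d, 2 * d * r


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """h = Wx + (alpha / r) * B(Ax): frozen base output plus scaled low-rank update."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))   # project down to r dims, then back up to d
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


full, lora = lora_param_counts(4096, 16)   # 16_777_216 vs 131_072
ratio = full // lora                        # 128

# Tiny d=2, r=1 example: with B initialised to zeros the adapter changes nothing
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]
B_zero = [[0.0], [0.0]]
h = lora_forward(W, A, B_zero, [1.0, 1.0], alpha=16, r=1)
```

Initialising `B` to zeros is why a freshly added adapter leaves the base model's outputs unchanged at the start of training.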
+ #### 🔨 Project: QLoRA Fine-Tune a 1.7B Model on Consumer GPU
568
+ 1. Load [HuggingFaceTB/SmolLM2-1.7B](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B) in 4-bit
569
+ 2. Add LoRA adapters with `peft`
570
+ 3. Train on a domain-specific dataset (e.g., coding, medical, legal)
571
+ 4. Compare generation quality: base vs LoRA-tuned
572
+ 5. Merge adapters and push the merged model to Hub
573
+
574
+ ---
575
+
+ ### Week 15: Preference Alignment - DPO & RLHF
577
+
578
+ #### Study Material
579
+
580
+ | Resource | Topic | Time | Link |
581
+ |----------|-------|------|------|
+ | smol-course: Module 2 - Preference Alignment | DPO training hands-on | ~3 hrs | [GitHub](https://github.com/huggingface/smol-course) |
+ | **Paper: "Direct Preference Optimization" (Rafailov et al.)** | DPO: RLHF without an explicit reward model | ~2 hrs | [arxiv:2305.18290](https://arxiv.org/abs/2305.18290) |
584
+ | **Paper: "Training language models to follow instructions" (InstructGPT)** | Original RLHF pipeline | ~2 hrs | [arxiv:2203.02155](https://arxiv.org/abs/2203.02155) |
585
+ | TRL DPO Documentation | `DPOTrainer`, `DPOConfig` | ~1 hr | [HF TRL Docs](https://huggingface.co/docs/trl/dpo_trainer) |
586
+
587
+ #### Key Concepts to Master
588
+
589
+ - [ ] The alignment problem: why SFT alone isn't enough
590
+ - [ ] Human preferences: "which response is better?" → preference data
591
+ - [ ] RLHF pipeline: SFT → Reward Model → PPO optimization
592
+ - [ ] DPO: skips training an explicit reward model and optimizes the policy on preference pairs directly
593
+ - [ ] The DPO loss: `L_DPO = -log σ(β · (log[π(y_w|x)/π_ref(y_w|x)] - log[π(y_l|x)/π_ref(y_l|x)]))`, where `y_w` is the chosen and `y_l` the rejected response
594
+ - [ ] `β` parameter: controls deviation from the reference policy (larger `β` keeps the policy closer to the reference)
595
+ - [ ] Preference dataset format: `prompt`, `chosen`, `rejected` columns
596
+ - [ ] KL divergence: preventing the model from straying too far from the reference model
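To make the DPO loss concrete, here is a minimal sketch for a single preference pair. It assumes you already have the summed log-probabilities of the chosen and rejected responses under the policy and under the frozen reference model; the function names and the example numbers are illustrative, not taken from TRL.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probs."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref for y_w
    rejected_logratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref for y_l
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(margin) equals softplus(-margin)
    return softplus(-margin)


# Policy identical to the reference: the margin is 0 and the loss is log 2.
print(round(dpo_loss(-5.0, -7.0, -5.0, -7.0), 4))  # 0.6931

# Policy has shifted probability toward the chosen response: the loss drops.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0) < math.log(2))  # True
```

Note that only the difference of log-ratios matters, so the loss rewards widening the gap between chosen and rejected relative to the reference, not raising the chosen log-probability in absolute terms.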
597
+
598
+ #### Dataset Format for DPO
599
+ ```json
600
+ {
601
+ "prompt": "Explain quantum computing simply.",
602
+ "chosen": "Quantum computing uses qubits that can be 0, 1, or both at once...",
603
+ "rejected": "Quantum computing is a type of computing that uses quantum mechanics..."
604
+ }
605
+ ```
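Before training, it can help to sanity-check each record against this format. The sketch below assumes plain-string fields as in the example above (conversational datasets that store message lists would need a different check); `validate_preference_record` is a hypothetical helper, not part of any library.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_preference_record(record):
    """Return a list of problems found in one preference record (empty if OK)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    # A pair with identical responses carries no preference signal.
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems


line = json.dumps({
    "prompt": "Explain quantum computing simply.",
    "chosen": "Quantum computing uses qubits that can be 0, 1, or both at once...",
    "rejected": "Quantum computing is a type of computing that uses quantum mechanics...",
})
print(validate_preference_record(json.loads(line)))        # []
print(validate_preference_record({"prompt": "Hi there"}))  # two missing-field problems
```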
606
+
607
+ #### 🔨 Project: Align a Model with DPO
608
+ 1. Start from your SFT model (Week 13)
609
+ 2. Load a preference dataset
610
+ 3. Train with `DPOTrainer`
611
+ 4. Compare: base → SFT → DPO outputs side by side
612
+ 5. Evaluate helpfulness and safety improvements
613
+
614
+ ---
615
+
616
+ ### Week 16: Reasoning Models (GRPO & Open R1)
617
+
618
+ #### Study Material
619
+
620
+ | Resource | Topic | Time | Link |
621
+ |----------|-------|------|------|
622
+ | HF NLP Course: Chapter 12 (Reasoning Models) | GRPO, DeepSeek R1, Open R1 project | ~4 hrs | [HF Course Ch.12](https://huggingface.co/learn/nlp-course/chapter12/1) |
623
+ | **Paper: "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning"** | The breakthrough reasoning model | ~3 hrs | [arxiv:2501.12948](https://arxiv.org/abs/2501.12948), [HF Paper](https://hf.co/papers/2501.12948) |
624
+ | Deep RL Course: Units 1–4 | RL fundamentals, policy gradients, PPO | ~6 hrs | [HF Deep RL Course](https://huggingface.co/learn/deep-rl-course/unit0/introduction) |
625
+
626
+ #### Key Concepts to Master
627
+
628
+ - [ ] Why RL for reasoning: SFT on reasoning traces vs discovering reasoning via RL
629
+ - [ ] DeepSeek-R1-Zero: reasoning emerges purely from RL (no SFT)
630
+ - [ ] Group Relative Policy Optimization (GRPO): each response's reward is normalized against the mean and standard deviation of its group
631
+ - [ ] Reward functions: correctness checkers, format validators
632
+ - [ ] Chain-of-thought: step-by-step reasoning in `<think>` tags
633
+ - [ ] Cold-start data: bootstrapping reasoning with SFT before RL
634
+ - [ ] Multi-stage training: SFT → RL (reasoning) → SFT (readability) → RL (all tasks)
635
+ - [ ] `GRPOTrainer` from TRL: practical implementation
636
+
637
+ #### The GRPO Algorithm (Simplified)
638
+ ```
639
+ For each prompt x:
640
+ 1. Generate K responses: {y₁, y₂, ..., yₖ}
641
+ 2. Score each: {r₁, r₂, ..., rₖ}
642
+ 3. Compute group advantage: Aᵢ = (rᵢ - mean(r)) / std(r)
643
+ 4. Update policy to increase probability of high-advantage responses
644
+ 5. Apply KL penalty to stay close to reference model
645
+ ```
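Step 3 is the part that distinguishes GRPO from PPO-style methods with a learned value function, and it fits in a few lines. This is a sketch under two assumptions that are not in the text above: the population standard deviation is used, and a small `eps` guards against division by zero when every response in the group gets the same reward.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt: two correct (reward 1), two wrong (reward 0).
print([round(a, 3) for a in group_advantages([1.0, 0.0, 0.0, 1.0])])
# [1.0, -1.0, -1.0, 1.0]

# If all answers score the same, no response is preferred and advantages vanish.
print(group_advantages([1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0]
```

The second case shows why prompts that are too easy or too hard contribute no learning signal: the group gives nothing to compare against.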
646
+
647
+ #### 🔨 Project: Train a Math Reasoning Model with GRPO
648
+ 1. Start from a small instruct model
649
+ 2. Prepare a math dataset with verifiable answers (GSM8K format)
650
+ 3. Define reward functions (correctness + format)
651
+ 4. Train with `GRPOTrainer`
652
+ 5. Evaluate on held-out math problems
653
+ 6. Observe the model learn to use `<think>` reasoning
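Step 3 of the project (reward functions) might look roughly like the sketch below. The function names, the exact `<think>` layout, and the "last number in the answer" heuristic are assumptions made for illustration; the reference answer is assumed to be an already-extracted numeric string. Wrapping these per-sample functions into whatever batch signature your trainer expects is not shown.

```python
import re

# Expected layout (an assumption): <think>reasoning</think> followed by the final answer.
THINK_PATTERN = re.compile(r"^<think>.+?</think>\s*(?P<answer>.+)$", re.DOTALL)


def format_reward(completion):
    """1.0 if the completion follows the <think>...</think> + answer layout."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def extract_final_number(text):
    """Return the last number in the text as a string, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None


def correctness_reward(completion, reference_answer):
    """1.0 if the final number in the answer part equals the reference."""
    match = THINK_PATTERN.match(completion.strip())
    answer_text = match.group("answer") if match else completion
    predicted = extract_final_number(answer_text)
    if predicted is None:
        return 0.0
    return 1.0 if float(predicted) == float(reference_answer) else 0.0


good = "<think>48 / 2 = 24, so 48 + 24 = 72</think>\nThe answer is 72."
print(format_reward(good), correctness_reward(good, "72"))        # 1.0 1.0
print(format_reward("The answer is 72."))                         # 0.0
print(correctness_reward("<think>guess</think> 70", "72"))        # 0.0
```

Keeping the two rewards separate makes it easy to see, during training, whether the model is learning the format, the math, or both.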
654
+
655
+ ---
656
+
657
+ ## Phase 6: Advanced Topics & Capstone (Weeks 17–20)
658
+
659
+ ### 🎯 Goal
660
+ Evaluation, synthetic data, deployment, agents, and a capstone project.
661
+
662
+ ---
663
+
664
+ ### Week 17: Evaluation & Benchmarks
665
+
666
+ #### Study Material
667
+
668
+ | Resource | Topic | Time | Link |
669
+ |----------|-------|------|------|
670
+ | smol-course: Module 4 (Evaluation) | LLM evaluation, `lighteval` | ~3 hrs | [GitHub](https://github.com/huggingface/smol-course) |
671
+ | HF `lighteval` Documentation | Running standardized benchmarks | ~2 hrs | [HF lighteval Docs](https://huggingface.co/docs/lighteval) |
672
+ | Open LLM Leaderboard | Understanding LLM benchmarks | ~1 hr | [HF Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) |
673
+
674
+ #### Key Concepts to Master
675
+
676
+ - [ ] Perplexity: `exp(average cross-entropy loss)`, lower is better
677
+ - [ ] MMLU: Massive Multitask Language Understanding (knowledge test)
678
+ - [ ] GSM8K: Grade School Math 8K (reasoning benchmark)
679
+ - [ ] HumanEval: coding benchmark (pass@k)
680
+ - [ ] TruthfulQA: measuring hallucination tendencies
681
+ - [ ] LLM-as-judge: using a strong LLM to evaluate a weaker one
682
+ - [ ] Contamination: when evaluation data leaks into training data
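Two of these metrics are simple enough to compute by hand, which is a good way to check what a benchmark tool reports. The sketch below assumes you have per-token natural-log probabilities for perplexity, and uses the commonly cited unbiased pass@k estimator (n samples per problem, c of them correct); the function names are illustrative.

```python
import math


def perplexity(token_logprobs):
    """exp of the average negative log-likelihood per token (natural log)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


def pass_at_k(n, c, k):
    """Probability that at least one of k samples (drawn from n, c correct) passes."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k samples include a pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# A model that assigns probability 0.25 to every token is "choosing among 4 options".
print(round(perplexity([math.log(0.25)] * 4), 6))  # 4.0

# 10 samples per problem, 3 correct.
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
print(round(pass_at_k(10, 3, 5), 3))  # 0.917
```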
683
+
684
+ #### 🔨 Project: Benchmark Your Models
685
+ Run `lighteval` on all models you've trained so far. Create a comparison table:
686
+ | Model | Perplexity | MMLU | GSM8K | Notes |
687
+ |-------|-----------|------|-------|-------|
688
+ | Base SmolLM2 | ... | ... | ... | Pretrained |
689
+ | + SFT | ... | ... | ... | Week 13 |
690
+ | + LoRA | ... | ... | ... | Week 14 |
691
+ | + DPO | ... | ... | ... | Week 15 |
692
+ | + GRPO | ... | ... | ... | Week 16 |
693
+
694
+ ---
695
+
696
+ ### Week 18: Synthetic Data & Inference Optimization
697
+
698
+ #### Study Material
699
+
700
+ | Resource | Topic | Time | Link |
701
+ |----------|-------|------|------|
702
+ | smol-course: Module 6 (Synthetic Data) | Generating training data with LLMs | ~3 hrs | [GitHub](https://github.com/huggingface/smol-course) |
703
+ | smol-course: Module 7 (Inference) | Quantization, vLLM, optimization | ~3 hrs | [GitHub](https://github.com/huggingface/smol-course) |
704
+ | HF `bitsandbytes` Documentation | 4-bit and 8-bit quantization | ~1 hr | [HF BnB Docs](https://huggingface.co/docs/bitsandbytes) |
705
+ | HF `distilabel` Documentation | Synthetic data pipelines | ~2 hrs | [HF Distilabel Docs](https://huggingface.co/docs/distilabel) |
706
+
707
+ #### Key Concepts to Master
708
+
709
+ - [ ] Knowledge distillation: training a small model on a large model's outputs
710
+ - [ ] Synthetic data generation: using LLMs to create training data
711
+ - [ ] Data quality filtering: removing bad synthetic examples
712
+ - [ ] Quantization (INT8, INT4, GPTQ, AWQ): reducing model size
713
+ - [ ] KV-cache: speeding up autoregressive generation
714
+ - [ ] Batched inference: processing multiple requests efficiently
715
+ - [ ] vLLM: high-throughput inference server
716
+ - [ ] Speculative decoding: using a small model to speed up a large model
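To build intuition for quantization, here is a toy symmetric "absmax" INT8 round trip over a flat list of weights. This is a deliberately simplified scheme chosen for illustration; it is not how GPTQ, AWQ, or `bitsandbytes` work internally, and the function names are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 for all-zero input
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)  # [50, -127, 0, 100]

# The round-trip error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(max_error <= scale / 2 + 1e-12)  # True
```

Each weight now needs 8 bits instead of 32, at the cost of a bounded rounding error; a single large outlier inflates `scale` and hurts every other weight, which is one reason practical methods quantize in small blocks.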
717
+
718
+ ---
719
+
720
+ ### Week 19: AI Agents & Tool Use
721
+
722
+ #### Study Material
723
+
724
+ | Resource | Topic | Time | Link |
725
+ |----------|-------|------|------|
726
+ | HF Agents Course: Units 0–2 | Agent fundamentals, frameworks | ~6 hrs | [HF Agents Course](https://huggingface.co/learn/agents-course/unit0/introduction) |
727
+ | smol-course: Module 8 (Agents) | Building agents with smolagents | ~3 hrs | [GitHub](https://github.com/huggingface/smol-course) |
728
+ | HF `smolagents` Documentation | Lightweight agent framework | ~2 hrs | [HF smolagents Docs](https://huggingface.co/docs/smolagents) |
729
+
730
+ #### Key Concepts to Master
731
+
732
+ - [ ] Agent loop: Thought → Action → Observation → repeat
733
+ - [ ] Tool definition: giving LLMs access to functions
734
+ - [ ] Function calling format: how models invoke tools
735
+ - [ ] ReAct pattern: reasoning + acting in alternation
736
+ - [ ] RAG (Retrieval Augmented Generation): grounding in external knowledge
737
+ - [ ] Code agents: generating and executing Python code
738
+ - [ ] Multi-agent systems: agents collaborating or delegating
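The agent loop is easier to reason about once you have seen it without any framework. In the sketch below, `scripted_model` is a hypothetical stand-in for an LLM (it returns hard-coded steps), and the only tool is a small arithmetic evaluator; none of this reflects the smolagents API.

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expression):
    """Tool: evaluate +, -, *, / arithmetic without calling eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))


TOOLS = {"calculator": calculator}


def scripted_model(history):
    """Hypothetical LLM stand-in: decides the next step from the transcript."""
    observations = [h for h in history if h.startswith("Observation: ")]
    if not observations:
        return {"thought": "I should compute this.", "action": "calculator", "input": "17 * 23"}
    return {"thought": "I have the result.",
            "final_answer": observations[-1][len("Observation: "):]}


def run_agent(question, model, max_steps=5):
    """Thought -> Action -> Observation loop until the model gives a final answer."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model(history)
        history.append(f"Thought: {step['thought']}")
        if "final_answer" in step:
            return step["final_answer"], history
        observation = TOOLS[step["action"]](step["input"])
        history.append(f"Action: {step['action']}({step['input']})")
        history.append(f"Observation: {observation}")
    raise RuntimeError("agent did not finish within max_steps")


answer, transcript = run_agent("What is 17 * 23?", scripted_model)
print(answer)  # 391
```

The `max_steps` cap is worth keeping in any real agent: a model that never emits a final answer would otherwise loop forever.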
739
+
740
+ #### 🔨 Project: Build a RAG Agent
741
+ 1. Create a knowledge base from a set of documents
742
+ 2. Build a retrieval tool using sentence embeddings
743
+ 3. Wire up a smolagents agent that answers questions using retrieval
744
+ 4. Deploy as a Gradio app on HF Spaces
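Step 2 (the retrieval tool) can be prototyped before you pick an embedding model. The sketch below swaps in bag-of-words counts for sentence embeddings, purely so it runs without dependencies; a real retriever would replace `embed` with a sentence-embedding model, and the documents here are made-up examples.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in 'embedding': lowercase word counts (NOT a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, top_k=2):
    """Return the top_k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


docs = [
    "LoRA adds low-rank adapter matrices to frozen weights.",
    "DPO optimizes a policy directly from preference pairs.",
    "GRPO normalizes rewards within a group of sampled responses.",
]
print(retrieve("How does DPO use preference pairs?", docs, top_k=1))
# ['DPO optimizes a policy directly from preference pairs.']
```

Word-count matching fails on paraphrases ("likes" vs "prefers"), which is exactly the gap that dense sentence embeddings are meant to close.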
745
+
746
+ ---
747
+
748
+ ### Week 20: Capstone Project 🎓
749
+
750
+ #### Choose One (or combine multiple):
751
+
752
+ **Option A: Train and Deploy a Domain Expert Model**
753
+ 1. Collect/curate a domain-specific dataset (medical, legal, code, science)
754
+ 2. SFT a base model on your dataset
755
+ 3. Apply LoRA for efficient training
756
+ 4. Evaluate on domain-specific benchmarks
757
+ 5. Deploy as an API + Gradio demo on HF Spaces
758
+ 6. Write a detailed model card
759
+
760
+ **Option B: Build a Reasoning Model for Math**
761
+ 1. Generate synthetic math reasoning data
762
+ 2. SFT a model on chain-of-thought examples
763
+ 3. Apply GRPO to improve reasoning quality
764
+ 4. Evaluate on GSM8K and MATH benchmarks
765
+ 5. Compare SFT-only vs SFT+GRPO
766
+ 6. Write a blog post about your findings
767
+
768
+ **Option C: Create an End-to-End Agent System**
769
+ 1. Fine-tune a model for function calling
770
+ 2. Build a multi-tool agent (web search, calculator, code execution)
771
+ 3. Add RAG with a custom knowledge base
772
+ 4. Evaluate agent performance on tasks
773
+ 5. Deploy as a Space with observability/logging
774
+ 6. Write documentation for others to use
775
+
776
+ #### Deliverables
777
+ - [ ] Trained model(s) on Hugging Face Hub
778
+ - [ ] Model card with training details, evaluations, limitations
779
+ - [ ] Gradio demo deployed on HF Spaces
780
+ - [ ] Blog post / README documenting your journey
781
+
782
+ ---
783
+
784
+ ## Reading List: Landmark Papers
785
+
786
+ These are the papers that defined the field. Read them in this order as you progress through the plan.
787
+
788
+ ### Tier 1: Must Read (referenced directly in the plan)
789
+
790
+ | # | Paper | Year | Key Contribution | Link |
791
+ |---|-------|------|-------------------|------|
792
+ | 1 | **Attention Is All You Need** (Vaswani et al.) | 2017 | Transformer architecture | [arxiv:1706.03762](https://arxiv.org/abs/1706.03762) |
793
+ | 2 | **BERT** (Devlin et al.) | 2018 | Bidirectional pretraining, MLM | [arxiv:1810.04805](https://arxiv.org/abs/1810.04805) |
794
+ | 3 | **Language Models are Few-Shot Learners** (GPT-3) | 2020 | In-context learning, scaling | [arxiv:2005.14165](https://arxiv.org/abs/2005.14165) |
795
+ | 4 | **Training Compute-Optimal LLMs** (Chinchilla) | 2022 | Scaling laws, data vs params | [arxiv:2203.15556](https://arxiv.org/abs/2203.15556) |
796
+ | 5 | **Training LMs to Follow Instructions** (InstructGPT) | 2022 | RLHF pipeline | [arxiv:2203.02155](https://arxiv.org/abs/2203.02155) |
797
+ | 6 | **LoRA** (Hu et al.) | 2021 | Parameter-efficient fine-tuning | [arxiv:2106.09685](https://arxiv.org/abs/2106.09685) |
798
+ | 7 | **Direct Preference Optimization** (Rafailov et al.) | 2023 | DPO: alignment without RL | [arxiv:2305.18290](https://arxiv.org/abs/2305.18290) |
799
+ | 8 | **DeepSeek-R1** (DeepSeek AI) | 2025 | Reasoning via RL (trained with GRPO) | [arxiv:2501.12948](https://arxiv.org/abs/2501.12948) |
800
+
801
+ ### Tier 2: Highly Recommended (deepens understanding)
802
+
803
+ | # | Paper | Year | Key Contribution | Link |
804
+ |---|-------|------|-------------------|------|
805
+ | 9 | **Neural Machine Translation by Jointly Learning to Align and Translate** (Bahdanau et al.) | 2014 | Attention mechanism (before Transformers) | [arxiv:1409.0473](https://arxiv.org/abs/1409.0473) |
806
+ | 10 | **A Neural Probabilistic Language Model** (Bengio et al.) | 2003 | Word embeddings, neural LMs | [jmlr.org](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) |
807
+ | 11 | **LLaMA: Open and Efficient Foundation Language Models** (Touvron et al.) | 2023 | Efficient open-source LLM recipe | [arxiv:2302.13971](https://arxiv.org/abs/2302.13971) |
808
+ | 12 | **QLoRA** (Dettmers et al.) | 2023 | 4-bit fine-tuning on consumer GPUs | [arxiv:2305.14314](https://arxiv.org/abs/2305.14314) |
809
+ | 13 | **SmolLM2** (Allal et al.) | 2025 | Small model training recipe | [arxiv:2502.02737](https://arxiv.org/abs/2502.02737) |
810
+ | 14 | **Chain-of-Thought Prompting** (Wei et al.) | 2022 | Step-by-step reasoning | [arxiv:2201.11903](https://arxiv.org/abs/2201.11903) |
811
+
812
+ ### Tier 3: Reference (consult when needed)
813
+
814
+ | # | Paper | Year | Key Contribution | Link |
815
+ |---|-------|------|-------------------|------|
816
+ | 15 | **FlashAttention** (Dao et al.) | 2022 | Memory-efficient attention | [arxiv:2205.14135](https://arxiv.org/abs/2205.14135) |
817
+ | 16 | **RoFormer / RoPE** (Su et al.) | 2021 | Rotary position embeddings | [arxiv:2104.09864](https://arxiv.org/abs/2104.09864) |
818
+ | 17 | **GQA: Grouped Query Attention** (Ainslie et al.) | 2023 | Efficient attention variant | [arxiv:2305.13245](https://arxiv.org/abs/2305.13245) |
819
+ | 18 | **Scaling Laws for Neural LMs** (Kaplan et al.) | 2020 | Power-law scaling | [arxiv:2001.08361](https://arxiv.org/abs/2001.08361) |
820
+
821
+ ---
822
+
823
+ ## All Resources & Links
824
+
825
+ ### 📚 Courses
826
+
827
+ | Course | Provider | URL |
828
+ |--------|----------|-----|
829
+ | NLP / LLM Course (12 chapters) | Hugging Face | [huggingface.co/learn/nlp-course](https://huggingface.co/learn/nlp-course/chapter1/1) |
830
+ | smol-course (8 modules) | Hugging Face | [github.com/huggingface/smol-course](https://github.com/huggingface/smol-course) |
831
+ | Deep RL Course | Hugging Face | [huggingface.co/learn/deep-rl-course](https://huggingface.co/learn/deep-rl-course/unit0/introduction) |
832
+ | AI Agents Course | Hugging Face | [huggingface.co/learn/agents-course](https://huggingface.co/learn/agents-course/unit0/introduction) |
833
+ | Neural Networks: Zero to Hero | Andrej Karpathy | [YouTube Playlist](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ) |
834
+ | Practical Deep Learning | fast.ai | [course.fast.ai](https://course.fast.ai/) |
835
+
836
+ ### 🎬 Key Videos
837
+
838
+ | Video | Creator | Duration | Link |
839
+ |-------|---------|----------|------|
840
+ | Let's Build GPT | Karpathy | ~2 hrs | [youtu.be/kCc8FmEb1nY](https://youtu.be/kCc8FmEb1nY) |
841
+ | Let's Build the GPT Tokenizer | Karpathy | ~2 hrs | [YouTube](https://www.youtube.com/watch?v=zduSFxRajkE) |
842
+ | Reproducing GPT-2 (124M) | Karpathy | ~4 hrs | [YouTube](https://www.youtube.com/watch?v=l8pRSuU81PU) |
843
+ | Attention in Transformers | 3Blue1Brown | ~45 min | [YouTube](https://www.youtube.com/watch?v=eMlx5fFNoYc) |
844
+ | Essence of Linear Algebra | 3Blue1Brown | ~3.5 hrs | [YouTube Playlist](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab) |
845
+
846
+ ### 🛠️ Code Repositories
847
+
848
+ | Repo | Description | Link |
849
+ |------|-------------|------|
850
+ | nanoGPT | Simplest GPT training code | [github.com/karpathy/nanoGPT](https://github.com/karpathy/nanoGPT) |
851
+ | minGPT | Educational GPT implementation | [github.com/karpathy/minGPT](https://github.com/karpathy/minGPT) |
852
+ | makemore | Character LM progression | [github.com/karpathy/makemore](https://github.com/karpathy/makemore) |
853
+ | micrograd | Autograd engine from scratch | [github.com/karpathy/micrograd](https://github.com/karpathy/micrograd) |
854
+
855
+ ### 📦 HF Libraries
856
+
857
+ | Library | Purpose | Docs |
858
+ |---------|---------|------|
859
+ | `transformers` | Model loading, inference, training | [huggingface.co/docs/transformers](https://huggingface.co/docs/transformers) |
860
+ | `datasets` | Data loading & processing | [huggingface.co/docs/datasets](https://huggingface.co/docs/datasets) |
861
+ | `tokenizers` | Fast tokenizer training | [huggingface.co/docs/tokenizers](https://huggingface.co/docs/tokenizers) |
862
+ | `trl` | SFT, DPO, GRPO trainers | [huggingface.co/docs/trl](https://huggingface.co/docs/trl) |
863
+ | `peft` | LoRA, QLoRA, adapters | [huggingface.co/docs/peft](https://huggingface.co/docs/peft) |
864
+ | `accelerate` | Distributed & mixed-precision training | [huggingface.co/docs/accelerate](https://huggingface.co/docs/accelerate) |
865
+ | `evaluate` | Metrics & evaluation | [huggingface.co/docs/evaluate](https://huggingface.co/docs/evaluate) |
866
+ | `lighteval` | LLM benchmarking | [huggingface.co/docs/lighteval](https://huggingface.co/docs/lighteval) |
867
+ | `smolagents` | Agent framework | [huggingface.co/docs/smolagents](https://huggingface.co/docs/smolagents) |
868
+ | `gradio` | ML demos | [gradio.app](https://www.gradio.app/) |
869
+ | `bitsandbytes` | Quantization | [huggingface.co/docs/bitsandbytes](https://huggingface.co/docs/bitsandbytes) |
870
+ | `distilabel` | Synthetic data | [huggingface.co/docs/distilabel](https://huggingface.co/docs/distilabel) |
871
+
872
+ ### 🤖 Models to Use
873
+
874
+ | Model | Size | Use Case | Link |
875
+ |-------|------|----------|------|
876
+ | SmolLM2-135M | 135M | Learning, fast experiments | [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M) |
877
+ | SmolLM2-360M | 360M | Small model training | [HuggingFaceTB/SmolLM2-360M](https://huggingface.co/HuggingFaceTB/SmolLM2-360M) |
878
+ | SmolLM2-1.7B | 1.7B | QLoRA on consumer GPU | [HuggingFaceTB/SmolLM2-1.7B](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B) |
879
+ | GPT-2 | 124M | Reference implementation | [openai-community/gpt2](https://huggingface.co/openai-community/gpt2) |
880
+ | DistilBERT | 66M | Classification tasks | [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) |
881
+
882
+ ### 📊 Datasets to Use
883
+
884
+ | Dataset | Format | Use Case | Link |
885
+ |---------|--------|----------|------|
886
+ | smoltalk | ChatML (`messages`) | SFT training | [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) |
887
+ | Tiny Shakespeare | Raw text | nanoGPT pretraining | Included in nanoGPT repo |
888
+ | TinyStories | Text | Small model pretraining | [roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) |
889
+ | IMDb | Classification | Fine-tuning practice | [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) |
890
+ | GSM8K | Math QA | Reasoning evaluation | [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k) |
891
+
892
+ ---
893
+
894
+ ## Appendix: Glossary of Key Terms
895
+
896
+ | Term | Definition |
897
+ |------|-----------|
898
+ | **Attention** | Mechanism that lets each token attend to all other tokens, computing relevance-weighted representations. Core formula: `softmax(QK^T/√d_k)V` |
899
+ | **Autoregressive** | Generating one token at a time, each conditioned on all previous tokens. GPT-style models. |
900
+ | **Backpropagation** | Algorithm for computing gradients of the loss w.r.t. all parameters by applying the chain rule backward through the computation graph. |
901
+ | **BPE (Byte Pair Encoding)** | Tokenization algorithm that iteratively merges the most frequent pair of tokens. Used by GPT-2, GPT-3, LLaMA. |
902
+ | **Causal Mask** | Lower-triangular mask that prevents tokens from attending to future positions. Makes the model autoregressive. |
903
+ | **ChatML** | Standard format for chat data: list of `{role, content}` dictionaries with roles `system`, `user`, `assistant`. |
904
+ | **Cross-Entropy Loss** | Standard loss for classification/language modeling: `-Σ yᵢ log(ŷᵢ)`. Measures how well the predicted distribution matches the target. |
905
+ | **DPO (Direct Preference Optimization)** | Alignment method that directly optimizes the policy from preference pairs, without training a separate reward model. |
906
+ | **Embedding** | Dense vector representation of a discrete token. Learned lookup table mapping token IDs to vectors. |
907
+ | **Fine-Tuning** | Continuing training of a pretrained model on a specific downstream task or dataset. |
908
+ | **GRPO (Group Relative Policy Optimization)** | RL algorithm that updates the policy based on relative advantage within a group of sampled responses. Used by DeepSeek-R1. |
909
+ | **Gradient Accumulation** | Simulating large batch sizes by accumulating gradients over multiple forward/backward passes before updating weights. |
910
+ | **KV-Cache** | Caching key and value tensors from previous tokens during autoregressive generation, avoiding recomputation. |
911
+ | **Layer Normalization** | Normalizing activations across the feature dimension (not the batch dimension). Stabilizes Transformer training. |
912
+ | **LoRA** | Adding small low-rank matrices (B×A, with rank r << d) to existing weight matrices. Trains ~0.1–1% of parameters. |
913
+ | **Perplexity** | `exp(cross-entropy loss)`. Intuitively: how many tokens the model is "confused" between. Lower = better. |
914
+ | **Positional Encoding** | Information added to token embeddings so the model knows the order of tokens. Sinusoidal (original) or learned (GPT-2). |
915
+ | **Pretraining** | Initial training on a large unlabeled corpus (next-token prediction). Creates the base model. |
916
+ | **QLoRA** | LoRA applied to a 4-bit quantized base model. Enables fine-tuning 65B models on a single 48GB GPU. |
917
+ | **Quantization** | Reducing numerical precision (fp32 → fp16 → int8 → int4) to reduce model size and speed up inference. |
918
+ | **Residual Connection** | `output = x + f(x)`. Allows gradients to flow directly through the network, enabling very deep models. |
919
+ | **RLHF** | Reinforcement Learning from Human Feedback. Pipeline: SFT → Reward Model → PPO. Original alignment method (InstructGPT). |
920
+ | **Scaling Laws** | Empirical finding that LM loss follows a power law: `L(N) ∝ N^(-α)`. Predicts performance from compute budget. |
921
+ | **Self-Attention** | Attention where queries, keys, and values all come from the same sequence. Each token attends to all tokens in the sequence. |
922
+ | **SFT (Supervised Fine-Tuning)** | Fine-tuning on instruction-response pairs. Transforms base models into helpful assistants. |
923
+ | **Softmax** | `softmax(xᵢ) = exp(xᵢ) / Σ exp(xⱼ)`. Converts raw scores (logits) to a probability distribution. |
924
+ | **Temperature** | Scaling factor applied to logits before softmax during generation. Higher = more random, lower = more deterministic. |
925
+ | **Token** | The atomic unit of text for the model. Can be a character, subword, or word depending on the tokenizer. |
926
+ | **Transformer** | Neural network architecture based on self-attention, introduced in "Attention Is All You Need" (2017). Foundation of nearly all modern LLMs. |
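Several glossary entries (Softmax, Temperature, Cross-Entropy Loss, Perplexity) fit together in a few lines of code. The sketch below is a plain-Python illustration with made-up logits; it subtracts the maximum before exponentiating for numerical stability, which does not change the result.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature rescales logits first."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Loss for one prediction when the target is a single correct token."""
    return -math.log(probs[target_index])


logits = [2.0, 1.0, 0.0, -1.0]
sharp = softmax(logits, temperature=0.5)  # lower temperature: more deterministic
flat = softmax(logits, temperature=2.0)   # higher temperature: more random
print(sharp[0] > softmax(logits)[0] > flat[0])  # True

# Uniform prediction over 4 tokens: loss is log(4), so perplexity is 4.
uniform = softmax([0.0, 0.0, 0.0, 0.0])
print(round(math.exp(cross_entropy(uniform, 0)), 6))  # 4.0
```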
927
+
928
+ ---
929
+
930
+ ## Progress Tracker
931
+
932
+ Use this checklist to track your progress:
933
+
934
+ ### Phase 1: Foundations ☐
935
+ - [ ] Week 1: Linear algebra & calculus videos complete
936
+ - [ ] Week 1: Implemented `matmul`, `softmax`, `cross_entropy` from scratch
937
+ - [ ] Week 2: Watched 3B1B neural networks series
938
+ - [ ] Week 2: Built micrograd (autograd engine)
939
+ - [ ] Week 3: Completed PyTorch 60-min blitz
940
+ - [ ] Week 3: Built bigram + MLP language models (makemore Parts 1–2)
941
+
942
+ ### Phase 2: Transformer Architecture ☐
943
+ - [ ] Week 4: Completed makemore Parts 3–5
944
+ - [ ] Week 4: Can manually backpropagate through a small network
945
+ - [ ] Week 5: Read "Attention Is All You Need" (all of §3)
946
+ - [ ] Week 5: Can draw the full Transformer architecture from memory
947
+ - [ ] Week 6: Watched "Let's Build GPT" and implemented along
948
+ - [ ] Week 6: Trained a working GPT on Shakespeare that generates text
949
+
950
+ ### Phase 3: Language Modeling ☐
951
+ - [ ] Week 7: Implemented BPE from scratch
952
+ - [ ] Week 7: Trained a HuggingFace tokenizer on custom data
953
+ - [ ] Week 8: Read Chinchilla & GPT-3 papers
954
+ - [ ] Week 8: Can calculate FLOPs and training time for a given model size
955
+ - [ ] Week 9: Pretrained a small GPT (10M–50M params)
956
+ - [ ] Week 9: Pushed model to Hugging Face Hub
957
+
958
+ ### Phase 4: HF Ecosystem ☐
959
+ - [ ] Week 10: Loaded and ran 5 different models via `pipeline()`
960
+ - [ ] Week 11: Fine-tuned a text classifier with `Trainer`
961
+ - [ ] Week 11: Model pushed to Hub
962
+ - [ ] Week 12: Deployed a Gradio demo on HF Spaces
963
+
964
+ ### Phase 5: Fine-Tuning & Alignment ☐
965
+ - [ ] Week 13: SFT'd SmolLM2 into a chat model
966
+ - [ ] Week 14: Applied QLoRA to a 1.7B model
967
+ - [ ] Week 15: Trained a DPO-aligned model
968
+ - [ ] Week 16: Trained a GRPO reasoning model
969
+
970
+ ### Phase 6: Advanced ☐
971
+ - [ ] Week 17: Benchmarked all models with `lighteval`
972
+ - [ ] Week 18: Generated synthetic data, quantized a model
973
+ - [ ] Week 19: Built a RAG agent with smolagents
974
+ - [ ] Week 20: Completed capstone project
975
+
976
+ ---
977
+
978
+ > *"The best way to understand LLMs is to build one from scratch. The second best way is to train one. The third best way is to read the papers. Do all three."*
979
+
980
+ **Good luck on your journey! 🚀**