The messages format – structured conversation data
SFTTrainer from TRL – the standard fine-tuning trainer
Dataset preparation – converting raw text to ChatML format
Packing – fitting multiple short examples in one sequence
Evaluation – loss curves, manual quality checks, benchmarks
Base model → Instruct model transformation
Dataset Format for SFT
{"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"What is the capital of France?"},{"role":"assistant","content":"The capital of France is Paris."}]}
β parameter – controls deviation from reference policy
Preference dataset format – prompt, chosen, rejected columns
KL divergence – preventing the model from straying too far
Dataset Format for DPO
{"prompt":"Explain quantum computing simply.","chosen":"Quantum computing uses qubits that can be 0, 1, or both at once...","rejected":"Quantum computing is a type of computing that uses quantum mechanics..."}
Project: Align a Model with DPO
Start from your SFT model (Week 13)
Load a preference dataset
Train with DPOTrainer (see the sketch after this list)
Compare: base → SFT → DPO outputs side by side
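A minimal DPO sketch with TRL. The dataset and checkpoint names are illustrative, and some argument names (e.g. processing_class vs. tokenizer) differ across TRL versions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Preference dataset with "prompt", "chosen", "rejected" columns as shown above (name is illustrative).
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

model_id = "your-username/smollm2-sft"       # your Week 13 SFT checkpoint (illustrative)
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                          # TRL keeps a frozen copy of the model as the reference policy
    train_dataset=dataset,
    processing_class=tokenizer,              # called `tokenizer` in older TRL versions
    args=DPOConfig(output_dir="smollm2-dpo", beta=0.1),  # beta controls deviation from the reference
)
trainer.train()
```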
For each prompt x:
1. Generate K responses: {y₁, y₂, ..., y_K}
2. Score each: {r₁, r₂, ..., r_K}
3. Compute group advantage: Aᵢ = (rᵢ - mean(r)) / std(r) (see the sketch after this list)
4. Update policy to increase probability of high-advantage responses
5. Apply KL penalty to stay close to reference model
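A small sketch of step 3, with made-up rewards standing in for a verifier's scores (e.g. exact-match checking of GSM8K answers):

```python
import torch

rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0])  # K = 8 sampled responses for one prompt
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # A_i = (r_i - mean(r)) / std(r)
print(advantages)   # correct answers get positive advantage, wrong ones negative
```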
Project: Train a Math Reasoning Model with GRPO
Start from a small instruct model
Prepare a math dataset with verifiable answers (GSM8K format)
Glossary
Attention
Mechanism that lets each token attend to all other tokens, computing relevance-weighted representations. Core formula: softmax(QK^T/√d_k)V
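A single-head sketch of that formula in PyTorch (the causal mask, covered below, is the only extra ingredient):

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v, causal=True):
    # x: (seq_len, d_model); w_q / w_k / w_v: (d_model, d_k) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(k.size(-1))                 # QK^T / sqrt(d_k)
    if causal:
        future = torch.triu(torch.ones_like(scores), diagonal=1).bool()
        scores = scores.masked_fill(future, float("-inf"))   # causal mask: no attending to future tokens
    return F.softmax(scores, dim=-1) @ v                     # relevance-weighted sum of values

x = torch.randn(5, 16)
out = self_attention(x, *(torch.randn(16, 8) for _ in range(3)))   # -> (5, 8)
```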
Autoregressive
Generating one token at a time, each conditioned on all previous tokens. GPT-style models.
Backpropagation
Algorithm for computing gradients of the loss w.r.t. all parameters by applying the chain rule backward through the computation graph.
BPE (Byte Pair Encoding)
Tokenization algorithm that iteratively merges the most frequent pair of tokens. Used by GPT-2, GPT-3, LLaMA.
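A toy sketch of one merge step; real BPE repeats this on a large corpus until the vocabulary reaches a target size:

```python
from collections import Counter

def merge_most_frequent(tokens):
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    best = max(pairs, key=pairs.get)                      # most frequent adjacent pair
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
            merged.append(tokens[i] + tokens[i + 1])      # fuse the pair into one new token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")
for _ in range(5):
    tokens = merge_most_frequent(tokens)
```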
Causal Mask
Lower-triangular mask that prevents tokens from attending to future positions. Makes the model autoregressive.
ChatML
Standard format for chat data: list of {role, content} dictionaries with roles system, user, assistant.
Cross-Entropy Loss
Standard loss for classification/language modeling: -Σ yᵢ log(ŷᵢ). Measures how well predicted distribution matches target.
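The same quantity computed two ways in PyTorch, for a single token position:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # (batch=1, vocab=3): raw scores over a tiny vocabulary
target = torch.tensor([0])                  # index of the correct next token

loss_manual = -F.log_softmax(logits, dim=-1)[0, target[0]]   # -log p(correct token)
loss_builtin = F.cross_entropy(logits, target)
assert torch.isclose(loss_manual, loss_builtin)
```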
DPO (Direct Preference Optimization)
Alignment method that directly optimizes the policy from preference pairs, without training a separate reward model.
Embedding
Dense vector representation of a discrete token. Learned lookup table mapping token IDs to vectors.
Fine-Tuning
Continuing training of a pretrained model on a specific downstream task or dataset.
GRPO (Group Relative Policy Optimization)
RL algorithm that updates the policy based on relative advantage within a group of sampled responses. Used by DeepSeek-R1.
Gradient Accumulation
Simulating large batch sizes by accumulating gradients over multiple forward/backward passes before updating weights.
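A self-contained toy sketch; the linear layer and random batches are placeholders for a real model and dataloader:

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
batches = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(8)]   # 8 micro-batches of 2
accum_steps = 4

optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps   # scale so gradients average correctly
    loss.backward()                                                  # grads add up in .grad across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()        # one update per 4 micro-batches = effective batch of 8 examples
        optimizer.zero_grad()
```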
KV-Cache
Caching key and value tensors from previous tokens during autoregressive generation, avoiding recomputation.
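A conceptual sketch of the caching pattern (the K/V "projections" here are stand-ins, not a real attention layer):

```python
import torch

d = 16
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)

for step in range(5):                        # one decoding step per new token
    x_new = torch.randn(1, d)                # hidden state of the newest token only
    k_new, v_new = x_new, x_new              # stand-ins for the real K/V projections
    k_cache = torch.cat([k_cache, k_new])    # append; earlier K/V are never recomputed
    v_cache = torch.cat([v_cache, v_new])
    attn = torch.softmax(x_new @ k_cache.T / d ** 0.5, dim=-1) @ v_cache   # attend over all cached tokens
```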
Layer Normalization
Normalizing activations across the feature dimension (not the batch dimension). Stabilizes Transformer training.
LoRA
Adding small low-rank matrices (B×A where rank r << d) to existing weight matrices. Trains ~0.1% of parameters.
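A minimal sketch of the idea; initializing B to zero means the adapter starts as a no-op:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)               # stands in for a pretrained weight
        for p in self.base.parameters():
            p.requires_grad_(False)                      # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))     # zero init: no change at the start of training
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)   # W x + scale * (B A) x

layer = LoRALinear(256, 256)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)   # only A and B
```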
Perplexity
exp(cross-entropy loss). Intuitively: how many tokens the model is "confused" between. Lower = better.
Positional Encoding
Information added to token embeddings so the model knows the order of tokens. Sinusoidal (original) or learned (GPT-2).
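The sinusoidal variant as a short function:

```python
import math
import torch

def sinusoidal_pe(seq_len, d_model):
    pos = torch.arange(seq_len).unsqueeze(1)                                      # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)    # even dimensions: sine
    pe[:, 1::2] = torch.cos(pos * div)    # odd dimensions: cosine
    return pe                             # added to the token embeddings before the first layer

pe = sinusoidal_pe(128, 64)
```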
Pretraining
Initial training on a large unlabeled corpus (next-token prediction). Creates the base model.
QLoRA
LoRA applied to a 4-bit quantized base model. Enables fine-tuning 65B models on a single 48GB GPU.
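A sketch of the usual recipe with transformers + bitsandbytes + peft; the model name and target modules are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-1.7B", quantization_config=bnb_config
)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]))
model.print_trainable_parameters()           # only the LoRA adapters train; the 4-bit base stays frozen
```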
Quantization
Reducing numerical precision (fp32 → fp16 → int8 → int4) to reduce model size and speed up inference.
Residual Connection
output = x + f(x). Allows gradients to flow directly through the network, enabling very deep models.
RLHF
Reinforcement Learning from Human Feedback. Pipeline: SFT → Reward Model → PPO. Original alignment method (InstructGPT).
Scaling Laws
Empirical finding that LM loss follows a power law: L(N) ∝ N^(-α). Predicts performance from compute budget.
Self-Attention
Attention where queries, keys, and values all come from the same sequence. Each token attends to all tokens in the sequence.
SFT (Supervised Fine-Tuning)
Fine-tuning on instruction-response pairs. Transforms base models into helpful assistants.
Softmax
softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ). Converts raw scores (logits) to a probability distribution.
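A numerically stable implementation; subtracting the max leaves the result unchanged but avoids overflow:

```python
import torch

def softmax(x):
    e = torch.exp(x - x.max())   # shift by the max: exp of large logits would otherwise overflow
    return e / e.sum()

probs = softmax(torch.tensor([2.0, 1.0, 0.1]))
assert torch.allclose(probs, torch.softmax(torch.tensor([2.0, 1.0, 0.1]), dim=-1))
```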
Temperature
Scaling factor applied to logits before softmax during generation. Higher = more random, lower = more deterministic.
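A small sketch showing how the distribution sharpens or flattens with T:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
for T in (0.2, 1.0, 2.0):
    probs = torch.softmax(logits / T, dim=-1)
    token = torch.multinomial(probs, num_samples=1)   # low T -> near-greedy, high T -> more random
    print(T, probs)
```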
Token
The atomic unit of text for the model. Can be a character, subword, or word depending on the tokenizer.
Transformer
Neural network architecture based on self-attention, introduced in "Attention Is All You Need" (2017). Foundation of all modern LLMs.
Progress Tracker
Use this checklist to track your progress:
Phase 1: Foundations
Week 1: Linear algebra & calculus videos complete
Week 1: Implemented matmul, softmax, cross_entropy from scratch
Week 2: Watched 3B1B neural networks series
Week 2: Built micrograd (autograd engine)
Week 3: Completed PyTorch 60-min blitz
Week 3: Built bigram + MLP language models (makemore Parts 1–2)
Phase 2: Transformer Architecture
Week 4: Completed makemore Parts 3–5
Week 4: Can manually backpropagate through a small network
Week 5: Read "Attention Is All You Need" (all of §3)
Week 5: Can draw the full Transformer architecture from memory
Week 6: Watched "Let's Build GPT" and implemented along
Week 6: Trained a working GPT on Shakespeare that generates text
Phase 3: Language Modeling
Week 7: Implemented BPE from scratch
Week 7: Trained a HuggingFace tokenizer on custom data
Week 8: Read Chinchilla & GPT-3 papers
Week 8: Can calculate FLOPs and training time for a given model size
Week 9: Pretrained a small GPT (10M–50M params)
Week 9: Pushed model to Hugging Face Hub
Phase 4: HF Ecosystem
Week 10: Loaded and ran 5 different models via pipeline()
Week 11: Fine-tuned a text classifier with Trainer
Week 11: Model pushed to Hub
Week 12: Deployed a Gradio demo on HF Spaces
Phase 5: Fine-Tuning & Alignment
Week 13: SFT'd SmolLM2 into a chat model
Week 14: Applied QLoRA to a 1.7B model
Week 15: Trained a DPO-aligned model
Week 16: Trained a GRPO reasoning model
Phase 6: Advanced
Week 17: Benchmarked all models with lighteval
Week 18: Generated synthetic data, quantized a model
Week 19: Built a RAG agent with smolagents
Week 20: Completed capstone project
"The best way to understand LLMs is to build one from scratch. The second best way is to train one. The third best way is to read the papers. Do all three."