The messages format – structured conversation data
SFTTrainer from TRL – the standard fine-tuning trainer
Dataset preparation – converting raw text to ChatML format
Packing – fitting multiple short examples in one sequence
Evaluation – loss curves, manual quality checks, benchmarks
Base model → Instruct model transformation
Dataset Format for SFT
{"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"What is the capital of France?"},{"role":"assistant","content":"The capital of France is Paris."}]}
β parameter – controls deviation from reference policy
Preference dataset format – prompt, chosen, rejected columns
KL divergence – preventing the model from straying too far
Dataset Format for DPO
{"prompt":"Explain quantum computing simply.","chosen":"Quantum computing uses qubits that can be 0, 1, or both at once...","rejected":"Quantum computing is a type of computing that uses quantum mechanics..."}
Project: Align a Model with DPO
Start from your SFT model (Week 13)
Load a preference dataset
Train with DPOTrainer (see the sketch after this list)
Compare: base → SFT → DPO outputs side by side
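A minimal DPO sketch with TRL. The dataset and checkpoint names are illustrative, and some argument names (e.g. processing_class vs. tokenizer) differ across TRL versions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Preference dataset with "prompt", "chosen", "rejected" columns as shown above (name is illustrative).
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

model_id = "your-username/smollm2-sft"       # your Week 13 SFT checkpoint (illustrative)
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                          # TRL keeps a frozen copy of the model as the reference policy
    train_dataset=dataset,
    processing_class=tokenizer,              # called `tokenizer` in older TRL versions
    args=DPOConfig(output_dir="smollm2-dpo", beta=0.1),  # beta controls deviation from the reference
)
trainer.train()
```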
For each prompt x:
1. Generate K responses: {y₁, y₂, ..., y_K}
2. Score each: {r₁, r₂, ..., r_K}
3. Compute group advantage: Aᵢ = (rᵢ - mean(r)) / std(r) (see the sketch after this list)
4. Update policy to increase probability of high-advantage responses
5. Apply KL penalty to stay close to reference model
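A small sketch of step 3, with made-up rewards standing in for a verifier's scores (e.g. exact-match checking of GSM8K answers):

```python
import torch

rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0])  # K = 8 sampled responses for one prompt
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # A_i = (r_i - mean(r)) / std(r)
print(advantages)   # correct answers get positive advantage, wrong ones negative
```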
Project: Train a Math Reasoning Model with GRPO
Start from a small instruct model
Prepare a math dataset with verifiable answers (GSM8K format)
Glossary
Attention
Mechanism that lets each token attend to all other tokens, computing relevance-weighted representations. Core formula: softmax(QK^T/√d_k)V
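A single-head sketch of that formula in PyTorch (the causal mask, covered below, is the only extra ingredient):

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v, causal=True):
    # x: (seq_len, d_model); w_q / w_k / w_v: (d_model, d_k) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(k.size(-1))                 # QK^T / sqrt(d_k)
    if causal:
        future = torch.triu(torch.ones_like(scores), diagonal=1).bool()
        scores = scores.masked_fill(future, float("-inf"))   # causal mask: no attending to future tokens
    return F.softmax(scores, dim=-1) @ v                     # relevance-weighted sum of values

x = torch.randn(5, 16)
out = self_attention(x, *(torch.randn(16, 8) for _ in range(3)))   # -> (5, 8)
```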
Autoregressive
Generating one token at a time, each conditioned on all previous tokens. GPT-style models.
Backpropagation
Algorithm for computing gradients of the loss w.r.t. all parameters by applying the chain rule backward through the computation graph.
BPE (Byte Pair Encoding)
Tokenization algorithm that iteratively merges the most frequent pair of tokens. Used by GPT-2, GPT-3, LLaMA.
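A toy sketch of one merge step; real BPE repeats this on a large corpus until the vocabulary reaches a target size:

```python
from collections import Counter

def merge_most_frequent(tokens):
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    best = max(pairs, key=pairs.get)                      # most frequent adjacent pair
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
            merged.append(tokens[i] + tokens[i + 1])      # fuse the pair into one new token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")
for _ in range(5):
    tokens = merge_most_frequent(tokens)
```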
Causal Mask
Lower-triangular mask that prevents tokens from attending to future positions. Makes the model autoregressive.
ChatML
Standard format for chat data: list of {role, content} dictionaries with roles system, user, assistant.
Cross-Entropy Loss
Standard loss for classification/language modeling: -Σ yᵢ log(ŷᵢ). Measures how well predicted distribution matches target.
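The same quantity computed two ways in PyTorch, for a single token position:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # (batch=1, vocab=3): raw scores over a tiny vocabulary
target = torch.tensor([0])                  # index of the correct next token

loss_manual = -F.log_softmax(logits, dim=-1)[0, target[0]]   # -log p(correct token)
loss_builtin = F.cross_entropy(logits, target)
assert torch.isclose(loss_manual, loss_builtin)
```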
DPO (Direct Preference Optimization)
Alignment method that directly optimizes the policy from preference pairs, without training a separate reward model.
Embedding
Dense vector representation of a discrete token. Learned lookup table mapping token IDs to vectors.
Fine-Tuning
Continuing training of a pretrained model on a specific downstream task or dataset.
GRPO (Group Relative Policy Optimization)
RL algorithm that updates the policy based on relative advantage within a group of sampled responses. Used by DeepSeek-R1.
Gradient Accumulation
Simulating large batch sizes by accumulating gradients over multiple forward/backward passes before updating weights.
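A self-contained toy sketch; the linear layer and random batches are placeholders for a real model and dataloader:

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
batches = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(8)]   # 8 micro-batches of 2
accum_steps = 4

optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps   # scale so gradients average correctly
    loss.backward()                                                  # grads add up in .grad across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()        # one update per 4 micro-batches = effective batch of 8 examples
        optimizer.zero_grad()
```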
KV-Cache
Caching key and value tensors from previous tokens during autoregressive generation, avoiding recomputation.
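A conceptual sketch of the caching pattern (the K/V "projections" here are stand-ins, not a real attention layer):

```python
import torch

d = 16
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)

for step in range(5):                        # one decoding step per new token
    x_new = torch.randn(1, d)                # hidden state of the newest token only
    k_new, v_new = x_new, x_new              # stand-ins for the real K/V projections
    k_cache = torch.cat([k_cache, k_new])    # append; earlier K/V are never recomputed
    v_cache = torch.cat([v_cache, v_new])
    attn = torch.softmax(x_new @ k_cache.T / d ** 0.5, dim=-1) @ v_cache   # attend over all cached tokens
```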
Layer Normalization
Normalizing activations across the feature dimension (not the batch dimension). Stabilizes Transformer training.
LoRA
Adding small low-rank matrices (B×A where rank r << d) to existing weight matrices. Trains ~0.1% of parameters.
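A minimal sketch of the idea; initializing B to zero means the adapter starts as a no-op:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)               # stands in for a pretrained weight
        for p in self.base.parameters():
            p.requires_grad_(False)                      # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))     # zero init: no change at the start of training
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)   # W x + scale * (B A) x

layer = LoRALinear(256, 256)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)   # only A and B
```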
Perplexity
exp(cross-entropy loss). Intuitively: how many tokens the model is "confused" between. Lower = better.
Positional Encoding
Information added to token embeddings so the model knows the order of tokens. Sinusoidal (original) or learned (GPT-2).
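The sinusoidal variant as a short function:

```python
import math
import torch

def sinusoidal_pe(seq_len, d_model):
    pos = torch.arange(seq_len).unsqueeze(1)                                      # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)    # even dimensions: sine
    pe[:, 1::2] = torch.cos(pos * div)    # odd dimensions: cosine
    return pe                             # added to the token embeddings before the first layer

pe = sinusoidal_pe(128, 64)
```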
Pretraining
Initial training on a large unlabeled corpus (next-token prediction). Creates the base model.
QLoRA
LoRA applied to a 4-bit quantized base model. Enables fine-tuning 65B models on a single 48GB GPU.
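A sketch of the usual recipe with transformers + bitsandbytes + peft; the model name and target modules are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-1.7B", quantization_config=bnb_config
)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]))
model.print_trainable_parameters()           # only the LoRA adapters train; the 4-bit base stays frozen
```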
Quantization
Reducing numerical precision (fp32 → fp16 → int8 → int4) to reduce model size and speed up inference.
Residual Connection
output = x + f(x). Allows gradients to flow directly through the network, enabling very deep models.
RLHF
Reinforcement Learning from Human Feedback. Pipeline: SFT → Reward Model → PPO. Original alignment method (InstructGPT).
Scaling Laws
Empirical finding that LM loss follows a power law: L(N) ∝ N^(-α). Predicts performance from compute budget.
Self-Attention
Attention where queries, keys, and values all come from the same sequence. Each token attends to all tokens in the sequence.
SFT (Supervised Fine-Tuning)
Fine-tuning on instruction-response pairs. Transforms base models into helpful assistants.
Softmax
softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ). Converts raw scores (logits) to a probability distribution.
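A numerically stable implementation; subtracting the max leaves the result unchanged but avoids overflow:

```python
import torch

def softmax(x):
    e = torch.exp(x - x.max())   # shift by the max: exp of large logits would otherwise overflow
    return e / e.sum()

probs = softmax(torch.tensor([2.0, 1.0, 0.1]))
assert torch.allclose(probs, torch.softmax(torch.tensor([2.0, 1.0, 0.1]), dim=-1))
```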
Temperature
Scaling factor applied to logits before softmax during generation. Higher = more random, lower = more deterministic.
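A small sketch showing how the distribution sharpens or flattens with T:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
for T in (0.2, 1.0, 2.0):
    probs = torch.softmax(logits / T, dim=-1)
    token = torch.multinomial(probs, num_samples=1)   # low T -> near-greedy, high T -> more random
    print(T, probs)
```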
Token
The atomic unit of text for the model. Can be a character, subword, or word depending on the tokenizer.
Transformer
Neural network architecture based on self-attention, introduced in "Attention Is All You Need" (2017). Foundation of all modern LLMs.
Progress Tracker
Use this checklist to track your progress:
Phase 1: Foundations
Week 1: Linear algebra & calculus videos complete
Week 1: Implemented matmul, softmax, cross_entropy from scratch
Week 2: Watched 3B1B neural networks series
Week 2: Built micrograd (autograd engine)
Week 3: Completed PyTorch 60-min blitz
Week 3: Built bigram + MLP language models (makemore Parts 1–2)
Phase 2: Transformer Architecture
Week 4: Completed makemore Parts 3–5
Week 4: Can manually backpropagate through a small network
Week 5: Read "Attention Is All You Need" (all of §3)
Week 5: Can draw the full Transformer architecture from memory
Week 6: Watched "Let's Build GPT" and implemented along
Week 6: Trained a working GPT on Shakespeare that generates text
Phase 3: Language Modeling
Week 7: Implemented BPE from scratch
Week 7: Trained a HuggingFace tokenizer on custom data
Week 8: Read Chinchilla & GPT-3 papers
Week 8: Can calculate FLOPs and training time for a given model size
Week 9: Pretrained a small GPT (10M–50M params)
Week 9: Pushed model to Hugging Face Hub
Phase 4: HF Ecosystem
Week 10: Loaded and ran 5 different models via pipeline()
Week 11: Fine-tuned a text classifier with Trainer
Week 11: Model pushed to Hub
Week 12: Deployed a Gradio demo on HF Spaces
Phase 5: Fine-Tuning & Alignment
Week 13: SFT'd SmolLM2 into a chat model
Week 14: Applied QLoRA to a 1.7B model
Week 15: Trained a DPO-aligned model
Week 16: Trained a GRPO reasoning model
Phase 6: Advanced
Week 17: Benchmarked all models with lighteval
Week 18: Generated synthetic data, quantized a model
Week 19: Built a RAG agent with smolagents
Week 20: Completed capstone project
"The best way to understand LLMs is to build one from scratch. The second best way is to train one. The third best way is to read the papers. Do all three."