{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 05. Loss Debugging (5-Level Diagnostic Framework)\n", "\n", "A framework for systematically diagnosing why loss is not decreasing as expected during training.\n", "\n", "**Always check from the lowest level first**: find data bugs before tuning hyperparameters.\n", "\n", "```\n", "Level 0: Status Diagnosis ← Classify current training state (6 types)\n", "Level 1: Data / Implementation ← Most common cause (70%)\n", "Level 2: Numerical Stability ← NaN/Inf, activation explosion\n", "Level 3: Hyperparameters ← LR, batch size, warmup\n", "Level 4: Fitting Diagnosis ← overfitting vs underfitting\n", "Level 5: Architecture ← initialization, per-layer activation\n", "```" ], "id": "19f7e954" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# No additional packages required\n", "# LossDebugger only uses torch and built-in llm_lab modules" ], "id": "af6a605f" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "try:\n", " import google.colab\n", " from google.colab import drive\n", " drive.mount('/content/drive')\n", " project_path = '/content/drive/MyDrive/Colab Notebooks/LLM-1B-Lab'\n", " sys.path.append(project_path)\n", "except ImportError:\n", " sys.path.insert(0, '..')\n", "\n", "import math\n", "import torch\n", "\n", "from llm_lab.config import ModelConfig, DataConfig, TrainConfig\n", "from llm_lab.model import LLMModel\n", "from llm_lab.data import setup_data_pipeline\n", "from llm_lab.training import LossDebugger\n", "from llm_lab.utils import auto_configure, get_device" ], "id": "23b92c8d" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Configuration\n", "\n", "The cell below loads the `base_1b` preset. For a quick run on CPU, switch to the `debug_10m` preset."
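, "\n", "\n", "The configuration cell prints an expected initial loss of `ln(vocab_size)`. As a minimal sketch of where that number comes from, assume a randomly initialized model predicts every token with roughly equal probability. The vocabulary size of 32,000 and the tensor names below are illustrative and are not taken from `llm_lab`:\n", "\n", "```python\n", "import math\n", "\n", "import torch\n", "import torch.nn.functional as F\n", "\n", "V = 32_000  # hypothetical vocabulary size\n", "logits = torch.zeros(8, V)  # all-zero logits give a uniform distribution over tokens\n", "targets = torch.randint(0, V, (8,))  # arbitrary target token ids\n", "\n", "loss = F.cross_entropy(logits, targets)\n", "# Under a uniform prediction, cross-entropy equals ln(V) for any target.\n", "assert abs(loss.item() - math.log(V)) < 1e-3\n", "```\n", "\n", "Level 1 later compares the real model's first loss against this value."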
], "id": "f3daf6ad" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# --- Config ---\n", "model_config = ModelConfig.base_1b()\n", "data_config = DataConfig(\n", " max_seq_len=model_config.max_seq_len,\n", " batch_size=4,\n", ")\n", "train_config = TrainConfig.base_1b()\n", "\n", "# --- Device / dtype ---\n", "train_config = auto_configure(train_config)\n", "device = get_device()\n", "dtype = train_config.torch_dtype\n", "\n", "vocab_size = data_config.vocab_size\n", "print(f\"Device: {device}, dtype: {dtype}\")\n", "print(f\"Vocab size: {vocab_size:,}\")\n", "print(f\"Expected initial loss: ln({vocab_size}) = {math.log(vocab_size):.2f}\")" ], "id": "dc7aa763" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# --- Model ---\n", "model = LLMModel(model_config).to(device)\n", "print(f\"Model parameters: {model.count_parameters():,}\")\n", "\n", "# --- Data pipeline (GPT-2 tokenizer used automatically) ---\n", "tokenizer, train_dl, val_dl = setup_data_pipeline(config=data_config)" ], "id": "6ed5cdad" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 0.1 Training History (Mock Metrics History)\n", "\n", "A mock `metrics_history` is provided so you can test Level 0 / 3 / 4 / scenario detection\n", "without running actual training.\n", "\n", "After actual training, use `trainer.metrics.history` instead:\n", "```python\n", "# metrics_history = trainer.metrics.history\n", "```" ], "id": "58a40e59" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import random\n", "random.seed(42)\n", "\n", "expected_initial = math.log(vocab_size) # ~10.37\n", "\n", "# --- Scenario A: loss stuck near ln(vocab_size) ---\n", "# loss barely decreasing (suspected data/implementation bug)\n", "# diagnose_status condition: loss_change < 0.1\n", "n_steps_a = 200\n", "mock_history_a = {\n", " \"step\": list(range(1, n_steps_a + 1)),\n", " 
\"train_loss\": [expected_initial - 0.0002 * i + random.uniform(-0.02, 0.02)\n", " for i in range(n_steps_a)],\n", " \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_a)],\n", " \"grad_norm\": [random.uniform(0.3, 0.8) for _ in range(n_steps_a)],\n", " \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_a)],\n", " \"gpu_mem_gb\": [random.uniform(1.0, 2.0) for _ in range(n_steps_a)],\n", " \"val_loss\": [expected_initial + random.uniform(-0.1, 0.1)\n", " for _ in range(0, n_steps_a, 50)],\n", " \"val_ppl\": [math.exp(expected_initial) + random.uniform(-50, 50)\n", " for _ in range(0, n_steps_a, 50)],\n", "}\n", "print(f\"Mock A — train_loss: {mock_history_a['train_loss'][0]:.2f} -> {mock_history_a['train_loss'][-1]:.2f}\")\n", "print(f\" Expected: NO_DECREASE (loss barely changes)\")\n", "\n", "# --- Scenario B: loss decreasing then NaN ---\n", "n_steps_b = 200\n", "mock_history_b = {\n", " \"step\": list(range(1, n_steps_b + 1)),\n", " \"train_loss\": [expected_initial - 0.03 * i + random.uniform(-0.05, 0.05)\n", " for i in range(150)]\n", " + [float('nan')] * 50,\n", " \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_b)],\n", " \"grad_norm\": [random.uniform(0.3, 0.8) for _ in range(145)]\n", " + [random.uniform(5.0, 50.0) for _ in range(5)]\n", " + [float('nan')] * 50,\n", " \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_b)],\n", " \"gpu_mem_gb\": [random.uniform(1.0, 2.0) for _ in range(n_steps_b)],\n", " \"val_loss\": [expected_initial - 0.03 * i + random.uniform(-0.1, 0.1)\n", " for i in range(0, n_steps_b, 50)],\n", " \"val_ppl\": [math.exp(expected_initial - 0.03 * i)\n", " for i in range(0, n_steps_b, 50)],\n", "}\n", "print(f\"\\nMock B — train_loss starts normal, then NaN at step ~150\")\n", "print(f\" Expected: Scenario B (NaN detected)\")\n", "\n", "# --- Scenario C: loss decreased then increased ---\n", "n_steps_c = 200\n", "_c_val_losses = (\n", " [expected_initial - 0.02 
* i + random.uniform(-0.1, 0.1)\n", " for i in range(0, 120, 50)]\n", " + [expected_initial - 0.02 * 120 + 0.03 * (i - 120) + random.uniform(-0.1, 0.1)\n", " for i in range(120, n_steps_c, 50)]\n", ")\n", "mock_history_c = {\n", " \"step\": list(range(1, n_steps_c + 1)),\n", " \"train_loss\": [expected_initial - 0.04 * i + random.uniform(-0.05, 0.05)\n", " for i in range(120)]\n", " + [expected_initial - 0.04 * 120 + 0.02 * (i - 120) + random.uniform(-0.05, 0.05)\n", " for i in range(120, n_steps_c)],\n", " \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_c)],\n", " \"grad_norm\": [random.uniform(0.3, 0.8) for _ in range(n_steps_c)],\n", " \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_c)],\n", " \"gpu_mem_gb\": [random.uniform(1.0, 2.0) for _ in range(n_steps_c)],\n", " \"val_loss\": _c_val_losses,\n", " \"val_ppl\": [math.exp(v) for v in _c_val_losses],\n", "}\n", "print(f\"\\nMock C — train_loss: decrease → increase (bounce)\")\n", "print(f\" Expected: Scenario C (loss recovery)\")\n", "\n", "# --- Scenario D: loss stuck at high value (4.0) ---\n", "n_steps_d = 200\n", "mock_history_d = {\n", " \"step\": list(range(1, n_steps_d + 1)),\n", " \"train_loss\": [4.0 + random.uniform(-0.1, 0.1) for _ in range(n_steps_d)],\n", " \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_d)],\n", " \"grad_norm\": [random.uniform(0.1, 0.3) for _ in range(n_steps_d)],\n", " \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_d)],\n", " \"gpu_mem_gb\": [random.uniform(1.0, 2.0) for _ in range(n_steps_d)],\n", " \"val_loss\": [4.2 + random.uniform(-0.1, 0.1)\n", " for _ in range(0, n_steps_d, 50)],\n", " \"val_ppl\": [math.exp(4.2) + random.uniform(-5, 5)\n", " for _ in range(0, n_steps_d, 50)],\n", "}\n", "print(f\"\\nMock D — train_loss stuck at ~4.0\")\n", "print(f\" Expected: Scenario D (plateau at high value)\")\n", "\n", "# --- Normal: loss decreasing normally ---\n", "n_steps_n = 200\n", 
"mock_history_normal = {\n", " \"step\": list(range(1, n_steps_n + 1)),\n", " \"train_loss\": [expected_initial - 0.03 * i + random.uniform(-0.05, 0.05)\n", " for i in range(n_steps_n)],\n", " \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_n)],\n", " \"grad_norm\": [random.uniform(0.3, 0.8) for _ in range(n_steps_n)],\n", " \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_n)],\n", " \"gpu_mem_gb\": [random.uniform(1.0, 2.0) for _ in range(n_steps_n)],\n", " \"val_loss\": [expected_initial - 0.03 * i + random.uniform(-0.1, 0.1)\n", " for i in range(0, n_steps_n, 50)],\n", " \"val_ppl\": [math.exp(expected_initial - 0.03 * i)\n", " for i in range(0, n_steps_n, 50)],\n", "}\n", "print(f\"\\nMock Normal — train_loss: {mock_history_normal['train_loss'][0]:.2f} -> {mock_history_normal['train_loss'][-1]:.2f}\")\n", "print(f\" Expected: NORMAL (loss decreasing steadily)\")\n", "\n", "# --- Scenario E: loss diverging (increasing beyond initial) ---\n", "# diagnose_status condition: last_loss > expected_initial + 1.0\n", "# NO_DECREASE bypass: first_loss(5.0) < expected_initial - 2.0(8.37)\n", "n_steps_e = 200\n", "mock_history_e = {\n", " \"step\": list(range(1, n_steps_e + 1)),\n", " \"train_loss\": [5.0 + (7.0 / 199) * i + random.uniform(-0.1, 0.1)\n", " for i in range(n_steps_e)],\n", " \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_e)],\n", " \"grad_norm\": [random.uniform(0.5, 2.0) for _ in range(n_steps_e)],\n", " \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_e)],\n", " \"gpu_mem_gb\": [random.uniform(1.0, 2.0) for _ in range(n_steps_e)],\n", " \"val_loss\": [5.5 + (6.0 / 3) * i + random.uniform(-0.1, 0.1)\n", " for i in range(4)],\n", " \"val_ppl\": [math.exp(5.5 + (6.0 / 3) * i)\n", " for i in range(4)],\n", "}\n", "print(f\"\\nMock E — train_loss: {mock_history_e['train_loss'][0]:.2f} -> {mock_history_e['train_loss'][-1]:.2f}\")\n", "print(f\" Expected: DIVERGING (loss exceeds 
initial value)\")\n", "\n", "# --- Scenario F: unstable training (large spikes in recent steps) ---\n", "# diagnose_status condition: recent_std > 0.5 * recent_mean (last 50 steps)\n", "# NO_DECREASE bypass: loss_change > 0.1 (10.0 -> ~4.0)\n", "n_steps_f = 200\n", "_f_train_loss = (\n", " [10.0 - (6.0 / 149) * i + random.uniform(-0.05, 0.05) for i in range(150)]\n", " + [12.0 + random.uniform(-1.0, 1.0) if (i - 150) % 5 == 0\n", " else 4.0 + random.uniform(-0.3, 0.3)\n", " for i in range(150, n_steps_f)]\n", ")\n", "mock_history_f = {\n", " \"step\": list(range(1, n_steps_f + 1)),\n", " \"train_loss\": _f_train_loss,\n", " \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_f)],\n", " \"grad_norm\": [random.uniform(0.3, 0.8) for _ in range(150)]\n", " + [random.uniform(0.5, 5.0) for _ in range(50)],\n", " \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_f)],\n", " \"gpu_mem_gb\": [random.uniform(1.0, 2.0) for _ in range(n_steps_f)],\n", " \"val_loss\": [9.5, 7.0, 5.0, 4.5],\n", " \"val_ppl\": [math.exp(v) for v in [9.5, 7.0, 5.0, 4.5]],\n", "}\n", "print(f\"\\nMock F — train_loss: spikes in last 50 steps\")\n", "print(f\" Expected: UNSTABLE (high variance in recent window)\")\n", "\n", "# --- Scenario G: overfitting (train decreasing, val increasing) ---\n", "# diagnose_status condition: val_trend == \"increasing\" AND second_half_avg < first_half_avg\n", "# LOSS_BOUNCE bypass: min_idx >= 85% of steps (min at end)\n", "n_steps_g = 200\n", "mock_history_g = {\n", " \"step\": list(range(1, n_steps_g + 1)),\n", " \"train_loss\": [8.0 - 0.025 * i + random.uniform(-0.05, 0.05)\n", " for i in range(n_steps_g)],\n", " \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_g)],\n", " \"grad_norm\": [random.uniform(0.3, 0.8) for _ in range(n_steps_g)],\n", " \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_g)],\n", " \"gpu_mem_gb\": [random.uniform(1.0, 2.0) for _ in range(n_steps_g)],\n", " 
\"val_loss\": [4.0, 4.2, 5.0, 5.5],\n", " \"val_ppl\": [math.exp(v) for v in [4.0, 4.2, 5.0, 5.5]],\n", "}\n", "print(f\"\\nMock G — train_loss: {mock_history_g['train_loss'][0]:.2f} -> {mock_history_g['train_loss'][-1]:.2f}\")\n", "print(f\" Expected: OVERFITTING (train down, val up)\")\n", "\n", "# --- Scenario H: insufficient data (only 1 step) ---\n", "# diagnose_status condition: len(train_losses) < 2\n", "mock_history_h = {\n", " \"step\": [1],\n", " \"train_loss\": [5.0],\n", " \"learning_rate\": [1e-4],\n", " \"grad_norm\": [0.5],\n", " \"tokens_per_sec\":[10000],\n", " \"gpu_mem_gb\": [1.5],\n", " \"val_loss\": [],\n", " \"val_ppl\": [],\n", "}\n", "print(f\"\\nMock H — train_loss: only 1 step\")\n", "print(f\" Expected: INSUFFICIENT_DATA (need >= 2 steps)\")" ], "id": "0fc90eec" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Level 0 — Status Diagnosis\n", "\n", "Classifies the current training state using only `metrics_history`. The six main statuses are:\n", "\n", "| Status | Meaning | Severity |\n", "|--------|---------|----------|\n", "| `NORMAL` | Training normally | green |\n", "| `NO_DECREASE` | Loss not decreasing | red |\n", "| `DIVERGING` | Loss diverging | red |\n", "| `PLATEAU` | Loss stuck at high value | yellow |\n", "| `OVERFITTING` | train↓ val↑ | yellow |\n", "| `UNSTABLE` | High loss variance | yellow |\n", "\n", "The validation cell below also expects three statuses outside this table: `NAN_DETECTED`, `LOSS_BOUNCE`, and `INSUFFICIENT_DATA`." ], "id": "0eb757cb" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Full validation of diagnose_status\n", "test_cases = [\n", " (\"mock_history_a\", mock_history_a, \"NO_DECREASE\"),\n", " (\"mock_history_b\", mock_history_b, \"NAN_DETECTED\"),\n", " (\"mock_history_c\", mock_history_c, \"LOSS_BOUNCE\"),\n", " (\"mock_history_d\", mock_history_d, \"PLATEAU\"),\n", " (\"mock_history_e\", mock_history_e, \"DIVERGING\"),\n", " (\"mock_history_f\", mock_history_f, \"UNSTABLE\"),\n", " (\"mock_history_g\", mock_history_g, \"OVERFITTING\"),\n", " (\"mock_history_h\", 
mock_history_h, \"INSUFFICIENT_DATA\"),\n", " (\"mock_history_normal\", mock_history_normal, \"NORMAL\"),\n", "]\n", "\n", "results = []\n", "for name, history, expected in test_cases:\n", " print(f\"\\n{'=' * 50}\")\n", " print(f\"Testing {name} — expected: {expected}\")\n", " print(\"=\" * 50)\n", " result = LossDebugger.diagnose_status(vocab_size, history)\n", " actual = result[\"status\"]\n", " match = \"PASS\" if actual == expected else \"FAIL\"\n", " results.append((name, expected, actual, match))\n", " print(f\"\\n>>> {match}: expected={expected}, got={actual}\")\n", "\n", "# Summary\n", "print(\"\\n\" + \"=\" * 60)\n", "print(\"SUMMARY\")\n", "print(\"=\" * 60)\n", "for name, expected, actual, match in results:\n", " icon = \"✅\" if match == \"PASS\" else \"❌\"\n", " print(f\" {icon} {name:25s} expected={expected:20s} got={actual}\")\n", "passed = sum(1 for *_, m in results if m == \"PASS\")\n", "print(f\"\\n {passed}/{len(results)} passed\")" ], "id": "b22fc65a" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "# Metrics saved by a real training run; adjust the path to your own checkpoint.\n", "metrics_path = \"/content/drive/MyDrive/llm-1b-lab/checkpoints/step_020000/metrics.pt\"\n", "if os.path.exists(metrics_path):\n", "    metrics_history = torch.load(metrics_path, weights_only=False)\n", "    result = LossDebugger.diagnose_status(vocab_size, metrics_history)\n", "    print(f\"\\nReal checkpoint diagnosis: {result['status']}\")\n", "else:\n", "    print(f\"No metrics file at {metrics_path}; skipping the real-checkpoint diagnosis.\")" ], "id": "da4a7552" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Level 1 — Data / Implementation Checks\n", "\n", "Roughly **70% of loss problems** originate from data or implementation bugs, so this level runs 6 checks:\n", "\n", "1. **Shift relationship** — verify `targets[t] == input_ids[t+1]`\n", "2. **Token range** — verify `0 <= ids < vocab_size`\n", "3. **Initial loss** — verify `≈ ln(vocab_size)` from random weights\n", "4. **Single-batch overfit** — repeat one batch for 200 steps → verify loss ≈ 0\n", "5. **Tokenizer round-trip** — verify text is preserved after encode → decode\n", "6. 
**Data quality** — visually inspect sample text" ], "id": "793c1fc8" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "level1 = LossDebugger.check_data_pipeline(\n", " model=model,\n", " dataloader=train_dl,\n", " tokenizer=tokenizer,\n", " vocab_size=vocab_size,\n", " device=device,\n", " dtype=dtype,\n", ")\n", "\n", "print(f\"\\nPassed: {len(level1['passed'])}, Failed: {len(level1['failed'])}\")\n", "for f in level1['failed']:\n", " print(f\" FAILED: {f['name']} — {f['detail']}\")" ], "id": "d774e1c8" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Level 2 — Numerical Stability\n", "\n", "Runs one forward + backward pass to check:\n", "\n", "- **Mixed Precision config** — whether RMSNorm computes in fp32, loss dtype\n", "- **Gradient** — NaN/Inf/large gradient check\n", "- **Activation** — mean/std/max for each transformer layer\n", "- **Logit scale** — whether output logits are in a reasonable range (< 1000)\n", "\n", "### Mixed Precision Best Practices\n", "\n", "```python\n", "# ❌ Risk: adding large + small values in bf16 → small values lost\n", "# ✅ Fix: compute inside RMSNorm in float32\n", "def forward(self, x):  # forward method of the RMSNorm module\n", "    x_float = x.float()  # bf16 → fp32\n", "    # rsqrt gives the inverse RMS, so the input is multiplied by it\n", "    inv_rms = torch.rsqrt(x_float.pow(2).mean(-1, keepdim=True) + self.eps)\n", "    return (x_float * inv_rms).to(x.dtype) * self.weight\n", "\n", "# ✅ Compute loss in float32 as well (V is the vocabulary size):\n", "logits_fp32 = logits.float()\n", "loss = F.cross_entropy(logits_fp32.view(-1, V), targets.view(-1))\n", "```\n", "\n", "### Common Numerical Issues\n", "\n", "| Symptom | Likely Cause | Solution |\n", "|---------|-------------|---------|\n", "| Loss → NaN | Very large values fed into Softmax | Check logit scale, review initialization |\n", "| Loss → Inf | log of 0 | Add eps, set ignore_index |\n", "| Loss oscillating | fp16 gradient underflow | Switch to bf16 or use GradScaler |\n", "| Late-stage NaN | Gradual activation increase | Verify RMSNorm, check weight decay |" ], "id": "eb7e5160" }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "level2 = LossDebugger.check_numerical_stability(\n", " model=model,\n", " dataloader=train_dl,\n", " device=device,\n", " dtype=dtype,\n", ")" ], "id": "1beffcec" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Level 3 — Hyperparameter Diagnosis\n", "\n", "Analyzes `metrics_history` and `TrainConfig` to check:\n", "\n", "- **LR**: if grad norm frequently hits clip limit → LR too high\n", "- **Batch size**: if loss variance is large → batch too small\n", "- **Warmup**: if many initial loss spikes → warmup too short\n", "- **β₂**: whether using LLM standard (0.95) instead of PyTorch default (0.999)\n", "- **Weight Decay**: 0 risks overfitting; standard is 0.1\n", "\n", "### Tuning Priority (by impact)\n", "\n", "| Rank | Parameter | Impact | Notes |\n", "|------|-----------|--------|-------|\n", "| 1 | **Learning Rate** | 10x | Tune first |\n", "| 2 | **Batch Size** | High | Adjust together with LR |\n", "| 3 | **Warmup Steps** | Medium | Early-stage stability |\n", "| 4 | **Weight Decay** | Medium | Adjust when overfitting (usually fixed at 0.1) |\n", "| 5 | **β₁, β₂** | Low | β₂=0.95 recommended (LLaMA, TinyLlama, OLMo) |\n", "| 6 | **Gradient Clip** | Low | Usually fixed at 1.0 |\n", "\n", "### β₂ Selection Guide\n", "\n", "- **PyTorch default** β₂=0.999 → reflects gradient statistics of last ~1000 steps\n", "- **LLM standard** β₂=0.95 → reflects gradient statistics of last ~20 steps\n", "- LLM training has continuously shifting data distribution, so fast adaptation (lower β₂) is beneficial\n", "- Lower β₂ helps mitigate loss spikes (Cattaneo & Shigida 2025)\n", "\n", "### Batch-LR Scaling\n", "\n", "- Batch ×2 → LR ×√2 (square root scaling, safer)\n", "- Batch ×2 → LR ×2 (linear scaling, used in GPT-3 etc.)\n", "- Effective batch 64~512 is typical for 1B models" ], "id": "0cc8f6f9" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": 
[], "source": [ "level3 = LossDebugger.diagnose_hyperparameters(\n", " metrics_history=mock_history_a,\n", " config=train_config,\n", ")" ], "id": "217f1cd3" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1 LR Range Test\n", "\n", "Increases LR exponentially from `1e-7` to `1e-1` while recording loss.\n", "The recommended peak LR is the LR at the steepest loss decrease, divided by 3.\n", "\n", "> May take some time (approximately 1 minute for debug_10m).\n", "> Run this once before starting actual training." ], "id": "fea85ba4" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lr_result = LossDebugger.lr_range_test(\n", " model=model,\n", " dataloader=train_dl,\n", " device=device,\n", " dtype=dtype,\n", " steps=100, # debug_10m: 100 steps for speed\n", ")\n", "\n", "print(f\"\\nSuggested peak LR: {lr_result['suggested_lr']:.2e}\")" ], "id": "bfd64503" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Level 4 — Fitting Diagnosis\n", "\n", "Compares train/val loss trends to identify 4 cases:\n", "\n", "| Case | Train | Val | Diagnosis |\n", "|------|-------|-----|-----------|\n", "| 1 | ↓ | ↓ | Normal (training in progress) |\n", "| 2 | → (high) | → (high) | Underfitting (insufficient model/data) |\n", "| 3 | ↓ | ↑ | Overfitting (suspected data repetition) |\n", "| 4 | → (low) | → (low) | Converged (training complete) |\n", "\n", "### Underfitting Diagnosis Steps\n", "\n", "1. **Insufficient model capacity?** → train a 2x larger model on the same data → if loss drops, it's a capacity issue\n", "2. **Insufficient training?** → check if loss curve is still decreasing (Chinchilla: 1B parameters → ~20B tokens)\n", "3. **Convergence too slow due to small LR?** → experiment with LR ×2\n", "4. **Data quality issue?** → read sampled data directly\n", "\n", "### Overfitting Diagnosis Steps\n", "\n", "1. **Insufficient/repeated data?** → overfitting risk spikes when epoch > 1\n", "2. 
**Insufficient weight decay?** → 0.1 is standard (LLaMA, TinyLlama, GPT-3, OLMo)\n", "3. **Insufficient data diversity?** → need to mix multiple domains\n", "\n", "### Important Facts About Overfitting in LLM Pretraining\n", "\n", "- Overfitting is very rare when training within **1 epoch**\n", "- When overfitting is seen, the cause is usually **data repetition** (epoch > 1)\n", "- **Dropout** is **almost never used** in modern LLM pretraining\n", " - Pythia, TinyLlama, OLMo, LLaMA all use dropout=0\n", " - With sufficient data, the data itself is the best regularization\n", " - Dropout is useful for fine-tuning with small data" ], "id": "edb91923" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_params = sum(p.numel() for p in model.parameters())\n", "total_tokens = len(mock_history_normal[\"train_loss\"]) * train_config.tokens_per_step\n", "\n", "level4 = LossDebugger.diagnose_fitting(\n", " metrics_history=mock_history_normal,\n", " model_params=model_params,\n", " total_tokens=total_tokens,\n", ")" ], "id": "505692ed" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. 
Level 5 — Architecture / Initialization Check\n", "\n", "Runs one forward pass through the model to collect **per-layer activation statistics** and **weight distributions**.\n", "\n", "### Activation Diagnostics\n", "\n", "- **healthy**: std is stable across all layers\n", "- **exploding**: std increases sharply in later layers → initialization scale too large\n", "- **vanishing**: std decreases sharply in later layers → initialization scale too small\n", "- **anomaly**: sudden change at a specific layer → implementation bug in that layer\n", "\n", "### Weight Initialization Diagnostics\n", "\n", "GPT-2 style initialization (normal distribution, mean 0):\n", "- **Standard Linear**: `N(0, std=0.02)`\n", "- **Residual projection** (o_proj, down_proj): `N(0, std=0.02/√(2×layers))`\n", " → Shrinks each residual branch's contribution so activations stay stable in deep models\n", "\n", "### Ablation Study (verifying per-component impact)\n", "\n", "| Experiment | Expected Result | If Abnormal |\n", "|-----------|----------------|-------------|\n", "| RMSNorm → LayerNorm | Negligible loss difference | Normalization implementation bug |\n", "| RoPE → absolute positional embedding | Small difference for short sequences | Check RoPE implementation |\n", "| SwiGLU → ReLU FFN | Loss +0.05~0.15 | Check SwiGLU implementation |\n", "| GQA → MHA | Almost same loss (memory difference only) | KV repeat bug |" ], "id": "3b65db22" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "level5 = LossDebugger.check_architecture(\n", " model=model,\n", " dataloader=train_dl,\n", " device=device,\n", ")\n", "\n", "print(f\"\\n>>> Diagnosis: {level5['diagnosis']}\")" ], "id": "e1d7a64e" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Full Diagnostics (run_diagnostics)\n", "\n", "Runs all levels above at once. 
Use the `levels` parameter to select which levels to run.\n", "\n", "```python\n", "# Example usage after actual training:\n", "# report = LossDebugger.run_diagnostics(\n", "# model=model, dataloader=train_dl, tokenizer=tokenizer,\n", "# train_config=train_config,\n", "# metrics_history=trainer.metrics.history,\n", "# device=device, dtype=dtype,\n", "# )\n", "```" ], "id": "8cc8a5f1" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "report = LossDebugger.run_diagnostics(\n", " model=model,\n", " dataloader=train_dl,\n", " tokenizer=tokenizer,\n", " train_config=train_config,\n", " metrics_history=mock_history_a,\n", " device=device,\n", " dtype=dtype,\n", " vocab_size=vocab_size,\n", " levels=[0, 1, 2, 3, 4, 5],\n", ")" ], "id": "c7f3a147" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Study Roadmap\n", "\n", "A systematic learning path for LLM training optimization." ], "id": "093ac866" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "LossDebugger.print_study_roadmap()" ], "id": "fc94ba5f" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. Debugging Tips\n", "\n", "**Recommended order:**\n", "1. Check status with Level 0 → tells you which level to inspect\n", "2. Level 1 (data) → first! 70% of problems are found here\n", "3. Level 2 (numerical stability) → resolve NaN/Inf issues\n", "4. Level 3 (hyperparameters) → tune LR first (10x impact)\n", "5. Level 4 (fitting diagnosis) → check after sufficient training\n", "6. Level 5 (architecture) → when cause not found in above levels\n", "\n", "**Scenario Detection** → automatically identifies which scenario based on symptoms\n", "\n", "### Key References\n", "\n", "| Resource | Content | Priority |\n", "|----------|---------|----------|\n", "| Karpathy \"Recipe for Training NNs\" | Practical debugging mindset | ⭐⭐⭐ |\n", "| Hoffmann et al. 
2022 (Chinchilla) | Scaling Law essentials | ⭐⭐⭐ |\n", "| Kaplan et al. 2020 (Scaling Laws) | Loss prediction formula | ⭐⭐⭐ |\n", "| Touvron et al. 2023 (LLaMA) | Training details for 7B–65B models | ⭐⭐ |\n", "| Biderman et al. 2023 (Pythia) | Public training logs, reproducibility | ⭐⭐ |\n", "| Zhang et al. 2024 (TinyLlama) | 1.1B model trained on 3T tokens | ⭐⭐ |\n", "| Groeneveld et al. 2024 (OLMo) | Fully open LLM framework | ⭐⭐ |\n", "| Li et al. 2018 (Loss Landscape) | Intuition for loss topology | ⭐⭐ |\n", "| Loshchilov & Hutter 2019 (AdamW) | Optimizer fundamentals | ⭐⭐ |\n", "| Yang et al. 2022 (μP) | Hyperparameter transfer | ⭐ |\n", "\n", "---\n", "**Previous step:** Train the model in `03_training.ipynb`. \n", "**Next step:** Evaluate the trained model in `04_evaluation.ipynb`." ], "id": "eb140c33" } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.13" } }, "nbformat": 4, "nbformat_minor": 4 }