File size: 32,231 Bytes

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 05. Loss Debugging (5-Level Diagnostic Framework)\n",
    "\n",
    "A framework for systematically diagnosing why loss is not decreasing as expected during training.\n",
    "\n",
    "**Always check from the lowest level first** — find data bugs before tuning hyperparameters.\n",
    "\n",
    "```\n",
    "Level 0: Status Diagnosis      ← Classify current training state (6 types)\n",
    "Level 1: Data / Implementation ← Most common cause (70%)\n",
    "Level 2: Numerical Stability   ← NaN/Inf, activation explosion\n",
    "Level 3: Hyperparameters       ← LR, batch size, warmup\n",
    "Level 4: Fitting Diagnosis     ← overfitting vs underfitting\n",
    "Level 5: Architecture          ← initialization, per-layer activation\n",
    "```"
   ],
   "id": "19f7e954"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# No additional packages required\n",
    "# LossDebugger only uses torch and built-in llm_lab modules"
   ],
   "id": "af6a605f"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "\n",
    "try:\n",
    "    import google.colab\n",
    "    from google.colab import drive\n",
    "    drive.mount('/content/drive')\n",
    "    project_path = '/content/drive/MyDrive/Colab Notebooks/LLM-1B-Lab'\n",
    "    sys.path.append(project_path)\n",
    "except ImportError:\n",
    "    sys.path.insert(0, '..')\n",
    "\n",
    "import math\n",
    "import torch\n",
    "\n",
    "from llm_lab.config import ModelConfig, DataConfig, TrainConfig\n",
    "from llm_lab.model import LLMModel\n",
    "from llm_lab.data import setup_data_pipeline\n",
    "from llm_lab.training import LossDebugger\n",
    "from llm_lab.utils import auto_configure, get_device"
   ],
   "id": "23b92c8d"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Configuration\n",
    "\n",
    "Use the `debug_10m` preset so it runs quickly even on CPU."
   ],
   "id": "f3daf6ad"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- Config ---\n",
    "model_config = ModelConfig.base_1b()\n",
    "data_config = DataConfig(\n",
    "    max_seq_len=model_config.max_seq_len,\n",
    "    batch_size=4,\n",
    ")\n",
    "train_config = TrainConfig.base_1b()\n",
    "\n",
    "# --- Device / dtype ---\n",
    "train_config = auto_configure(train_config)\n",
    "device = get_device()\n",
    "dtype = train_config.torch_dtype\n",
    "\n",
    "vocab_size = data_config.vocab_size\n",
    "print(f\"Device: {device}, dtype: {dtype}\")\n",
    "print(f\"Vocab size: {vocab_size:,}\")\n",
    "print(f\"Expected initial loss: ln({vocab_size}) = {math.log(vocab_size):.2f}\")"
   ],
   "id": "dc7aa763"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- Model ---\n",
    "model = LLMModel(model_config).to(device)\n",
    "print(f\"Model parameters: {model.count_parameters():,}\")\n",
    "\n",
    "# --- Data pipeline (GPT-2 tokenizer used automatically) ---\n",
    "tokenizer, train_dl, val_dl = setup_data_pipeline(config=data_config)"
   ],
   "id": "6ed5cdad"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 0.1 Training History (Mock Metrics History)\n",
    "\n",
    "A mock `metrics_history` is provided so you can test Level 0 / 3 / 4 / scenario detection\n",
    "without running actual training.\n",
    "\n",
    "After actual training, use `trainer.metrics.history` instead:\n",
    "```python\n",
    "# metrics_history = trainer.metrics.history\n",
    "```"
   ],
   "id": "58a40e59"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "random.seed(42)\n",
    "\n",
    "expected_initial = math.log(vocab_size)  # ~10.37\n",
    "\n",
    "# --- Scenario A: loss stuck near ln(vocab_size) ---\n",
    "# loss barely decreasing (suspected data/implementation bug)\n",
    "# diagnose_status condition: loss_change < 0.1\n",
    "n_steps_a = 200\n",
    "mock_history_a = {\n",
    "    \"step\":          list(range(1, n_steps_a + 1)),\n",
    "    \"train_loss\":    [expected_initial - 0.0002 * i + random.uniform(-0.02, 0.02)\n",
    "                      for i in range(n_steps_a)],\n",
    "    \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_a)],\n",
    "    \"grad_norm\":     [random.uniform(0.3, 0.8) for _ in range(n_steps_a)],\n",
    "    \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_a)],\n",
    "    \"gpu_mem_gb\":    [random.uniform(1.0, 2.0) for _ in range(n_steps_a)],\n",
    "    \"val_loss\":      [expected_initial + random.uniform(-0.1, 0.1)\n",
    "                      for _ in range(0, n_steps_a, 50)],\n",
    "    \"val_ppl\":       [math.exp(expected_initial) + random.uniform(-50, 50)\n",
    "                      for _ in range(0, n_steps_a, 50)],\n",
    "}\n",
    "print(f\"Mock A — train_loss: {mock_history_a['train_loss'][0]:.2f} -> {mock_history_a['train_loss'][-1]:.2f}\")\n",
    "print(f\"  Expected: NO_DECREASE (loss barely changes)\")\n",
    "\n",
    "# --- Scenario B: loss decreasing then NaN ---\n",
    "n_steps_b = 200\n",
    "mock_history_b = {\n",
    "    \"step\":          list(range(1, n_steps_b + 1)),\n",
    "    \"train_loss\":    [expected_initial - 0.03 * i + random.uniform(-0.05, 0.05)\n",
    "                      for i in range(150)]\n",
    "                     + [float('nan')] * 50,\n",
    "    \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_b)],\n",
    "    \"grad_norm\":     [random.uniform(0.3, 0.8) for _ in range(145)]\n",
    "                     + [random.uniform(5.0, 50.0) for _ in range(5)]\n",
    "                     + [float('nan')] * 50,\n",
    "    \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_b)],\n",
    "    \"gpu_mem_gb\":    [random.uniform(1.0, 2.0) for _ in range(n_steps_b)],\n",
    "    \"val_loss\":      [expected_initial - 0.03 * i + random.uniform(-0.1, 0.1)\n",
    "                      for i in range(0, n_steps_b, 50)],\n",
    "    \"val_ppl\":       [math.exp(expected_initial - 0.03 * i)\n",
    "                      for i in range(0, n_steps_b, 50)],\n",
    "}\n",
    "print(f\"\\nMock B — train_loss starts normal, then NaN at step ~150\")\n",
    "print(f\"  Expected: Scenario B (NaN detected)\")\n",
    "\n",
    "# --- Scenario C: loss decreased then increased ---\n",
    "n_steps_c = 200\n",
    "_c_val_losses = (\n",
    "    [expected_initial - 0.02 * i + random.uniform(-0.1, 0.1)\n",
    "     for i in range(0, 120, 50)]\n",
    "    + [expected_initial - 0.02 * 120 + 0.03 * (i - 120) + random.uniform(-0.1, 0.1)\n",
    "       for i in range(120, n_steps_c, 50)]\n",
    ")\n",
    "mock_history_c = {\n",
    "    \"step\":          list(range(1, n_steps_c + 1)),\n",
    "    \"train_loss\":    [expected_initial - 0.04 * i + random.uniform(-0.05, 0.05)\n",
    "                      for i in range(120)]\n",
    "                     + [expected_initial - 0.04 * 120 + 0.02 * (i - 120) + random.uniform(-0.05, 0.05)\n",
    "                        for i in range(120, n_steps_c)],\n",
    "    \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_c)],\n",
    "    \"grad_norm\":     [random.uniform(0.3, 0.8) for _ in range(n_steps_c)],\n",
    "    \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_c)],\n",
    "    \"gpu_mem_gb\":    [random.uniform(1.0, 2.0) for _ in range(n_steps_c)],\n",
    "    \"val_loss\":      _c_val_losses,\n",
    "    \"val_ppl\":       [math.exp(v) for v in _c_val_losses],\n",
    "}\n",
    "print(f\"\\nMock C — train_loss: decrease → increase (bounce)\")\n",
    "print(f\"  Expected: Scenario C (loss recovery)\")\n",
    "\n",
    "# --- Scenario D: loss stuck at high value (4.0) ---\n",
    "n_steps_d = 200\n",
    "mock_history_d = {\n",
    "    \"step\":          list(range(1, n_steps_d + 1)),\n",
    "    \"train_loss\":    [4.0 + random.uniform(-0.1, 0.1) for _ in range(n_steps_d)],\n",
    "    \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_d)],\n",
    "    \"grad_norm\":     [random.uniform(0.1, 0.3) for _ in range(n_steps_d)],\n",
    "    \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_d)],\n",
    "    \"gpu_mem_gb\":    [random.uniform(1.0, 2.0) for _ in range(n_steps_d)],\n",
    "    \"val_loss\":      [4.2 + random.uniform(-0.1, 0.1)\n",
    "                      for _ in range(0, n_steps_d, 50)],\n",
    "    \"val_ppl\":       [math.exp(4.2) + random.uniform(-5, 5)\n",
    "                      for _ in range(0, n_steps_d, 50)],\n",
    "}\n",
    "print(f\"\\nMock D — train_loss stuck at ~4.0\")\n",
    "print(f\"  Expected: Scenario D (plateau at high value)\")\n",
    "\n",
    "# --- Normal: loss decreasing normally ---\n",
    "n_steps_n = 200\n",
    "mock_history_normal = {\n",
    "    \"step\":          list(range(1, n_steps_n + 1)),\n",
    "    \"train_loss\":    [expected_initial - 0.03 * i + random.uniform(-0.05, 0.05)\n",
    "                      for i in range(n_steps_n)],\n",
    "    \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_n)],\n",
    "    \"grad_norm\":     [random.uniform(0.3, 0.8) for _ in range(n_steps_n)],\n",
    "    \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_n)],\n",
    "    \"gpu_mem_gb\":    [random.uniform(1.0, 2.0) for _ in range(n_steps_n)],\n",
    "    \"val_loss\":      [expected_initial - 0.03 * i + random.uniform(-0.1, 0.1)\n",
    "                      for i in range(0, n_steps_n, 50)],\n",
    "    \"val_ppl\":       [math.exp(expected_initial - 0.03 * i)\n",
    "                      for i in range(0, n_steps_n, 50)],\n",
    "}\n",
    "print(f\"\\nMock Normal — train_loss: {mock_history_normal['train_loss'][0]:.2f} -> {mock_history_normal['train_loss'][-1]:.2f}\")\n",
    "print(f\"  Expected: NORMAL (loss decreasing steadily)\")\n",
    "\n",
    "# --- Scenario E: loss diverging (increasing beyond initial) ---\n",
    "# diagnose_status condition: last_loss > expected_initial + 1.0\n",
    "# NO_DECREASE bypass: first_loss(5.0) < expected_initial - 2.0(8.37)\n",
    "n_steps_e = 200\n",
    "mock_history_e = {\n",
    "    \"step\":          list(range(1, n_steps_e + 1)),\n",
    "    \"train_loss\":    [5.0 + (7.0 / 199) * i + random.uniform(-0.1, 0.1)\n",
    "                      for i in range(n_steps_e)],\n",
    "    \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_e)],\n",
    "    \"grad_norm\":     [random.uniform(0.5, 2.0) for _ in range(n_steps_e)],\n",
    "    \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_e)],\n",
    "    \"gpu_mem_gb\":    [random.uniform(1.0, 2.0) for _ in range(n_steps_e)],\n",
    "    \"val_loss\":      [5.5 + (6.0 / 3) * i + random.uniform(-0.1, 0.1)\n",
    "                      for i in range(4)],\n",
    "    \"val_ppl\":       [math.exp(5.5 + (6.0 / 3) * i)\n",
    "                      for i in range(4)],\n",
    "}\n",
    "print(f\"\\nMock E — train_loss: {mock_history_e['train_loss'][0]:.2f} -> {mock_history_e['train_loss'][-1]:.2f}\")\n",
    "print(f\"  Expected: DIVERGING (loss exceeds initial value)\")\n",
    "\n",
    "# --- Scenario F: unstable training (large spikes in recent steps) ---\n",
    "# diagnose_status condition: recent_std > 0.5 * recent_mean (last 50 steps)\n",
    "# NO_DECREASE bypass: loss_change > 0.1 (10.0 -> ~4.0)\n",
    "n_steps_f = 200\n",
    "_f_train_loss = (\n",
    "    [10.0 - (6.0 / 149) * i + random.uniform(-0.05, 0.05) for i in range(150)]\n",
    "    + [12.0 + random.uniform(-1.0, 1.0) if (i - 150) % 5 == 0\n",
    "       else 4.0 + random.uniform(-0.3, 0.3)\n",
    "       for i in range(150, n_steps_f)]\n",
    ")\n",
    "mock_history_f = {\n",
    "    \"step\":          list(range(1, n_steps_f + 1)),\n",
    "    \"train_loss\":    _f_train_loss,\n",
    "    \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_f)],\n",
    "    \"grad_norm\":     [random.uniform(0.3, 0.8) for _ in range(150)]\n",
    "                     + [random.uniform(0.5, 5.0) for _ in range(50)],\n",
    "    \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_f)],\n",
    "    \"gpu_mem_gb\":    [random.uniform(1.0, 2.0) for _ in range(n_steps_f)],\n",
    "    \"val_loss\":      [9.5, 7.0, 5.0, 4.5],\n",
    "    \"val_ppl\":       [math.exp(v) for v in [9.5, 7.0, 5.0, 4.5]],\n",
    "}\n",
    "print(f\"\\nMock F — train_loss: spikes in last 50 steps\")\n",
    "print(f\"  Expected: UNSTABLE (high variance in recent window)\")\n",
    "\n",
    "# --- Scenario G: overfitting (train decreasing, val increasing) ---\n",
    "# diagnose_status condition: val_trend == \"increasing\" AND second_half_avg < first_half_avg\n",
    "# LOSS_BOUNCE bypass: min_idx >= 85% of steps (min at end)\n",
    "n_steps_g = 200\n",
    "mock_history_g = {\n",
    "    \"step\":          list(range(1, n_steps_g + 1)),\n",
    "    \"train_loss\":    [8.0 - 0.025 * i + random.uniform(-0.05, 0.05)\n",
    "                      for i in range(n_steps_g)],\n",
    "    \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_g)],\n",
    "    \"grad_norm\":     [random.uniform(0.3, 0.8) for _ in range(n_steps_g)],\n",
    "    \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_g)],\n",
    "    \"gpu_mem_gb\":    [random.uniform(1.0, 2.0) for _ in range(n_steps_g)],\n",
    "    \"val_loss\":      [4.0, 4.2, 5.0, 5.5],\n",
    "    \"val_ppl\":       [math.exp(v) for v in [4.0, 4.2, 5.0, 5.5]],\n",
    "}\n",
    "print(f\"\\nMock G — train_loss: {mock_history_g['train_loss'][0]:.2f} -> {mock_history_g['train_loss'][-1]:.2f}\")\n",
    "print(f\"  Expected: OVERFITTING (train down, val up)\")\n",
    "\n",
    "# --- Scenario H: insufficient data (only 1 step) ---\n",
    "# diagnose_status condition: len(train_losses) < 2\n",
    "mock_history_h = {\n",
    "    \"step\":          [1],\n",
    "    \"train_loss\":    [5.0],\n",
    "    \"learning_rate\": [1e-4],\n",
    "    \"grad_norm\":     [0.5],\n",
    "    \"tokens_per_sec\":[10000],\n",
    "    \"gpu_mem_gb\":    [1.5],\n",
    "    \"val_loss\":      [],\n",
    "    \"val_ppl\":       [],\n",
    "}\n",
    "print(f\"\\nMock H — train_loss: only 1 step\")\n",
    "print(f\"  Expected: INSUFFICIENT_DATA (need >= 2 steps)\")"
   ],
   "id": "0fc90eec"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Level 0 — Status Diagnosis\n",
    "\n",
    "Classifies the current training state into one of 6 categories using only `metrics_history`:\n",
    "\n",
    "| Status | Meaning | Severity |\n",
    "|--------|---------|----------|\n",
    "| `NORMAL` | Training normally | green |\n",
    "| `NO_DECREASE` | Loss not decreasing | red |\n",
    "| `DIVERGING` | Loss diverging | red |\n",
    "| `PLATEAU` | Loss stuck at high value | yellow |\n",
    "| `OVERFITTING` | train↓ val↑ | yellow |\n",
    "| `UNSTABLE` | High loss variance | yellow |"
   ],
   "id": "0eb757cb"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Full validation of diagnose_status\n",
    "test_cases = [\n",
    "    (\"mock_history_a\",      mock_history_a,      \"NO_DECREASE\"),\n",
    "    (\"mock_history_b\",      mock_history_b,      \"NAN_DETECTED\"),\n",
    "    (\"mock_history_c\",      mock_history_c,      \"LOSS_BOUNCE\"),\n",
    "    (\"mock_history_d\",      mock_history_d,      \"PLATEAU\"),\n",
    "    (\"mock_history_e\",      mock_history_e,      \"DIVERGING\"),\n",
    "    (\"mock_history_f\",      mock_history_f,      \"UNSTABLE\"),\n",
    "    (\"mock_history_g\",      mock_history_g,      \"OVERFITTING\"),\n",
    "    (\"mock_history_h\",      mock_history_h,      \"INSUFFICIENT_DATA\"),\n",
    "    (\"mock_history_normal\", mock_history_normal,  \"NORMAL\"),\n",
    "]\n",
    "\n",
    "results = []\n",
    "for name, history, expected in test_cases:\n",
    "    print(f\"\\n{'=' * 50}\")\n",
    "    print(f\"Testing {name} — expected: {expected}\")\n",
    "    print(\"=\" * 50)\n",
    "    result = LossDebugger.diagnose_status(vocab_size, history)\n",
    "    actual = result[\"status\"]\n",
    "    match = \"PASS\" if actual == expected else \"FAIL\"\n",
    "    results.append((name, expected, actual, match))\n",
    "    print(f\"\\n>>> {match}: expected={expected}, got={actual}\")\n",
    "\n",
    "# Summary\n",
    "print(\"\\n\" + \"=\" * 60)\n",
    "print(\"SUMMARY\")\n",
    "print(\"=\" * 60)\n",
    "for name, expected, actual, match in results:\n",
    "    icon = \"✅\" if match == \"PASS\" else \"❌\"\n",
    "    print(f\"  {icon} {name:25s} expected={expected:20s} got={actual}\")\n",
    "passed = sum(1 for *_, m in results if m == \"PASS\")\n",
    "print(f\"\\n  {passed}/{len(results)} passed\")"
   ],
   "id": "b22fc65a"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "metrics_path = \"/content/drive/MyDrive/llm-1b-lab/checkpoints/step_020000/metrics.pt\"\n",
    "metrics_history = torch.load(metrics_path, weights_only=False)\n",
    "\n",
    "result = LossDebugger.diagnose_status(vocab_size, metrics_history)\n",
    "print(f\"\\nReal checkpoint diagnosis: {result['status']}\")"
   ],
   "id": "da4a7552"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Level 1 — Data / Implementation Checks\n",
    "\n",
    "**70% of loss problems** originate from data pipeline bugs. Checks 6 things:\n",
    "\n",
    "1. **Shift relationship** — verify `targets[t] == input_ids[t+1]`\n",
    "2. **Token range** — verify `0 <= ids < vocab_size`\n",
    "3. **Initial loss** — verify `≈ ln(vocab_size)` from random weights\n",
    "4. **Single-batch overfit** — repeat one batch for 200 steps → verify loss ≈ 0\n",
    "5. **Tokenizer round-trip** — verify text is preserved after encode → decode\n",
    "6. **Data quality** — visually inspect sample text"
   ],
   "id": "793c1fc8"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "level1 = LossDebugger.check_data_pipeline(\n",
    "    model=model,\n",
    "    dataloader=train_dl,\n",
    "    tokenizer=tokenizer,\n",
    "    vocab_size=vocab_size,\n",
    "    device=device,\n",
    "    dtype=dtype,\n",
    ")\n",
    "\n",
    "print(f\"\\nPassed: {len(level1['passed'])}, Failed: {len(level1['failed'])}\")\n",
    "for f in level1['failed']:\n",
    "    print(f\"  FAILED: {f['name']} — {f['detail']}\")"
   ],
   "id": "d774e1c8"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Level 2 — Numerical Stability\n",
    "\n",
    "Runs one forward + backward pass to check:\n",
    "\n",
    "- **Mixed Precision config** — whether RMSNorm computes in fp32, loss dtype\n",
    "- **Gradient** — NaN/Inf/large gradient check\n",
    "- **Activation** — mean/std/max for each transformer layer\n",
    "- **Logit scale** — whether output logits are in a reasonable range (< 1000)\n",
    "\n",
    "### Mixed Precision Best Practices\n",
    "\n",
    "```python\n",
    "# ❌ Risk: adding large + small values in bf16 → small values lost\n",
    "# ✅ Fix: compute inside RMSNorm in float32\n",
    "x_float = x.float()  # bf16 → fp32\n",
    "rms = torch.rsqrt(x_float.pow(2).mean(-1, keepdim=True) + eps)\n",
    "return (x_float * rms).to(x.dtype) * self.weight\n",
    "\n",
    "# ✅ Compute loss in float32 as well:\n",
    "logits_fp32 = logits.float()\n",
    "loss = F.cross_entropy(logits_fp32.view(-1, V), targets.view(-1))\n",
    "```\n",
    "\n",
    "### Common Numerical Issues\n",
    "\n",
    "| Symptom | Likely Cause | Solution |\n",
    "|---------|-------------|---------|\n",
    "| Loss → NaN | Very large values fed into Softmax | Check logit scale, review initialization |\n",
    "| Loss → Inf | log of 0 | Add eps, set ignore_index |\n",
    "| Loss oscillating | fp16 gradient underflow | Switch to bf16 or use GradScaler |\n",
    "| Late-stage NaN | Gradual activation increase | Verify RMSNorm, check weight decay |"
   ],
   "id": "eb7e5160"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "level2 = LossDebugger.check_numerical_stability(\n",
    "    model=model,\n",
    "    dataloader=train_dl,\n",
    "    device=device,\n",
    "    dtype=dtype,\n",
    ")"
   ],
   "id": "1beffcec"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Level 3 — Hyperparameter Diagnosis\n",
    "\n",
    "Analyzes `metrics_history` and `TrainConfig` to check:\n",
    "\n",
    "- **LR**: if grad norm frequently hits clip limit → LR too high\n",
    "- **Batch size**: if loss variance is large → batch too small\n",
    "- **Warmup**: if many initial loss spikes → warmup too short\n",
    "- **β₂**: whether using LLM standard (0.95) instead of PyTorch default (0.999)\n",
    "- **Weight Decay**: 0 risks overfitting; standard is 0.1\n",
    "\n",
    "### Tuning Priority (by impact)\n",
    "\n",
    "| Rank | Parameter | Impact | Notes |\n",
    "|------|-----------|--------|-------|\n",
    "| 1 | **Learning Rate** | 10x | Tune first |\n",
    "| 2 | **Batch Size** | High | Adjust together with LR |\n",
    "| 3 | **Warmup Steps** | Medium | Early-stage stability |\n",
    "| 4 | **Weight Decay** | Medium | Adjust when overfitting (usually fixed at 0.1) |\n",
    "| 5 | **β₁, β₂** | Low | β₂=0.95 recommended (LLaMA, TinyLlama, OLMo) |\n",
    "| 6 | **Gradient Clip** | Low | Usually fixed at 1.0 |\n",
    "\n",
    "### β₂ Selection Guide\n",
    "\n",
    "- **PyTorch default** β₂=0.999 → reflects gradient statistics of last ~1000 steps\n",
    "- **LLM standard** β₂=0.95 → reflects gradient statistics of last ~20 steps\n",
    "- LLM training has continuously shifting data distribution, so fast adaptation (lower β₂) is beneficial\n",
    "- Lower β₂ helps mitigate loss spikes (Cattaneo & Shigida 2025)\n",
    "\n",
    "### Batch-LR Scaling\n",
    "\n",
    "- Batch ×2 → LR ×√2 (square root scaling, safer)\n",
    "- Batch ×2 → LR ×2 (linear scaling, used in GPT-3 etc.)\n",
    "- Effective batch 64~512 is typical for 1B models"
   ],
   "id": "0cc8f6f9"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "level3 = LossDebugger.diagnose_hyperparameters(\n",
    "    metrics_history=mock_history_a,\n",
    "    config=train_config,\n",
    ")"
   ],
   "id": "217f1cd3"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1 LR Range Test\n",
    "\n",
    "Exponentially increases LR from `1e-7` → `1e-1` while recording loss.\n",
    "The recommended peak LR is the LR at the steepest loss decrease ÷ 3.\n",
    "\n",
    "> May take some time (approximately 1 minute for debug_10m).\n",
    "> Run this once before starting actual training."
   ],
   "id": "fea85ba4"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lr_result = LossDebugger.lr_range_test(\n",
    "    model=model,\n",
    "    dataloader=train_dl,\n",
    "    device=device,\n",
    "    dtype=dtype,\n",
    "    steps=100,  # debug_10m: 100 steps for speed\n",
    ")\n",
    "\n",
    "print(f\"\\nSuggested peak LR: {lr_result['suggested_lr']:.2e}\")"
   ],
   "id": "bfd64503"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Level 4 — Fitting Diagnosis\n",
    "\n",
    "Compares train/val loss trends to identify 4 cases:\n",
    "\n",
    "| Case | Train | Val | Diagnosis |\n",
    "|------|-------|-----|-----------|\n",
    "| 1 | ↓ | ↓ | Normal (training in progress) |\n",
    "| 2 | → | → | Underfitting (insufficient model/data) |\n",
    "| 3 | ↓ | ↑ | Overfitting (suspected data repetition) |\n",
    "| 4 | → | → (low) | Converged (training complete) |\n",
    "\n",
    "### Underfitting Diagnosis Steps\n",
    "\n",
    "1. **Insufficient model capacity?** → train a 2x larger model on same data → if loss drops, it's a capacity issue\n",
    "2. **Insufficient training?** → check if loss curve is still decreasing (Chinchilla: 1B → ~20B tokens)\n",
    "3. **Convergence too slow due to small LR?** → experiment with LR ×2\n",
    "4. **Data quality issue?** → read sampled data directly\n",
    "\n",
    "### Overfitting Diagnosis Steps\n",
    "\n",
    "1. **Insufficient/repeated data?** → overfitting risk spikes when epoch > 1\n",
    "2. **Insufficient weight decay?** → 0.1 is standard (LLaMA, TinyLlama, GPT-3, OLMo)\n",
    "3. **Insufficient data diversity?** → need to mix multiple domains\n",
    "\n",
    "### Important Facts About Overfitting in LLM Pretraining\n",
    "\n",
    "- Overfitting is very rare when training within **1 epoch**\n",
    "- When overfitting is seen, the cause is usually **data repetition** (epoch > 1)\n",
    "- **Dropout** is **almost never used** in modern LLM pretraining\n",
    "  - Pythia, TinyLlama, OLMo, LLaMA all use dropout=0\n",
    "  - With sufficient data, the data itself is the best regularization\n",
    "  - Dropout is useful for fine-tuning with small data"
   ],
   "id": "edb91923"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_params = sum(p.numel() for p in model.parameters())\n",
    "total_tokens = len(mock_history_normal[\"train_loss\"]) * train_config.tokens_per_step\n",
    "\n",
    "level4 = LossDebugger.diagnose_fitting(\n",
    "    metrics_history=mock_history_normal,\n",
    "    model_params=model_params,\n",
    "    total_tokens=total_tokens,\n",
    ")"
   ],
   "id": "505692ed"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Level 5 — Architecture / Initialization Check\n",
    "\n",
    "Runs one forward pass through the model to collect **per-layer activation statistics** and **weight distributions**.\n",
    "\n",
    "### Activation Diagnostics\n",
    "\n",
    "- **healthy**: std is stable across all layers\n",
    "- **exploding**: std increases sharply in later layers → initialization scale too large\n",
    "- **vanishing**: std decreases sharply in later layers → initialization scale too small\n",
    "- **anomaly**: sudden change at a specific layer → implementation bug in that layer\n",
    "\n",
    "### Weight Initialization Diagnostics\n",
    "\n",
    "GPT-2 style initialization:\n",
    "- **Standard Linear**: `N(0, 0.02)`\n",
    "- **Residual projection** (o_proj, down_proj): `N(0, 0.02/√(2×layers))`\n",
    "  → Reduces residual contribution in deep models for stability\n",
    "\n",
    "### Ablation Study (verifying per-component impact)\n",
    "\n",
    "| Experiment | Expected Result | If Abnormal |\n",
    "|-----------|----------------|-------------|\n",
    "| RMSNorm → LayerNorm | Negligible loss difference | Normalization implementation bug |\n",
    "| RoPE → absolute positional embedding | Small difference for short sequences | Check RoPE implementation |\n",
    "| SwiGLU → ReLU FFN | Loss +0.05~0.15 | Check SwiGLU implementation |\n",
    "| GQA → MHA | Almost same loss (memory difference only) | KV repeat bug |"
   ],
   "id": "3b65db22"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "level5 = LossDebugger.check_architecture(\n",
    "    model=model,\n",
    "    dataloader=train_dl,\n",
    "    device=device,\n",
    ")\n",
    "\n",
    "print(f\"\\n>>> Diagnosis: {level5['diagnosis']}\")"
   ],
   "id": "e1d7a64e"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Full Diagnostics (run_diagnostics)\n",
    "\n",
    "Runs all levels above at once. Use the `levels` parameter to select which levels to run.\n",
    "\n",
    "```python\n",
    "# Example usage after actual training:\n",
    "# report = LossDebugger.run_diagnostics(\n",
    "#     model=model, dataloader=train_dl, tokenizer=tokenizer,\n",
    "#     train_config=train_config,\n",
    "#     metrics_history=trainer.metrics.history,\n",
    "#     device=device, dtype=dtype,\n",
    "# )\n",
    "```"
   ],
   "id": "8cc8a5f1"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "report = LossDebugger.run_diagnostics(\n",
    "    model=model,\n",
    "    dataloader=train_dl,\n",
    "    tokenizer=tokenizer,\n",
    "    train_config=train_config,\n",
    "    metrics_history=mock_history_a,\n",
    "    device=device,\n",
    "    dtype=dtype,\n",
    "    vocab_size=vocab_size,\n",
    "    levels=[0, 1, 2, 3, 4, 5],\n",
    ")"
   ],
   "id": "c7f3a147"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Study Roadmap\n",
    "\n",
    "A systematic learning path for LLM training optimization."
   ],
   "id": "093ac866"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "LossDebugger.print_study_roadmap()"
   ],
   "id": "fc94ba5f"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. Debugging Tips\n",
    "\n",
    "**Recommended order:**\n",
    "1. Check status with Level 0 → tells you which level to inspect\n",
    "2. Level 1 (data) → first! 70% of problems are found here\n",
    "3. Level 2 (numerical stability) → resolve NaN/Inf issues\n",
    "4. Level 3 (hyperparameters) → tune LR first (10x impact)\n",
    "5. Level 4 (fitting diagnosis) → check after sufficient training\n",
    "6. Level 5 (architecture) → when cause not found in above levels\n",
    "\n",
    "**Scenario Detection** → automatically identifies which scenario based on symptoms\n",
    "\n",
    "### Key References\n",
    "\n",
    "| Resource | Content | Priority |\n",
    "|----------|---------|----------|\n",
    "| Karpathy \"Recipe for Training NNs\" | Practical debugging mindset | ⭐⭐⭐ |\n",
    "| Hoffmann et al. 2022 (Chinchilla) | Scaling Law essentials | ⭐⭐⭐ |\n",
    "| Kaplan et al. 2020 (Scaling Laws) | Loss prediction formula | ⭐⭐⭐ |\n",
    "| Touvron et al. 2023 (LLaMA) | 1B-scale model training details | ⭐⭐ |\n",
    "| Biderman et al. 2023 (Pythia) | Public training logs, reproducibility | ⭐⭐ |\n",
    "| Zhang et al. 2024 (TinyLlama) | 1.1B model trained on 3T tokens | ⭐⭐ |\n",
    "| Groeneveld et al. 2024 (OLMo) | Fully open LLM framework | ⭐⭐ |\n",
    "| Li et al. 2018 (Loss Landscape) | Intuition for loss topology | ⭐⭐ |\n",
    "| Loshchilov & Hutter 2019 (AdamW) | Optimizer fundamentals | ⭐⭐ |\n",
    "| Yang et al. 2022 (μP) | Hyperparameter transfer | ⭐ |\n",
    "\n",
    "---\n",
    "**Previous step:** Train the model in `03_training.ipynb`.  \n",
    "**Next step:** Evaluate the trained model in `04_evaluation.ipynb`."
   ],
   "id": "eb140c33"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}