Remove redundant detect_scenario from LossDebugger
Level 0 diagnose_status already covers the same 4 scenarios (no decrease,
NaN, loss bounce, plateau) with better accuracy (moving-average smoothing,
val trend awareness, more granular categories). Remove the duplicate method
and its notebook section to avoid confusion.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- llm_lab/training/debugger.py +0 -126
- notebooks/05_debugging.ipynb +59 -134
llm_lab/training/debugger.py
CHANGED
|
@@ -1289,129 +1289,6 @@ class LossDebugger:
|
|
| 1289 |
"weight_issues": weight_issues,
|
| 1290 |
}
|
| 1291 |
|
| 1292 |
-
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 1293 |
-
# Scenario Auto-Detection
|
| 1294 |
-
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 1295 |
-
|
| 1296 |
-
@staticmethod
|
| 1297 |
-
def detect_scenario(
|
| 1298 |
-
metrics_history: Dict[str, list],
|
| 1299 |
-
vocab_size: int = 32000,
|
| 1300 |
-
) -> Dict[str, Any]:
|
| 1301 |
-
"""Auto-detect which debugging scenario applies.
|
| 1302 |
-
|
| 1303 |
-
Scenarios (from the guide):
|
| 1304 |
-
A: Loss stuck at ~10.37 (doesn't decrease at all)
|
| 1305 |
-
B: Loss was decreasing then suddenly NaN
|
| 1306 |
-
C: Loss decreased to X then started increasing
|
| 1307 |
-
D: Loss stuck at high value (e.g. 4.0 for 1B model)
|
| 1308 |
-
"""
|
| 1309 |
-
print(_header("Scenario Auto-Detection"))
|
| 1310 |
-
|
| 1311 |
-
train_losses = metrics_history.get("train_loss", [])
|
| 1312 |
-
val_losses = [v for v in metrics_history.get("val_loss", []) if v is not None]
|
| 1313 |
-
expected_initial = math.log(vocab_size)
|
| 1314 |
-
|
| 1315 |
-
if len(train_losses) < 5:
|
| 1316 |
-
print(" [!] Not enough data to detect scenario.")
|
| 1317 |
-
return {"scenario": "unknown", "steps": []}
|
| 1318 |
-
|
| 1319 |
-
# Filter out NaN for statistics
|
| 1320 |
-
valid_losses = [l for l in train_losses if not math.isnan(l)]
|
| 1321 |
-
if len(valid_losses) < 5:
|
| 1322 |
-
print(" [!] Not enough valid (non-NaN) data to detect scenario.")
|
| 1323 |
-
return {"scenario": "unknown", "steps": []}
|
| 1324 |
-
|
| 1325 |
-
first_loss = valid_losses[0]
|
| 1326 |
-
last_loss = valid_losses[-1]
|
| 1327 |
-
has_nan = len(valid_losses) < len(train_losses)
|
| 1328 |
-
min_loss = min(valid_losses)
|
| 1329 |
-
min_loss_idx = next(i for i, l in enumerate(train_losses)
|
| 1330 |
-
if not math.isnan(l) and l == min_loss)
|
| 1331 |
-
loss_recovered = last_loss > min_loss + 0.3 and min_loss_idx < len(train_losses) * 0.8
|
| 1332 |
-
|
| 1333 |
-
# Trend analysis: is loss still decreasing?
|
| 1334 |
-
mid = len(valid_losses) // 2
|
| 1335 |
-
first_half_avg = sum(valid_losses[:mid]) / mid
|
| 1336 |
-
second_half_avg = sum(valid_losses[mid:]) / (len(valid_losses) - mid)
|
| 1337 |
-
still_decreasing = (first_half_avg - second_half_avg) > 0.3
|
| 1338 |
-
|
| 1339 |
-
scenario = "unknown"
|
| 1340 |
-
steps: List[str] = []
|
| 1341 |
-
|
| 1342 |
-
# Scenario A: Loss stuck near initial value
|
| 1343 |
-
if abs(last_loss - expected_initial) < 1.5 and abs(first_loss - last_loss) < 0.5:
|
| 1344 |
-
scenario = "A"
|
| 1345 |
-
steps = [
|
| 1346 |
-
"1. Run single-batch overfit test β if it fails, model/loss has a bug",
|
| 1347 |
-
"2. Check if gradients are zero β optimizer.step() may be missing",
|
| 1348 |
-
"3. Verify input_ids/targets shift β data pipeline bug",
|
| 1349 |
-
"4. Check LR β is it set to 0?",
|
| 1350 |
-
"5. Check model.train() β eval mode changes norm/dropout behavior",
|
| 1351 |
-
]
|
| 1352 |
-
|
| 1353 |
-
# Scenario B: NaN appeared
|
| 1354 |
-
elif has_nan:
|
| 1355 |
-
nan_idx = next(i for i, l in enumerate(train_losses) if math.isnan(l))
|
| 1356 |
-
scenario = "B"
|
| 1357 |
-
steps = [
|
| 1358 |
-
f"1. NaN appeared at step ~{nan_idx}. Check that batch's data for bad tokens",
|
| 1359 |
-
"2. Check gradient norm just before NaN β was there a spike?",
|
| 1360 |
-
"3. Check LR schedule β does NaN coincide with warmup end?",
|
| 1361 |
-
"4. Check specific layer weights for Inf values",
|
| 1362 |
-
"5. Try switching to fp32 to see if it's a mixed precision issue",
|
| 1363 |
-
" (Pythia-1B had irrecoverable fp16 loss spikes β switched to bf16,",
|
| 1364 |
-
" Biderman et al. 2023)",
|
| 1365 |
-
]
|
| 1366 |
-
|
| 1367 |
-
# Scenario C: Loss decreased then increased
|
| 1368 |
-
elif loss_recovered:
|
| 1369 |
-
scenario = "C"
|
| 1370 |
-
steps = [
|
| 1371 |
-
"1. Check Train and Val loss simultaneously:",
|
| 1372 |
-
" - Both increasing β LR too high (check cosine decay)",
|
| 1373 |
-
" - Only train increasing β data quality changed (streaming order)",
|
| 1374 |
-
" - Only val increasing β overfitting started",
|
| 1375 |
-
"2. Verify LR schedule is decaying as intended",
|
| 1376 |
-
"3. Check data shuffling β same data repeating?",
|
| 1377 |
-
]
|
| 1378 |
-
|
| 1379 |
-
# Scenario D: Loss stuck at high value (not still decreasing)
|
| 1380 |
-
elif (last_loss > _EXPECTED_TRAIN_LOSS[1]
|
| 1381 |
-
and abs(last_loss - min_loss) < 0.3
|
| 1382 |
-
and not still_decreasing):
|
| 1383 |
-
scenario = "D"
|
| 1384 |
-
total_tokens = len(train_losses) * 262144 # approximate
|
| 1385 |
-
steps = [
|
| 1386 |
-
f"1. Check total tokens trained: ~{total_tokens / 1e9:.1f}B "
|
| 1387 |
-
f"(need 5-10B for 1B model)",
|
| 1388 |
-
"2. Compare with smaller model (100M) at same step β "
|
| 1389 |
-
"if 100M is lower, 1B may have a bug",
|
| 1390 |
-
"3. Run LR range test β current LR may not be optimal",
|
| 1391 |
-
"4. Sample training data β check for noise, duplicates, low quality",
|
| 1392 |
-
"5. Try different effective batch size (64 vs 128 vs 256)",
|
| 1393 |
-
]
|
| 1394 |
-
|
| 1395 |
-
else:
|
| 1396 |
-
scenario = "none"
|
| 1397 |
-
steps = ["Training appears normal. No specific scenario detected."]
|
| 1398 |
-
|
| 1399 |
-
label = {
|
| 1400 |
-
"A": "Loss stuck at initial value (~10.37)",
|
| 1401 |
-
"B": "Loss was decreasing, then NaN",
|
| 1402 |
-
"C": "Loss decreased then started increasing",
|
| 1403 |
-
"D": f"Loss stuck at high value (>{_EXPECTED_TRAIN_LOSS[1]})",
|
| 1404 |
-
"none": "No problematic scenario detected",
|
| 1405 |
-
"unknown": "Cannot determine",
|
| 1406 |
-
}
|
| 1407 |
-
|
| 1408 |
-
print(f"\n Detected: Scenario {scenario} β {label.get(scenario, 'Unknown')}")
|
| 1409 |
-
print(f"\n Recommended debugging steps:")
|
| 1410 |
-
for step in steps:
|
| 1411 |
-
print(f" {step}")
|
| 1412 |
-
|
| 1413 |
-
return {"scenario": scenario, "label": label.get(scenario, ""), "steps": steps}
|
| 1414 |
-
|
| 1415 |
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 1416 |
# Main Entry Point
|
| 1417 |
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
|
@@ -1491,9 +1368,6 @@ class LossDebugger:
|
|
| 1491 |
model, dataloader, device,
|
| 1492 |
)
|
| 1493 |
|
| 1494 |
-
# Auto-detect scenario
|
| 1495 |
-
report["scenario"] = LossDebugger.detect_scenario(metrics_history, vocab_size)
|
| 1496 |
-
|
| 1497 |
# Final summary
|
| 1498 |
print("\n" + "β" * 60)
|
| 1499 |
print(" Diagnostics Complete")
|
|
|
|
| 1289 |
"weight_issues": weight_issues,
|
| 1290 |
}
|
| 1291 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1292 |
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 1293 |
# Main Entry Point
|
| 1294 |
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
|
|
|
| 1368 |
model, dataloader, device,
|
| 1369 |
)
|
| 1370 |
|
|
|
|
|
|
|
|
|
|
| 1371 |
# Final summary
|
| 1372 |
print("\n" + "β" * 60)
|
| 1373 |
print(" Diagnostics Complete")
|
notebooks/05_debugging.ipynb
CHANGED
|
@@ -18,7 +18,8 @@
|
|
| 18 |
"Level 4: Fitting Diagnosis β overfitting vs underfitting\n",
|
| 19 |
"Level 5: Architecture β initialization, per-layer activation\n",
|
| 20 |
"```"
|
| 21 |
-
]
|
|
|
|
| 22 |
},
|
| 23 |
{
|
| 24 |
"cell_type": "code",
|
|
@@ -28,7 +29,8 @@
|
|
| 28 |
"source": [
|
| 29 |
"# No additional packages required\n",
|
| 30 |
"# LossDebugger only uses torch and built-in llm_lab modules"
|
| 31 |
-
]
|
|
|
|
| 32 |
},
|
| 33 |
{
|
| 34 |
"cell_type": "code",
|
|
@@ -55,7 +57,8 @@
|
|
| 55 |
"from llm_lab.data import setup_data_pipeline\n",
|
| 56 |
"from llm_lab.training import LossDebugger\n",
|
| 57 |
"from llm_lab.utils import auto_configure, get_device"
|
| 58 |
-
]
|
|
|
|
| 59 |
},
|
| 60 |
{
|
| 61 |
"cell_type": "markdown",
|
|
@@ -64,7 +67,8 @@
|
|
| 64 |
"## 0. Configuration\n",
|
| 65 |
"\n",
|
| 66 |
"Use the `debug_10m` preset so it runs quickly even on CPU."
|
| 67 |
-
]
|
|
|
|
| 68 |
},
|
| 69 |
{
|
| 70 |
"cell_type": "code",
|
|
@@ -89,7 +93,8 @@
|
|
| 89 |
"print(f\"Device: {device}, dtype: {dtype}\")\n",
|
| 90 |
"print(f\"Vocab size: {vocab_size:,}\")\n",
|
| 91 |
"print(f\"Expected initial loss: ln({vocab_size}) = {math.log(vocab_size):.2f}\")"
|
| 92 |
-
]
|
|
|
|
| 93 |
},
|
| 94 |
{
|
| 95 |
"cell_type": "code",
|
|
@@ -103,7 +108,8 @@
|
|
| 103 |
"\n",
|
| 104 |
"# --- Data pipeline (GPT-2 tokenizer used automatically) ---\n",
|
| 105 |
"tokenizer, train_dl, val_dl = setup_data_pipeline(config=data_config)"
|
| 106 |
-
]
|
|
|
|
| 107 |
},
|
| 108 |
{
|
| 109 |
"cell_type": "markdown",
|
|
@@ -118,7 +124,8 @@
|
|
| 118 |
"```python\n",
|
| 119 |
"# metrics_history = trainer.metrics.history\n",
|
| 120 |
"```"
|
| 121 |
-
]
|
|
|
|
| 122 |
},
|
| 123 |
{
|
| 124 |
"cell_type": "code",
|
|
@@ -134,7 +141,6 @@
|
|
| 134 |
"# --- Scenario A: loss stuck near ln(vocab_size) ---\n",
|
| 135 |
"# loss barely decreasing (suspected data/implementation bug)\n",
|
| 136 |
"# diagnose_status condition: loss_change < 0.1\n",
|
| 137 |
-
"# detect_scenario condition: abs(last - initial) < 1.5, abs(first - last) < 0.5\n",
|
| 138 |
"n_steps_a = 200\n",
|
| 139 |
"mock_history_a = {\n",
|
| 140 |
" \"step\": list(range(1, n_steps_a + 1)),\n",
|
|
@@ -308,7 +314,8 @@
|
|
| 308 |
"}\n",
|
| 309 |
"print(f\"\\nMock H β train_loss: only 1 step\")\n",
|
| 310 |
"print(f\" Expected: INSUFFICIENT_DATA (need >= 2 steps)\")"
|
| 311 |
-
]
|
|
|
|
| 312 |
},
|
| 313 |
{
|
| 314 |
"cell_type": "markdown",
|
|
@@ -326,7 +333,8 @@
|
|
| 326 |
"| `PLATEAU` | Loss stuck at high value | yellow |\n",
|
| 327 |
"| `OVERFITTING` | trainβ valβ | yellow |\n",
|
| 328 |
"| `UNSTABLE` | High loss variance | yellow |"
|
| 329 |
-
]
|
|
|
|
| 330 |
},
|
| 331 |
{
|
| 332 |
"cell_type": "code",
|
|
@@ -367,7 +375,8 @@
|
|
| 367 |
" print(f\" {icon} {name:25s} expected={expected:20s} got={actual}\")\n",
|
| 368 |
"passed = sum(1 for *_, m in results if m == \"PASS\")\n",
|
| 369 |
"print(f\"\\n {passed}/{len(results)} passed\")"
|
| 370 |
-
]
|
|
|
|
| 371 |
},
|
| 372 |
{
|
| 373 |
"cell_type": "code",
|
|
@@ -380,7 +389,8 @@
|
|
| 380 |
"\n",
|
| 381 |
"result = LossDebugger.diagnose_status(vocab_size, metrics_history)\n",
|
| 382 |
"print(f\"\\nReal checkpoint diagnosis: {result['status']}\")"
|
| 383 |
-
]
|
|
|
|
| 384 |
},
|
| 385 |
{
|
| 386 |
"cell_type": "markdown",
|
|
@@ -396,7 +406,8 @@
|
|
| 396 |
"4. **Single-batch overfit** β repeat one batch for 200 steps β verify loss β 0\n",
|
| 397 |
"5. **Tokenizer round-trip** β verify text is preserved after encode β decode\n",
|
| 398 |
"6. **Data quality** β visually inspect sample text"
|
| 399 |
-
]
|
|
|
|
| 400 |
},
|
| 401 |
{
|
| 402 |
"cell_type": "code",
|
|
@@ -416,7 +427,8 @@
|
|
| 416 |
"print(f\"\\nPassed: {len(level1['passed'])}, Failed: {len(level1['failed'])}\")\n",
|
| 417 |
"for f in level1['failed']:\n",
|
| 418 |
" print(f\" FAILED: {f['name']} β {f['detail']}\")"
|
| 419 |
-
]
|
|
|
|
| 420 |
},
|
| 421 |
{
|
| 422 |
"cell_type": "markdown",
|
|
@@ -453,7 +465,8 @@
|
|
| 453 |
"| Loss β Inf | log of 0 | Add eps, set ignore_index |\n",
|
| 454 |
"| Loss oscillating | fp16 gradient underflow | Switch to bf16 or use GradScaler |\n",
|
| 455 |
"| Late-stage NaN | Gradual activation increase | Verify RMSNorm, check weight decay |"
|
| 456 |
-
]
|
|
|
|
| 457 |
},
|
| 458 |
{
|
| 459 |
"cell_type": "code",
|
|
@@ -467,7 +480,8 @@
|
|
| 467 |
" device=device,\n",
|
| 468 |
" dtype=dtype,\n",
|
| 469 |
")"
|
| 470 |
-
]
|
|
|
|
| 471 |
},
|
| 472 |
{
|
| 473 |
"cell_type": "markdown",
|
|
@@ -506,7 +520,8 @@
|
|
| 506 |
"- Batch Γ2 β LR Γβ2 (square root scaling, safer)\n",
|
| 507 |
"- Batch Γ2 β LR Γ2 (linear scaling, used in GPT-3 etc.)\n",
|
| 508 |
"- Effective batch 64~512 is typical for 1B models"
|
| 509 |
-
]
|
|
|
|
| 510 |
},
|
| 511 |
{
|
| 512 |
"cell_type": "code",
|
|
@@ -518,7 +533,8 @@
|
|
| 518 |
" metrics_history=mock_history_a,\n",
|
| 519 |
" config=train_config,\n",
|
| 520 |
")"
|
| 521 |
-
]
|
|
|
|
| 522 |
},
|
| 523 |
{
|
| 524 |
"cell_type": "markdown",
|
|
@@ -531,7 +547,8 @@
|
|
| 531 |
"\n",
|
| 532 |
"> May take some time (approximately 1 minute for debug_10m).\n",
|
| 533 |
"> Run this once before starting actual training."
|
| 534 |
-
]
|
|
|
|
| 535 |
},
|
| 536 |
{
|
| 537 |
"cell_type": "code",
|
|
@@ -548,7 +565,8 @@
|
|
| 548 |
")\n",
|
| 549 |
"\n",
|
| 550 |
"print(f\"\\nSuggested peak LR: {lr_result['suggested_lr']:.2e}\")"
|
| 551 |
-
]
|
|
|
|
| 552 |
},
|
| 553 |
{
|
| 554 |
"cell_type": "markdown",
|
|
@@ -586,7 +604,8 @@
|
|
| 586 |
" - Pythia, TinyLlama, OLMo, LLaMA all use dropout=0\n",
|
| 587 |
" - With sufficient data, the data itself is the best regularization\n",
|
| 588 |
" - Dropout is useful for fine-tuning with small data"
|
| 589 |
-
]
|
|
|
|
| 590 |
},
|
| 591 |
{
|
| 592 |
"cell_type": "code",
|
|
@@ -602,7 +621,8 @@
|
|
| 602 |
" model_params=model_params,\n",
|
| 603 |
" total_tokens=total_tokens,\n",
|
| 604 |
")"
|
| 605 |
-
]
|
|
|
|
| 606 |
},
|
| 607 |
{
|
| 608 |
"cell_type": "markdown",
|
|
@@ -634,7 +654,8 @@
|
|
| 634 |
"| RoPE β absolute positional embedding | Small difference for short sequences | Check RoPE implementation |\n",
|
| 635 |
"| SwiGLU β ReLU FFN | Loss +0.05~0.15 | Check SwiGLU implementation |\n",
|
| 636 |
"| GQA β MHA | Almost same loss (memory difference only) | KV repeat bug |"
|
| 637 |
-
]
|
|
|
|
| 638 |
},
|
| 639 |
{
|
| 640 |
"cell_type": "code",
|
|
@@ -649,115 +670,14 @@
|
|
| 649 |
")\n",
|
| 650 |
"\n",
|
| 651 |
"print(f\"\\n>>> Diagnosis: {level5['diagnosis']}\")"
|
| 652 |
-
]
|
|
|
|
| 653 |
},
|
| 654 |
{
|
| 655 |
"cell_type": "markdown",
|
| 656 |
"metadata": {},
|
| 657 |
"source": [
|
| 658 |
-
"## 7.
|
| 659 |
-
"\n",
|
| 660 |
-
"Analyzes `metrics_history` to automatically identify which of the 4 scenarios applies:\n",
|
| 661 |
-
"\n",
|
| 662 |
-
"| Scenario | Symptom | Main Cause | Key Diagnostic |\n",
|
| 663 |
-
"|----------|---------|-----------|----------------|\n",
|
| 664 |
-
"| **A** | Loss stuck near ~10.37 | Data/implementation bug | Single-batch overfit test |\n",
|
| 665 |
-
"| **B** | Sudden NaN during loss decrease | Numerical instability | Check grad norm just before NaN |\n",
|
| 666 |
-
"| **C** | Loss increases after decreasing | LR/data issue | Check train/val simultaneously |\n",
|
| 667 |
-
"| **D** | Loss stuck at high value | Insufficient training/LR issue | Check token count, run LR Range Test |\n",
|
| 668 |
-
"\n",
|
| 669 |
-
"### Scenario-Specific Diagnostic Points\n",
|
| 670 |
-
"\n",
|
| 671 |
-
"**Scenario A**: \"Loss isn't decreasing from 10.37 at all\"\n",
|
| 672 |
-
"1. Single-batch overfit β if fails, model/loss bug\n",
|
| 673 |
-
"2. Check if gradients are 0 β missing `optimizer.step()`?\n",
|
| 674 |
-
"3. Check `input_ids/targets` shift β data pipeline bug\n",
|
| 675 |
-
"4. Check LR β is it set to 0?\n",
|
| 676 |
-
"5. Verify `model.train()` is called\n",
|
| 677 |
-
"\n",
|
| 678 |
-
"**Scenario B**: \"Loss was decreasing then suddenly NaN\"\n",
|
| 679 |
-
"1. Inspect the batch data at that step\n",
|
| 680 |
-
"2. Check gradient norm spike just before NaN\n",
|
| 681 |
-
"3. Compare LR schedule with NaN timing\n",
|
| 682 |
-
"4. Mixed precision issue β try reproducing with fp32\n",
|
| 683 |
-
" - (Pythia-1B: fp16 β bf16 switch case, Biderman et al. 2023)\n",
|
| 684 |
-
"\n",
|
| 685 |
-
"**Scenario C**: \"Loss decreased to 3.5 then went back up\"\n",
|
| 686 |
-
"1. Check train/val simultaneously:\n",
|
| 687 |
-
" - Both up β LR too high\n",
|
| 688 |
-
" - Only train up β data quality change (streaming order)\n",
|
| 689 |
-
" - Only val up β overfitting starting\n",
|
| 690 |
-
"2. Check LR schedule, check data shuffling\n",
|
| 691 |
-
"\n",
|
| 692 |
-
"**Scenario D**: \"Loss won't go below 4.0\"\n",
|
| 693 |
-
"1. Check training token count (if < 5B, insufficient training)\n",
|
| 694 |
-
"2. Compare with 100M model\n",
|
| 695 |
-
"3. Run LR Range Test\n",
|
| 696 |
-
"4. Sample and inspect data quality"
|
| 697 |
-
]
|
| 698 |
-
},
|
| 699 |
-
{
|
| 700 |
-
"cell_type": "code",
|
| 701 |
-
"execution_count": null,
|
| 702 |
-
"metadata": {},
|
| 703 |
-
"outputs": [],
|
| 704 |
-
"source": [
|
| 705 |
-
"# --- Scenario A ---\n",
|
| 706 |
-
"print(\"=\" * 50)\n",
|
| 707 |
-
"print(\"Testing Scenario A (loss stuck at initial value)\")\n",
|
| 708 |
-
"print(\"=\" * 50)\n",
|
| 709 |
-
"scenario_a = LossDebugger.detect_scenario(\n",
|
| 710 |
-
" metrics_history=mock_history_a,\n",
|
| 711 |
-
" vocab_size=vocab_size,\n",
|
| 712 |
-
")\n",
|
| 713 |
-
"print(f\"\\n>>> Detected: Scenario {scenario_a['scenario']}\")\n",
|
| 714 |
-
"\n",
|
| 715 |
-
"# --- Scenario B ---\n",
|
| 716 |
-
"print(\"\\n\" + \"=\" * 50)\n",
|
| 717 |
-
"print(\"Testing Scenario B (NaN appeared)\")\n",
|
| 718 |
-
"print(\"=\" * 50)\n",
|
| 719 |
-
"scenario_b = LossDebugger.detect_scenario(\n",
|
| 720 |
-
" metrics_history=mock_history_b,\n",
|
| 721 |
-
" vocab_size=vocab_size,\n",
|
| 722 |
-
")\n",
|
| 723 |
-
"print(f\"\\n>>> Detected: Scenario {scenario_b['scenario']}\")\n",
|
| 724 |
-
"\n",
|
| 725 |
-
"# --- Scenario C ---\n",
|
| 726 |
-
"print(\"\\n\" + \"=\" * 50)\n",
|
| 727 |
-
"print(\"Testing Scenario C (loss bounce)\")\n",
|
| 728 |
-
"print(\"=\" * 50)\n",
|
| 729 |
-
"scenario_c = LossDebugger.detect_scenario(\n",
|
| 730 |
-
" metrics_history=mock_history_c,\n",
|
| 731 |
-
" vocab_size=vocab_size,\n",
|
| 732 |
-
")\n",
|
| 733 |
-
"print(f\"\\n>>> Detected: Scenario {scenario_c['scenario']}\")\n",
|
| 734 |
-
"\n",
|
| 735 |
-
"# --- Scenario D ---\n",
|
| 736 |
-
"print(\"\\n\" + \"=\" * 50)\n",
|
| 737 |
-
"print(\"Testing Scenario D (loss plateau)\")\n",
|
| 738 |
-
"print(\"=\" * 50)\n",
|
| 739 |
-
"scenario_d = LossDebugger.detect_scenario(\n",
|
| 740 |
-
" metrics_history=mock_history_d,\n",
|
| 741 |
-
" vocab_size=vocab_size,\n",
|
| 742 |
-
")\n",
|
| 743 |
-
"print(f\"\\n>>> Detected: Scenario {scenario_d['scenario']}\")\n",
|
| 744 |
-
"\n",
|
| 745 |
-
"# --- Normal ---\n",
|
| 746 |
-
"print(\"\\n\" + \"=\" * 50)\n",
|
| 747 |
-
"print(\"Testing Normal scenario\")\n",
|
| 748 |
-
"print(\"=\" * 50)\n",
|
| 749 |
-
"scenario_n = LossDebugger.detect_scenario(\n",
|
| 750 |
-
" metrics_history=mock_history_normal,\n",
|
| 751 |
-
" vocab_size=vocab_size,\n",
|
| 752 |
-
")\n",
|
| 753 |
-
"print(f\"\\n>>> Detected: Scenario {scenario_n['scenario']}\")"
|
| 754 |
-
]
|
| 755 |
-
},
|
| 756 |
-
{
|
| 757 |
-
"cell_type": "markdown",
|
| 758 |
-
"metadata": {},
|
| 759 |
-
"source": [
|
| 760 |
-
"## 8. Full Diagnostics (run_diagnostics)\n",
|
| 761 |
"\n",
|
| 762 |
"Runs all levels above at once. Use the `levels` parameter to select which levels to run.\n",
|
| 763 |
"\n",
|
|
@@ -770,7 +690,8 @@
|
|
| 770 |
"# device=device, dtype=dtype,\n",
|
| 771 |
"# )\n",
|
| 772 |
"```"
|
| 773 |
-
]
|
|
|
|
| 774 |
},
|
| 775 |
{
|
| 776 |
"cell_type": "code",
|
|
@@ -789,16 +710,18 @@
|
|
| 789 |
" vocab_size=vocab_size,\n",
|
| 790 |
" levels=[0, 1, 2, 3, 4, 5],\n",
|
| 791 |
")"
|
| 792 |
-
]
|
|
|
|
| 793 |
},
|
| 794 |
{
|
| 795 |
"cell_type": "markdown",
|
| 796 |
"metadata": {},
|
| 797 |
"source": [
|
| 798 |
-
"##
|
| 799 |
"\n",
|
| 800 |
"A systematic learning path for LLM training optimization."
|
| 801 |
-
]
|
|
|
|
| 802 |
},
|
| 803 |
{
|
| 804 |
"cell_type": "code",
|
|
@@ -807,13 +730,14 @@
|
|
| 807 |
"outputs": [],
|
| 808 |
"source": [
|
| 809 |
"LossDebugger.print_study_roadmap()"
|
| 810 |
-
]
|
|
|
|
| 811 |
},
|
| 812 |
{
|
| 813 |
"cell_type": "markdown",
|
| 814 |
"metadata": {},
|
| 815 |
"source": [
|
| 816 |
-
"##
|
| 817 |
"\n",
|
| 818 |
"**Recommended order:**\n",
|
| 819 |
"1. Check status with Level 0 β tells you which level to inspect\n",
|
|
@@ -843,7 +767,8 @@
|
|
| 843 |
"---\n",
|
| 844 |
"**Previous step:** Train the model in `03_training.ipynb`. \n",
|
| 845 |
"**Next step:** Evaluate the trained model in `04_evaluation.ipynb`."
|
| 846 |
-
]
|
|
|
|
| 847 |
}
|
| 848 |
],
|
| 849 |
"metadata": {
|
|
|
|
| 18 |
"Level 4: Fitting Diagnosis β overfitting vs underfitting\n",
|
| 19 |
"Level 5: Architecture β initialization, per-layer activation\n",
|
| 20 |
"```"
|
| 21 |
+
],
|
| 22 |
+
"id": "19f7e954"
|
| 23 |
},
|
| 24 |
{
|
| 25 |
"cell_type": "code",
|
|
|
|
| 29 |
"source": [
|
| 30 |
"# No additional packages required\n",
|
| 31 |
"# LossDebugger only uses torch and built-in llm_lab modules"
|
| 32 |
+
],
|
| 33 |
+
"id": "af6a605f"
|
| 34 |
},
|
| 35 |
{
|
| 36 |
"cell_type": "code",
|
|
|
|
| 57 |
"from llm_lab.data import setup_data_pipeline\n",
|
| 58 |
"from llm_lab.training import LossDebugger\n",
|
| 59 |
"from llm_lab.utils import auto_configure, get_device"
|
| 60 |
+
],
|
| 61 |
+
"id": "23b92c8d"
|
| 62 |
},
|
| 63 |
{
|
| 64 |
"cell_type": "markdown",
|
|
|
|
| 67 |
"## 0. Configuration\n",
|
| 68 |
"\n",
|
| 69 |
"Use the `debug_10m` preset so it runs quickly even on CPU."
|
| 70 |
+
],
|
| 71 |
+
"id": "f3daf6ad"
|
| 72 |
},
|
| 73 |
{
|
| 74 |
"cell_type": "code",
|
|
|
|
| 93 |
"print(f\"Device: {device}, dtype: {dtype}\")\n",
|
| 94 |
"print(f\"Vocab size: {vocab_size:,}\")\n",
|
| 95 |
"print(f\"Expected initial loss: ln({vocab_size}) = {math.log(vocab_size):.2f}\")"
|
| 96 |
+
],
|
| 97 |
+
"id": "dc7aa763"
|
| 98 |
},
|
| 99 |
{
|
| 100 |
"cell_type": "code",
|
|
|
|
| 108 |
"\n",
|
| 109 |
"# --- Data pipeline (GPT-2 tokenizer used automatically) ---\n",
|
| 110 |
"tokenizer, train_dl, val_dl = setup_data_pipeline(config=data_config)"
|
| 111 |
+
],
|
| 112 |
+
"id": "6ed5cdad"
|
| 113 |
},
|
| 114 |
{
|
| 115 |
"cell_type": "markdown",
|
|
|
|
| 124 |
"```python\n",
|
| 125 |
"# metrics_history = trainer.metrics.history\n",
|
| 126 |
"```"
|
| 127 |
+
],
|
| 128 |
+
"id": "58a40e59"
|
| 129 |
},
|
| 130 |
{
|
| 131 |
"cell_type": "code",
|
|
|
|
| 141 |
"# --- Scenario A: loss stuck near ln(vocab_size) ---\n",
|
| 142 |
"# loss barely decreasing (suspected data/implementation bug)\n",
|
| 143 |
"# diagnose_status condition: loss_change < 0.1\n",
|
|
|
|
| 144 |
"n_steps_a = 200\n",
|
| 145 |
"mock_history_a = {\n",
|
| 146 |
" \"step\": list(range(1, n_steps_a + 1)),\n",
|
|
|
|
| 314 |
"}\n",
|
| 315 |
"print(f\"\\nMock H β train_loss: only 1 step\")\n",
|
| 316 |
"print(f\" Expected: INSUFFICIENT_DATA (need >= 2 steps)\")"
|
| 317 |
+
],
|
| 318 |
+
"id": "0fc90eec"
|
| 319 |
},
|
| 320 |
{
|
| 321 |
"cell_type": "markdown",
|
|
|
|
| 333 |
"| `PLATEAU` | Loss stuck at high value | yellow |\n",
|
| 334 |
"| `OVERFITTING` | trainβ valβ | yellow |\n",
|
| 335 |
"| `UNSTABLE` | High loss variance | yellow |"
|
| 336 |
+
],
|
| 337 |
+
"id": "0eb757cb"
|
| 338 |
},
|
| 339 |
{
|
| 340 |
"cell_type": "code",
|
|
|
|
| 375 |
" print(f\" {icon} {name:25s} expected={expected:20s} got={actual}\")\n",
|
| 376 |
"passed = sum(1 for *_, m in results if m == \"PASS\")\n",
|
| 377 |
"print(f\"\\n {passed}/{len(results)} passed\")"
|
| 378 |
+
],
|
| 379 |
+
"id": "b22fc65a"
|
| 380 |
},
|
| 381 |
{
|
| 382 |
"cell_type": "code",
|
|
|
|
| 389 |
"\n",
|
| 390 |
"result = LossDebugger.diagnose_status(vocab_size, metrics_history)\n",
|
| 391 |
"print(f\"\\nReal checkpoint diagnosis: {result['status']}\")"
|
| 392 |
+
],
|
| 393 |
+
"id": "da4a7552"
|
| 394 |
},
|
| 395 |
{
|
| 396 |
"cell_type": "markdown",
|
|
|
|
| 406 |
"4. **Single-batch overfit** β repeat one batch for 200 steps β verify loss β 0\n",
|
| 407 |
"5. **Tokenizer round-trip** β verify text is preserved after encode β decode\n",
|
| 408 |
"6. **Data quality** β visually inspect sample text"
|
| 409 |
+
],
|
| 410 |
+
"id": "793c1fc8"
|
| 411 |
},
|
| 412 |
{
|
| 413 |
"cell_type": "code",
|
|
|
|
| 427 |
"print(f\"\\nPassed: {len(level1['passed'])}, Failed: {len(level1['failed'])}\")\n",
|
| 428 |
"for f in level1['failed']:\n",
|
| 429 |
" print(f\" FAILED: {f['name']} β {f['detail']}\")"
|
| 430 |
+
],
|
| 431 |
+
"id": "d774e1c8"
|
| 432 |
},
|
| 433 |
{
|
| 434 |
"cell_type": "markdown",
|
|
|
|
| 465 |
"| Loss β Inf | log of 0 | Add eps, set ignore_index |\n",
|
| 466 |
"| Loss oscillating | fp16 gradient underflow | Switch to bf16 or use GradScaler |\n",
|
| 467 |
"| Late-stage NaN | Gradual activation increase | Verify RMSNorm, check weight decay |"
|
| 468 |
+
],
|
| 469 |
+
"id": "eb7e5160"
|
| 470 |
},
|
| 471 |
{
|
| 472 |
"cell_type": "code",
|
|
|
|
| 480 |
" device=device,\n",
|
| 481 |
" dtype=dtype,\n",
|
| 482 |
")"
|
| 483 |
+
],
|
| 484 |
+
"id": "1beffcec"
|
| 485 |
},
|
| 486 |
{
|
| 487 |
"cell_type": "markdown",
|
|
|
|
| 520 |
"- Batch Γ2 β LR Γβ2 (square root scaling, safer)\n",
|
| 521 |
"- Batch Γ2 β LR Γ2 (linear scaling, used in GPT-3 etc.)\n",
|
| 522 |
"- Effective batch 64~512 is typical for 1B models"
|
| 523 |
+
],
|
| 524 |
+
"id": "0cc8f6f9"
|
| 525 |
},
|
| 526 |
{
|
| 527 |
"cell_type": "code",
|
|
|
|
| 533 |
" metrics_history=mock_history_a,\n",
|
| 534 |
" config=train_config,\n",
|
| 535 |
")"
|
| 536 |
+
],
|
| 537 |
+
"id": "217f1cd3"
|
| 538 |
},
|
| 539 |
{
|
| 540 |
"cell_type": "markdown",
|
|
|
|
| 547 |
"\n",
|
| 548 |
"> May take some time (approximately 1 minute for debug_10m).\n",
|
| 549 |
"> Run this once before starting actual training."
|
| 550 |
+
],
|
| 551 |
+
"id": "fea85ba4"
|
| 552 |
},
|
| 553 |
{
|
| 554 |
"cell_type": "code",
|
|
|
|
| 565 |
")\n",
|
| 566 |
"\n",
|
| 567 |
"print(f\"\\nSuggested peak LR: {lr_result['suggested_lr']:.2e}\")"
|
| 568 |
+
],
|
| 569 |
+
"id": "bfd64503"
|
| 570 |
},
|
| 571 |
{
|
| 572 |
"cell_type": "markdown",
|
|
|
|
| 604 |
" - Pythia, TinyLlama, OLMo, LLaMA all use dropout=0\n",
|
| 605 |
" - With sufficient data, the data itself is the best regularization\n",
|
| 606 |
" - Dropout is useful for fine-tuning with small data"
|
| 607 |
+
],
|
| 608 |
+
"id": "edb91923"
|
| 609 |
},
|
| 610 |
{
|
| 611 |
"cell_type": "code",
|
|
|
|
| 621 |
" model_params=model_params,\n",
|
| 622 |
" total_tokens=total_tokens,\n",
|
| 623 |
")"
|
| 624 |
+
],
|
| 625 |
+
"id": "505692ed"
|
| 626 |
},
|
| 627 |
{
|
| 628 |
"cell_type": "markdown",
|
|
|
|
| 654 |
"| RoPE β absolute positional embedding | Small difference for short sequences | Check RoPE implementation |\n",
|
| 655 |
"| SwiGLU β ReLU FFN | Loss +0.05~0.15 | Check SwiGLU implementation |\n",
|
| 656 |
"| GQA β MHA | Almost same loss (memory difference only) | KV repeat bug |"
|
| 657 |
+
],
|
| 658 |
+
"id": "3b65db22"
|
| 659 |
},
|
| 660 |
{
|
| 661 |
"cell_type": "code",
|
|
|
|
| 670 |
")\n",
|
| 671 |
"\n",
|
| 672 |
"print(f\"\\n>>> Diagnosis: {level5['diagnosis']}\")"
|
| 673 |
+
],
|
| 674 |
+
"id": "e1d7a64e"
|
| 675 |
},
|
| 676 |
{
|
| 677 |
"cell_type": "markdown",
|
| 678 |
"metadata": {},
|
| 679 |
"source": [
|
| 680 |
+
"## 7. Full Diagnostics (run_diagnostics)\n",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 681 |
"\n",
|
| 682 |
"Runs all levels above at once. Use the `levels` parameter to select which levels to run.\n",
|
| 683 |
"\n",
|
|
|
|
| 690 |
"# device=device, dtype=dtype,\n",
|
| 691 |
"# )\n",
|
| 692 |
"```"
|
| 693 |
+
],
|
| 694 |
+
"id": "8cc8a5f1"
|
| 695 |
},
|
| 696 |
{
|
| 697 |
"cell_type": "code",
|
|
|
|
| 710 |
" vocab_size=vocab_size,\n",
|
| 711 |
" levels=[0, 1, 2, 3, 4, 5],\n",
|
| 712 |
")"
|
| 713 |
+
],
|
| 714 |
+
"id": "c7f3a147"
|
| 715 |
},
|
| 716 |
{
|
| 717 |
"cell_type": "markdown",
|
| 718 |
"metadata": {},
|
| 719 |
"source": [
|
| 720 |
+
"## 8. Study Roadmap\n",
|
| 721 |
"\n",
|
| 722 |
"A systematic learning path for LLM training optimization."
|
| 723 |
+
],
|
| 724 |
+
"id": "093ac866"
|
| 725 |
},
|
| 726 |
{
|
| 727 |
"cell_type": "code",
|
|
|
|
| 730 |
"outputs": [],
|
| 731 |
"source": [
|
| 732 |
"LossDebugger.print_study_roadmap()"
|
| 733 |
+
],
|
| 734 |
+
"id": "fc94ba5f"
|
| 735 |
},
|
| 736 |
{
|
| 737 |
"cell_type": "markdown",
|
| 738 |
"metadata": {},
|
| 739 |
"source": [
|
| 740 |
+
"## 9. Debugging Tips\n",
|
| 741 |
"\n",
|
| 742 |
"**Recommended order:**\n",
|
| 743 |
"1. Check status with Level 0 β tells you which level to inspect\n",
|
|
|
|
| 767 |
"---\n",
|
| 768 |
"**Previous step:** Train the model in `03_training.ipynb`. \n",
|
| 769 |
"**Next step:** Evaluate the trained model in `04_evaluation.ipynb`."
|
| 770 |
+
],
|
| 771 |
+
"id": "eb140c33"
|
| 772 |
}
|
| 773 |
],
|
| 774 |
"metadata": {
|