Vjeong Claude Opus 4.6 committed on
Commit
5b7ea5e
·
1 Parent(s): c1a8df8

feat(training): add LossDebugger 5-level diagnostic framework


Systematic debugging tool for diagnosing LLM training loss issues:
- Level 0: auto-classify training health status (6 categories)
- Level 1: data/implementation bug checks (shift, token range, overfit test)
- Level 2: numerical stability (mixed precision config, gradient/activation
NaN/Inf detection, common issues reference table)
- Level 3: hyperparameter analysis (LR, β₂, weight decay, batch-LR scaling,
GPT-3 LR reference table, warmup reference, LR range test)
- Level 4: overfitting vs underfitting diagnosis (detailed sub-cause analysis,
Chinchilla token ratio, dropout note for pretraining)
- Level 5: architecture checks (per-layer activation stats, weight init
distribution analysis, ablation study reference)
- Scenario auto-detection (A/B/C/D) with step-by-step recommendations
- Study roadmap with recommended experiments and key references

Also adds 05_debugging.ipynb with mock scenarios A/B/C/D for all
diagnostic levels and educational content from the optimization guide.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
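
The Level 1 "initial loss" check described above compares a freshly initialized model's loss against ln(vocab_size): a random-weight LM predicts a near-uniform distribution over the vocabulary, so its cross-entropy starts near that value. A minimal standalone sketch of that sanity check (plain Python, independent of the `llm_lab` code in this commit; the tolerance of 1.0 nat mirrors the threshold used in the diff):

```python
import math

def expected_initial_loss(vocab_size: int) -> float:
    """Cross-entropy of a uniform prediction over vocab_size classes."""
    return math.log(vocab_size)

def initial_loss_ok(measured: float, vocab_size: int, tol: float = 1.0) -> bool:
    # A randomly initialized LM should start within ~tol nats of ln(V);
    # much higher hints at a label/shift bug, much lower at data leakage.
    return abs(measured - expected_initial_loss(vocab_size)) < tol

print(f"ln(32000) = {expected_initial_loss(32000):.2f}")  # ≈ 10.37
```
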

CLAUDE.md CHANGED
@@ -35,7 +35,8 @@ LLM_Foundation_Model/
 │ │ ├── metrics.py # MetricsTracker (wandb integration)
 │ │ ├── optimizer.py # create_optimizer (weight decay separation)
 │ │ ├── trainer.py # Trainer (gradient accumulation, mixed precision)
-│ │ └── runner.py # start_training (one-line helper)
+│ │ ├── runner.py # start_training (one-line helper)
+│ │ └── debugger.py # LossDebugger (5-level diagnostic framework)
 │ ├── evaluation/ # Evaluation & analysis
 │ │ ├── perplexity.py # PerplexityEvaluator (including per-position loss)
 │ │ ├── generation.py # GenerationEvaluator (various prompts)
@@ -52,7 +53,8 @@ LLM_Foundation_Model/
 │ ├── 01_data_pipeline.ipynb
 │ ├── 02_model.ipynb
 │ ├── 03_training.ipynb
-│ └── 04_evaluation.ipynb
+│ ├── 04_evaluation.ipynb
+│ └── 05_debugging.ipynb
 └── _archive/ # Original single-file backups
 ├── llm-1b-model.py
 ├── llm-1b-data-pipeline.py
llm_lab/training/__init__.py CHANGED
@@ -5,8 +5,9 @@ from .metrics import MetricsTracker
 from .optimizer import create_optimizer
 from .trainer import Trainer
 from .runner import start_training
+from .debugger import LossDebugger

 __all__ = [
     "CosineWarmupScheduler", "CheckpointManager", "MetricsTracker",
-    "create_optimizer", "Trainer", "start_training",
+    "create_optimizer", "Trainer", "start_training", "LossDebugger",
 ]
llm_lab/training/debugger.py ADDED
@@ -0,0 +1,1442 @@
+"""LLM Loss Debugging & Optimization Framework.
+
+A systematic 5-level debugging framework for diagnosing training issues.
+Always start from Level 1 — fixing lower-level bugs before tuning
+hyperparameters saves time.
+
+Levels:
+    0. Status Diagnosis — classify current training health
+    1. Data/Implementation — most common cause (70% of issues)
+    2. Numerical Stability — dtype, normalization, gradient health
+    3. Hyperparameters — LR, batch size, warmup
+    4. Fitting Diagnosis — overfitting vs underfitting
+    5. Architecture — initialization, component checks
+"""
+
+import copy
+import math
+from typing import Any, Dict, List, Optional, Tuple
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.utils.data import DataLoader
+
+from llm_lab.config import TrainConfig
+
+
+# ══════════════════════════════════════════════════════════════════
+# Constants
+# ══════════════════════════════════════════════════════════════════
+
+# Normal convergence ranges for a 1B model trained on ~10B tokens
+_EXPECTED_TRAIN_LOSS = (2.8, 3.5)
+_EXPECTED_VAL_LOSS = (3.0, 3.8)
+_EXPECTED_VAL_PPL = (20, 45)
+
+# Status labels
+STATUS_NORMAL = "NORMAL"
+STATUS_NO_DECREASE = "NO_DECREASE"
+STATUS_DIVERGING = "DIVERGING"
+STATUS_PLATEAU = "PLATEAU"
+STATUS_OVERFITTING = "OVERFITTING"
+STATUS_UNSTABLE = "UNSTABLE"
+
+# GPT-3 LR reference by model size (Brown et al. 2020, Table 2.1)
+# (param_count, recommended_lr, batch_tokens_str)
+_GPT3_LR_REFERENCE = [
+    (125e6, 6e-4, "0.5M"),
+    (350e6, 3e-4, "0.5M"),
+    (1.3e9, 2e-4, "1M"),
+    (2.7e9, 1.6e-4, "1M"),
+    (6.7e9, 1.2e-4, "2M"),
+    (175e9, 6e-5, "3.2M"),
+]
+
+# Known LLM training references
+_LLM_TRAINING_REFS = {
+    "TinyLlama-1.1B": {"lr": 4e-4, "beta2": 0.95, "wd": 0.1, "warmup": 2000},
+    "LLaMA-7B": {"lr": 3e-4, "beta2": 0.95, "wd": 0.1, "warmup": 2000},
+    "Pythia-1B": {"lr": 3e-4, "beta2": 0.95, "wd": 0.1},
+    "OLMo-1B": {"lr": 4e-4, "beta2": 0.95, "wd": 0.1},
+}
+
+# Recommended β₂ for LLM training
+_RECOMMENDED_BETA2 = 0.95
+_DEFAULT_PYTORCH_BETA2 = 0.999
+
+
+def _header(title: str) -> str:
+    return f"\n{'=' * 60}\n{title}\n{'=' * 60}"
+
+
+def _check_result(name: str, passed: bool, detail: str = "") -> Dict[str, Any]:
+    return {"name": name, "passed": passed, "detail": detail}
+
+
+# ══════════════════════════════════════════════════════════════════
+# LossDebugger
+# ══════════════════════════════════════════════════════════════════
+
+
+class LossDebugger:
+    """5-level loss debugging framework for LLM training.
+
+    Usage::
+
+        from llm_lab.training.debugger import LossDebugger
+
+        # Quick status check
+        status = LossDebugger.diagnose_status(vocab_size=32000,
+            metrics_history=trainer.metrics.history)
+
+        # Full diagnostics
+        report = LossDebugger.run_diagnostics(
+            model=model, dataloader=train_dl, tokenizer=tok,
+            train_config=train_cfg, metrics_history=trainer.metrics.history,
+            device=device, dtype=torch.bfloat16,
+        )
+    """
+
+    # ──────────────────────────────────────────────────────────────
+    # Level 0: Status Diagnosis
+    # ──────────────────────────────────────────────────────────────
+
+    @staticmethod
+    def diagnose_status(
+        vocab_size: int,
+        metrics_history: Dict[str, list],
+    ) -> Dict[str, Any]:
+        """Classify current training health from metrics history.
+
+        Args:
+            vocab_size: model vocabulary size (e.g. 32000)
+            metrics_history: dict with keys 'train_loss', 'val_loss', etc.
+
+        Returns:
+            dict with 'status', 'severity', 'details', 'recommended_levels'
+        """
+        print(_header("Level 0: Training Status Diagnosis"))
+
+        expected_initial = math.log(vocab_size)
+        print(f" Expected initial loss (random weights): ln({vocab_size}) = {expected_initial:.2f}")
+        print(" Normal convergence range (1B, 10B tokens):")
+        print(f" Train Loss: {_EXPECTED_TRAIN_LOSS[0]} ~ {_EXPECTED_TRAIN_LOSS[1]}")
+        print(f" Val Loss: {_EXPECTED_VAL_LOSS[0]} ~ {_EXPECTED_VAL_LOSS[1]}")
+        print(f" Val PPL: {_EXPECTED_VAL_PPL[0]} ~ {_EXPECTED_VAL_PPL[1]}")
+
+        train_losses = metrics_history.get("train_loss", [])
+        val_losses = [v for v in metrics_history.get("val_loss", []) if v is not None]
+
+        if len(train_losses) < 2:
+            print("\n [!] Not enough training data to diagnose. Run more steps first.")
+            return {
+                "status": "INSUFFICIENT_DATA",
+                "severity": "unknown",
+                "details": "Need at least 2 logged train loss values.",
+                "recommended_levels": [1],
+            }
+
+        first_loss = train_losses[0]
+        last_loss = train_losses[-1]
+        loss_change = first_loss - last_loss
+
+        # Split into halves for trend analysis
+        mid = len(train_losses) // 2
+        first_half_avg = sum(train_losses[:mid]) / mid
+        second_half_avg = sum(train_losses[mid:]) / (len(train_losses) - mid)
+
+        # Recent window for spike detection
+        recent_n = min(50, len(train_losses))
+        recent = train_losses[-recent_n:]
+        recent_mean = sum(recent) / len(recent)
+        recent_var = sum((x - recent_mean) ** 2 for x in recent) / len(recent)
+        recent_std = recent_var ** 0.5
+
+        # Val trend
+        val_trend = "unknown"
+        if len(val_losses) >= 2:
+            val_mid = len(val_losses) // 2
+            val_first_avg = sum(val_losses[:max(val_mid, 1)]) / max(val_mid, 1)
+            val_second_avg = sum(val_losses[val_mid:]) / max(len(val_losses) - val_mid, 1)
+            if val_second_avg < val_first_avg - 0.05:
+                val_trend = "decreasing"
+            elif val_second_avg > val_first_avg + 0.1:
+                val_trend = "increasing"
+            else:
+                val_trend = "flat"
+
+        # ── Classify ──
+        status = STATUS_NORMAL
+        severity = "green"
+        details = ""
+        recommended_levels: List[int] = []
+
+        # Check 1: No decrease at all
+        if loss_change < 0.1 and first_loss > expected_initial - 2.0:
+            status = STATUS_NO_DECREASE
+            severity = "red"
+            details = (
+                f"Loss barely changed: {first_loss:.4f} -> {last_loss:.4f} "
+                f"(delta={loss_change:.4f}). Likely a data or implementation bug."
+            )
+            recommended_levels = [1, 2]
+
+        # Check 2: Diverging
+        elif last_loss > expected_initial + 1.0:
+            status = STATUS_DIVERGING
+            severity = "red"
+            details = (
+                f"Loss ({last_loss:.4f}) exceeds initial value ({expected_initial:.2f}). "
+                f"Training is diverging — check LR, data, or numerical issues."
+            )
+            recommended_levels = [1, 2, 3]
+
+        # Check 3: Unstable (large spikes)
+        elif recent_std > 0.5 * recent_mean:
+            status = STATUS_UNSTABLE
+            severity = "yellow"
+            details = (
+                f"High loss variance: std={recent_std:.4f}, mean={recent_mean:.4f}. "
+                f"Training is unstable — likely LR too high or batch too small."
+            )
+            recommended_levels = [3, 2]
+
+        # Check 4: Overfitting
+        elif val_trend == "increasing" and second_half_avg < first_half_avg:
+            status = STATUS_OVERFITTING
+            severity = "yellow"
+            details = (
+                f"Train loss decreasing but val loss increasing. "
+                f"Train trend: {first_half_avg:.4f} -> {second_half_avg:.4f}, "
+                f"Val trend: {val_trend}."
+            )
+            recommended_levels = [4]
+
+        # Check 5: Plateau
+        elif abs(second_half_avg - first_half_avg) < 0.05 and last_loss > _EXPECTED_TRAIN_LOSS[1]:
+            status = STATUS_PLATEAU
+            severity = "yellow"
+            details = (
+                f"Loss has plateaued: first half avg={first_half_avg:.4f}, "
+                f"second half avg={second_half_avg:.4f}. "
+                f"Current loss ({last_loss:.4f}) is above expected range."
+            )
+            recommended_levels = [3, 4, 5]
+
+        # Normal
+        else:
+            status = STATUS_NORMAL
+            severity = "green"
+            details = (
+                f"Training looks healthy: {first_loss:.4f} -> {last_loss:.4f} "
+                f"(delta={loss_change:.4f}). Val trend: {val_trend}."
+            )
+            recommended_levels = []
+
+        # ── Print ──
+        icons = {"red": "🔴", "yellow": "🟡", "green": "🟢"}
+        icon = icons.get(severity, "⚪")
+        print(f"\n {icon} Status: {status}")
+        print(f" {details}")
+        if recommended_levels:
+            print(f" Recommended: check Level(s) {recommended_levels}")
+
+        return {
+            "status": status,
+            "severity": severity,
+            "details": details,
+            "recommended_levels": recommended_levels,
+        }
+
+    # ──────────────────────────────────────────────────────────────
+    # Level 1: Data / Implementation Bug Checks
+    # ──────────────────────────────────────────────────────────────
+
+    @staticmethod
+    def check_data_pipeline(
+        model: nn.Module,
+        dataloader: DataLoader,
+        tokenizer: Any,
+        vocab_size: int,
+        device: torch.device,
+        dtype: torch.dtype = torch.bfloat16,
+    ) -> Dict[str, Any]:
+        """Run 6 data/implementation checks (Level 1).
+
+        This is the most important level — 70% of loss issues are data bugs.
+
+        Checks:
+            1. Shift relationship (targets[t] == input_ids[t+1])
+            2. Token range (0 <= ids < vocab_size)
+            3. Initial loss (≈ ln(vocab_size) for random weights)
+            4. Single-batch overfit (loss → ~0 in 200 steps)
+            5. Tokenizer roundtrip (encode→decode preserves text)
+            6. Data quality sampling (visual inspection)
+        """
+        print(_header("Level 1: Data / Implementation Bug Checks"))
+        print(" (70% of loss issues come from data pipeline bugs)\n")
+
+        results: List[Dict[str, Any]] = []
+        batch = next(iter(dataloader))
+        input_ids = batch["input_ids"]
+        targets = batch["targets"]
+
+        # ── Check 1: Shift relationship ──
+        shift_match = (input_ids[:, 1:] == targets[:, :-1]).float().mean().item()
+        passed = shift_match > 0.99
+        detail = f"Shift consistency: {shift_match * 100:.1f}% (should be ~100%)"
+        results.append(_check_result("Shift relationship", passed, detail))
+        icon = "✅" if passed else "❌"
+        print(f" {icon} Check 1: {detail}")
+
+        # ── Check 2: Token range ──
+        min_id = input_ids.min().item()
+        max_id = input_ids.max().item()
+        range_ok = min_id >= 0 and max_id < vocab_size
+        detail = f"Token range: [{min_id}, {max_id}], vocab_size={vocab_size}"
+        results.append(_check_result("Token range", range_ok, detail))
+        icon = "✅" if range_ok else "❌"
+        print(f" {icon} Check 2: {detail}")
+
+        # ── Check 3: Initial loss ──
+        expected_loss = math.log(vocab_size)
+        model_copy = copy.deepcopy(model)
+        model_copy._init_weights()  # re-initialize to random
+        model_copy.to(device)
+        model_copy.eval()
+        with torch.no_grad():
+            with torch.amp.autocast(device_type="cuda", dtype=dtype, enabled=(dtype != torch.float32)):
+                _, initial_loss = model_copy(
+                    input_ids.to(device),
+                    targets.to(device),
+                )
+        initial_loss_val = initial_loss.item()
+        loss_diff = abs(initial_loss_val - expected_loss)
+        loss_ok = loss_diff < 1.0
+        detail = (
+            f"Initial loss: {initial_loss_val:.4f} vs expected {expected_loss:.2f} "
+            f"(diff={loss_diff:.4f})"
+        )
+        results.append(_check_result("Initial loss", loss_ok, detail))
+        icon = "✅" if loss_ok else "❌"
+        print(f" {icon} Check 3: {detail}")
+        if initial_loss_val > expected_loss + 1.0:
+            print(" Hint: loss >> ln(V) suggests label mismatch or loss function bug")
+        elif initial_loss_val < expected_loss - 2.0:
+            print(" Hint: loss << ln(V) suggests data leakage")
+        del model_copy
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+
+        # ── Check 4: Single-batch overfit test ──
+        print("\n ⏳ Check 4: Single-batch overfit test (200 steps)...")
+        overfit_model = copy.deepcopy(model)
+        overfit_model.to(device)
+        overfit_model.train()
+        overfit_optimizer = torch.optim.AdamW(overfit_model.parameters(), lr=1e-3)
+        single_input = input_ids[:1].to(device)  # single sample
+        single_target = targets[:1].to(device)
+
+        overfit_losses = []
+        for step in range(200):
+            overfit_optimizer.zero_grad()
+            with torch.amp.autocast(device_type="cuda", dtype=dtype, enabled=(dtype != torch.float32)):
+                _, loss = overfit_model(single_input, single_target)
+            loss.backward()
+            overfit_optimizer.step()
+            overfit_losses.append(loss.item())
+            if (step + 1) % 50 == 0:
+                print(f" Step {step + 1}: Loss = {loss.item():.4f}")
+
+        final_overfit_loss = overfit_losses[-1]
+        overfit_ok = final_overfit_loss < 0.1
+        detail = (
+            f"Single-batch overfit: {overfit_losses[0]:.4f} -> {final_overfit_loss:.4f} "
+            f"(target < 0.1)"
+        )
+        results.append(_check_result("Single-batch overfit", overfit_ok, detail))
+        icon = "✅" if overfit_ok else "❌"
+        print(f" {icon} Check 4: {detail}")
+        if not overfit_ok:
+            print(" CRITICAL: Model cannot memorize a single batch!")
+            print(" This means the model or loss function has a bug.")
+        del overfit_model, overfit_optimizer
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+
+        # ── Check 5: Tokenizer roundtrip ──
+        test_text = "The quick brown fox jumps over the lazy dog."
+        encoded = tokenizer.encode(test_text)
+        decoded = tokenizer.decode(encoded)
+        roundtrip_ok = test_text.strip() in decoded.strip()
+        detail = f"Roundtrip: '{test_text}' -> '{decoded.strip()}'"
+        results.append(_check_result("Tokenizer roundtrip", roundtrip_ok, detail))
+        icon = "✅" if roundtrip_ok else "❌"
+        print(f" {icon} Check 5: {detail}")
+
+        # ── Check 6: Data quality sampling ──
+        print("\n 📋 Check 6: Data quality sampling (visual inspection)")
+        for i in range(min(3, input_ids.shape[0])):
+            sample_tokens = input_ids[i][:100].tolist()
+            decoded_text = tokenizer.decode(sample_tokens)
+            preview = decoded_text[:200].replace("\n", "\\n")
+            print(f" Sample {i}: {preview}...")
+
+        passed_count = sum(1 for r in results if r["passed"])
+        total_count = len(results)
+        print(f"\n Result: {passed_count}/{total_count} checks passed")
+
+        return {
+            "level": 1,
+            "checks": results,
+            "passed": [r for r in results if r["passed"]],
+            "failed": [r for r in results if not r["passed"]],
+        }
+
+    # ──────────────────────────────────────────────────────────────
+    # Level 2: Numerical Stability
+    # ──────────────────────────────────────────────────────────────
+
+    @staticmethod
+    def check_numerical_stability(
+        model: nn.Module,
+        dataloader: DataLoader,
+        device: torch.device,
+        dtype: torch.dtype = torch.bfloat16,
+    ) -> Dict[str, Any]:
+        """Check for NaN/Inf in gradients, activations, and logits (Level 2).
+
+        Checks:
+            - Mixed precision config (RMSNorm fp32 upcast, loss dtype)
+            - NaN/Inf gradients → softmax overflow, bad data
+            - Inf gradients → log(0) in loss, missing ignore_index
+            - Large activations growing per layer → initialization or norm bug
+            - Logit scale → should be < 1000
+        """
+        print(_header("Level 2: Numerical Stability Checks"))
+
+        batch = next(iter(dataloader))
+        input_ids = batch["input_ids"].to(device)
+        targets = batch["targets"].to(device)
+
+        results: List[Dict[str, Any]] = []
+        activation_stats: List[Dict[str, Any]] = []
+
+        # ── Mixed Precision Configuration Check ──
+        print("\n Mixed Precision Config:")
+        print(f" Training dtype: {dtype}")
+
+        # Check RMSNorm fp32 upcast
+        norm_fp32_ok = True
+        for name, module in model.named_modules():
+            cls_name = module.__class__.__name__
+            if "Norm" in cls_name:
+                # Inspect forward source for .float() call
+                import inspect
+                try:
+                    src = inspect.getsource(module.forward)
+                    has_upcast = ".float()" in src or "float32" in src
+                except (TypeError, OSError):
+                    has_upcast = True  # assume ok if can't inspect
+                if not has_upcast:
+                    norm_fp32_ok = False
+                    print(f" 🔴 {name} ({cls_name}): no fp32 upcast detected!")
+                break  # checking one norm layer is enough
+        if norm_fp32_ok:
+            print(" ✅ Norm layers use fp32 upcast (safe)")
+
+        results.append(_check_result(
+            "Norm fp32 upcast", norm_fp32_ok,
+            "Norm computes in fp32" if norm_fp32_ok else "Norm may lose precision in half dtype",
+        ))
+
+        # Check loss computation dtype
+        if dtype in (torch.bfloat16, torch.float16):
+            print(f" ℹ️ Best practice: compute loss in fp32 when using {dtype}")
+            print(" logits_fp32 = logits.float()")
+            print(" loss = F.cross_entropy(logits_fp32.view(-1, V), targets.view(-1))")
+
+        # Common numerical issues reference
+        print("\n Common Numerical Issues Reference:")
+        print(" ┌────────────────────┬──────────────────────────┬─────────────────────────┐")
+        print(" │ Symptom            │ Likely Cause             │ Solution                │")
+        print(" ├────────────────────┼──────────────────────────┼─────────────────────────┤")
+        print(" │ Loss → NaN         │ Large logits → softmax   │ Check init, logit scale │")
+        print(" │ Loss → Inf         │ log(0) in CE loss        │ Add eps, ignore_index   │")
+        print(" │ Loss oscillation   │ fp16 gradient underflow  │ Switch to bf16 / scaler │")
+        print(" │ Late-training NaN  │ Activation growth        │ Check RMSNorm, wd       │")
+        print(" └────────────────────┴──────────────────────────┴─────────────────────────┘")
+
+        # ── Activation monitoring via hooks ──
+        hooks = []
+
+        def make_hook(name: str):
+            def hook_fn(module, input, output):
+                if isinstance(output, torch.Tensor):
+                    out_f = output.float()
+                    stats = {
+                        "name": name,
+                        "mean": out_f.mean().item(),
+                        "std": out_f.std().item(),
+                        "max": out_f.abs().max().item(),
+                        "has_nan": bool(torch.isnan(output).any()),
+                        "has_inf": bool(torch.isinf(output).any()),
+                    }
+                    activation_stats.append(stats)
+            return hook_fn
+
+        # Register hooks on transformer layers
+        for i, layer in enumerate(model.layers):
+            h = layer.register_forward_hook(make_hook(f"layer_{i}"))
+            hooks.append(h)
+
+        # ── Forward + Backward ──
+        model.train()
+        model.zero_grad(set_to_none=True)
+        with torch.amp.autocast(device_type="cuda", dtype=dtype, enabled=(dtype != torch.float32)):
+            logits, loss = model(input_ids, targets)
+
+        loss_val = loss.item()
+        loss_ok = not (math.isnan(loss_val) or math.isinf(loss_val))
+        results.append(_check_result(
+            "Loss value",
+            loss_ok,
+            f"Loss = {loss_val:.4f}" if loss_ok else f"Loss = {loss_val} (NaN/Inf!)",
+        ))
+
+        loss.backward()
+
+        # Remove hooks
+        for h in hooks:
+            h.remove()
+
+        # ── Gradient checks ──
+        print("\n Gradient Health:")
+        grad_issues = []
+        for name, param in model.named_parameters():
+            if param.grad is None:
+                continue
+            grad = param.grad
+            if torch.isnan(grad).any():
+                grad_issues.append(f"🔴 NaN gradient: {name}")
+            if torch.isinf(grad).any():
+                grad_issues.append(f"🔴 Inf gradient: {name}")
+            if grad.abs().max().item() > 100:
+                grad_issues.append(
+                    f"🟡 Large gradient: {name} max={grad.abs().max().item():.1f}"
+                )
+
+        grad_ok = len(grad_issues) == 0
+        if grad_ok:
+            print(" ✅ All gradients are healthy (no NaN/Inf/large values)")
+        else:
+            for issue in grad_issues[:10]:  # limit output
+                print(f" {issue}")
+            if len(grad_issues) > 10:
+                print(f" ... and {len(grad_issues) - 10} more issues")
+
+        results.append(_check_result(
+            "Gradient health",
+            grad_ok,
+            f"{len(grad_issues)} issues found" if not grad_ok else "All healthy",
+        ))
+
+        # ── Activation checks ──
+        print("\n Activation Stats (per transformer layer):")
+        act_nan_count = 0
+        for stats in activation_stats:
+            icon = "🔴" if stats["has_nan"] or stats["has_inf"] else "  "
+            if stats["has_nan"] or stats["has_inf"]:
+                act_nan_count += 1
+            print(
+                f" {icon} {stats['name']}: "
+                f"mean={stats['mean']:.4f}, "
+                f"std={stats['std']:.4f}, "
+                f"max={stats['max']:.4f}"
+                + (" [NaN!]" if stats["has_nan"] else "")
+                + (" [Inf!]" if stats["has_inf"] else "")
+            )
+
+        act_ok = act_nan_count == 0
+        results.append(_check_result(
+            "Activation health",
+            act_ok,
+            f"{act_nan_count} layers with NaN/Inf" if not act_ok else "All layers healthy",
+        ))
+
+        # ── Logit scale check ──
+        logit_max = logits.float().abs().max().item()
+        logit_ok = logit_max < 1000
+        detail = f"Logit max abs value: {logit_max:.1f} (should be < 1000)"
+        results.append(_check_result("Logit scale", logit_ok, detail))
+        icon = "✅" if logit_ok else "🔴"
+        print(f"\n {icon} Logit scale: {detail}")
+
+        model.zero_grad(set_to_none=True)
+
+        passed_count = sum(1 for r in results if r["passed"])
+        print(f"\n Result: {passed_count}/{len(results)} checks passed")
+
+        return {
+            "level": 2,
+            "checks": results,
+            "activation_stats": activation_stats,
+            "grad_issues": grad_issues,
+        }
+
+    # ──────────────────────────────────────────────────────────────
+    # Level 3: Hyperparameter Diagnosis
+    # ──────────────────────────────────────────────────────────────
+
+    @staticmethod
+    def diagnose_hyperparameters(
+        metrics_history: Dict[str, list],
+        config: TrainConfig,
+    ) -> Dict[str, Any]:
+        """Analyze hyperparameter health from training metrics (Level 3).
+
+        Checks:
+            - LR: too high (grad_norm hitting clip limit) or too low (grad_norm tiny)
+            - Batch size: loss variance indicates batch too small
+            - Warmup: spikes in early steps indicate warmup too short
+        """
+        print(_header("Level 3: Hyperparameter Diagnosis"))
+
+        findings: List[Dict[str, str]] = []
+        grad_norms = metrics_history.get("grad_norm", [])
+        train_losses = metrics_history.get("train_loss", [])
+
+        # ── LR diagnosis ──
+        print("\n Learning Rate Analysis:")
+        print(f" Peak LR: {config.learning_rate:.2e}")
+        print(f" Min LR: {config.min_learning_rate:.2e}")
+
+        if grad_norms:
+            avg_grad = sum(grad_norms) / len(grad_norms)
+            clip_count = sum(1 for g in grad_norms if g >= config.grad_clip * 0.95)
+            clip_rate = clip_count / len(grad_norms)
+            tiny_count = sum(1 for g in grad_norms if g < 0.01)
+            tiny_rate = tiny_count / len(grad_norms)
+
+            print(f" Avg grad norm: {avg_grad:.4f}")
+            print(f" Clip rate: {clip_rate * 100:.1f}% (hitting max_norm={config.grad_clip})")
+            print(f" Tiny grad rate: {tiny_rate * 100:.1f}% (< 0.01)")
+
+            if clip_rate > 0.3:
+                findings.append({
+                    "issue": "LR may be too high",
+                    "evidence": f"Grad norm hits clip limit {clip_rate * 100:.0f}% of the time",
+                    "action": f"Try LR = {config.learning_rate / 2:.2e} (÷2)",
+                })
+                print(f" 🟡 Grad clipping frequent ({clip_rate * 100:.0f}%) → LR may be too high")
+            elif tiny_rate > 0.5:
+                findings.append({
+                    "issue": "LR may be too low",
+                    "evidence": f"Grad norm < 0.01 in {tiny_rate * 100:.0f}% of steps",
+                    "action": f"Try LR = {config.learning_rate * 2:.2e} (×2)",
+                })
+                print(f" 🟡 Grad norm too small ({tiny_rate * 100:.0f}% < 0.01) → LR may be too low")
+            else:
+                print(" ✅ LR looks appropriate")
+
+        # ── Batch size diagnosis ──
+        print("\n Batch Size Analysis:")
+        print(f" Effective batch: {config.effective_batch_size}")
+
+        if len(train_losses) >= 20:
+            recent_losses = train_losses[-20:]
+            loss_mean = sum(recent_losses) / len(recent_losses)
+            loss_var = sum((x - loss_mean) ** 2 for x in recent_losses) / len(recent_losses)
+            loss_cv = (loss_var ** 0.5) / max(loss_mean, 1e-8)
+
+            print(f" Recent loss CV: {loss_cv:.4f} (coefficient of variation)")
+
+            if loss_cv > 0.1:
+                findings.append({
+                    "issue": "Batch size may be too small",
+                    "evidence": f"Loss CV = {loss_cv:.4f} (high variance)",
+                    "action": "Increase gradient_accumulation_steps",
+                })
+                print(" 🟡 High loss variance → batch may be too small")
+            else:
+                print(" ✅ Loss variance is acceptable")
+
+        # ── β₂ diagnosis ──
+        print("\n β₂ (Adam second momentum) Analysis:")
+        print(f" Current β₂: {config.beta2}")
+        if config.beta2 >= _DEFAULT_PYTORCH_BETA2:
673
+ findings.append({
674
+ "issue": "ฮฒโ‚‚ may be too high for LLM training",
675
+ "evidence": (
676
+ f"ฮฒโ‚‚={config.beta2} (PyTorch default). "
677
+ f"LLM standard is {_RECOMMENDED_BETA2}"
678
+ ),
679
+ "action": f"Set beta2={_RECOMMENDED_BETA2} (used by LLaMA, TinyLlama, OLMo)",
680
+ })
681
+ print(f" ๐ŸŸก ฮฒโ‚‚={config.beta2} is PyTorch default โ†’ "
682
+ f"LLM training standard is {_RECOMMENDED_BETA2}")
683
+ print(f" Why: ฮฒโ‚‚=0.999 averages ~1000 steps of gradient stats,")
684
+ print(f" ฮฒโ‚‚=0.95 averages ~20 steps โ†’ faster adaptation to changing data")
685
+ print(f" (Cattaneo & Shigida 2025, Compagnoni et al. 2025)")
686
+ else:
687
+ print(f" โœ… ฮฒโ‚‚={config.beta2} is within LLM standard range")
688
+
689
+ # โ”€โ”€ Weight Decay diagnosis โ”€โ”€
690
+ print("\n Weight Decay Analysis:")
691
+ print(f" Current weight_decay: {config.weight_decay}")
692
+ if config.weight_decay == 0:
693
+ findings.append({
694
+ "issue": "Weight decay is disabled",
695
+ "evidence": "weight_decay=0 increases overfitting risk",
696
+ "action": "Set weight_decay=0.1 (standard for LLaMA, TinyLlama, GPT-3, OLMo)",
697
+ })
698
+ print(f" ๐ŸŸก weight_decay=0 โ†’ overfitting risk. Standard is 0.1")
699
+ elif config.weight_decay > 0.3:
700
+ findings.append({
701
+ "issue": "Weight decay may be too high",
702
+ "evidence": f"weight_decay={config.weight_decay} (unusually high)",
703
+ "action": "Try weight_decay=0.1 (standard value)",
704
+ })
705
+ print(f" ๐ŸŸก weight_decay={config.weight_decay} is unusually high (standard: 0.1)")
706
+ else:
707
+ print(f" โœ… weight_decay={config.weight_decay} is within normal range")
708
+
709
+ # โ”€โ”€ Model-size LR reference โ”€โ”€
710
+ print("\n GPT-3 LR Reference (Brown et al. 2020):")
711
+ print(" โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”")
712
+ print(" โ”‚ Model โ”‚ Peak LR โ”‚ Batch Tokens โ”‚")
713
+ print(" โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค")
714
+ for params, lr, batch_tok in _GPT3_LR_REFERENCE:
715
+ label = f"{params / 1e9:.1f}B" if params >= 1e9 else f"{params / 1e6:.0f}M"
716
+ marker = " โ†" if abs(params - 1.1e9) < 0.5e9 else ""
717
+ print(f" โ”‚ {label:<8} โ”‚ {lr:.1e} โ”‚ {batch_tok:<12} โ”‚{marker}")
718
+ print(" โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜")
719
+ print(" โ†’ Larger models need lower LR and larger batch")
720
+
721
+ # โ”€โ”€ Batch-LR scaling guidance โ”€โ”€
722
+ print("\n Batch-LR Scaling Rules:")
723
+ print(" โ€ข Batch ร—2 โ†’ LR ร—โˆš2 (square root scaling, safer)")
724
+ print(" โ€ข Batch ร—2 โ†’ LR ร—2 (linear scaling, used by GPT-3)")
725
+ print(" โ€ข 1B model: effective batch 64~512 is typical range")
726
+
727
+ # โ”€โ”€ Warmup diagnosis โ”€โ”€
728
+ print("\n Warmup Analysis:")
729
+ print(f" Warmup steps: {config.warmup_steps} "
730
+ f"({config.warmup_steps / config.total_steps * 100:.1f}% of total)")
731
+
732
+ if len(train_losses) >= 10:
733
+ early_losses = train_losses[:min(50, len(train_losses))]
734
+ # Detect spikes in early training
735
+ spike_count = 0
736
+ for i in range(1, len(early_losses)):
737
+ if early_losses[i] > early_losses[i - 1] * 1.5:
738
+ spike_count += 1
739
+
740
+ if spike_count > 3:
741
+ findings.append({
742
+ "issue": "Warmup may be too short",
743
+ "evidence": f"{spike_count} loss spikes in first {len(early_losses)} steps",
744
+ "action": f"Try warmup_steps = {config.warmup_steps * 2}",
745
+ })
746
+ print(f" ๐ŸŸก {spike_count} spikes in early training โ†’ warmup may be too short")
747
+ else:
748
+ print(f" โœ… Early training is stable")
749
+
750
+ # โ”€โ”€ Summary โ”€โ”€
751
+ if not findings:
752
+ print("\n โœ… No hyperparameter issues detected")
753
+ else:
754
+ print(f"\n Found {len(findings)} potential issue(s):")
755
+ for f in findings:
756
+ print(f" โ€ข {f['issue']}: {f['action']}")
757
+
758
+ # โ”€โ”€ Warmup reference from real projects โ”€โ”€
759
+ print("\n Warmup Reference (real projects):")
760
+ print(" โ€ข TinyLlama 1.1B (3T tokens): 2,000 steps โ‰ˆ 0.1% of total")
761
+ print(" โ€ข GPT-3 175B: 375 steps โ‰ˆ 0.2% of total")
762
+ print(" โ€ข General range: 0.1% ~ 5% of total steps")
763
+ print(" โ€ข Smaller experiments: 5~10% is also reasonable")
764
+
765
+ print("\n Tuning priority (high โ†’ low):")
766
+ print(" 1. Learning Rate โ† tune first (10x impact)")
767
+ print(" 2. Batch Size โ† adjust with LR")
768
+ print(" 3. Warmup Steps โ† early stability")
769
+ print(" 4. Weight Decay โ† if overfitting (typically 0.1)")
770
+ print(" 5. ฮฒโ‚, ฮฒโ‚‚ (Adam) โ† see ฮฒโ‚‚ analysis above")
771
+ print(" 6. Gradient Clip โ† usually keep at 1.0")
772
+
773
+ return {
774
+ "level": 3,
775
+ "findings": findings,
776
+ "config_summary": {
777
+ "learning_rate": config.learning_rate,
778
+ "effective_batch": config.effective_batch_size,
779
+ "warmup_steps": config.warmup_steps,
780
+ "total_steps": config.total_steps,
781
+ "grad_clip": config.grad_clip,
782
+ },
783
+ }
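The batch-LR scaling rules printed above can be sketched as a small standalone helper; `scale_lr` and the base values below are hypothetical, purely to illustrate square-root vs linear scaling, and are not part of `LossDebugger`.

```python
import math

def scale_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "sqrt") -> float:
    """Scale a learning rate when the effective batch size changes.

    rule="sqrt":   LR grows with the square root of the batch ratio (safer).
    rule="linear": LR grows linearly with the batch ratio (GPT-3 style).
    """
    ratio = new_batch / base_batch
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    if rule == "linear":
        return base_lr * ratio
    raise ValueError(f"unknown rule: {rule}")

# Doubling the batch from 128 to 256:
print(scale_lr(3e-4, 128, 256, "sqrt"))    # ~4.24e-4
print(scale_lr(3e-4, 128, 256, "linear"))  # 6e-4
```

Square-root scaling is the conservative default here since linear scaling can destabilize smaller runs.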
784
+
+ @staticmethod
+ def lr_range_test(
+ model: nn.Module,
+ dataloader: DataLoader,
+ device: torch.device,
+ dtype: torch.dtype = torch.bfloat16,
+ lr_start: float = 1e-7,
+ lr_end: float = 1e-1,
+ steps: int = 300,
+ ) -> Dict[str, Any]:
+ """Run an LR range test to find the optimal learning rate (Level 3 bonus).
+
+ Sweeps LR from lr_start to lr_end exponentially, recording loss.
+ The optimal LR is where loss decreases fastest (steepest slope),
+ divided by 3~10 for stability.
+
+ WARNING: This modifies a copy of the model. The original is untouched.
+ """
+ print(_header("Level 3 Bonus: LR Range Test"))
+ print(f" Sweeping LR from {lr_start:.1e} to {lr_end:.1e} over {steps} steps...\n")
+
+ test_model = copy.deepcopy(model)
+ test_model.to(device)
+ test_model.train()
+ optimizer = torch.optim.AdamW(test_model.parameters(), lr=lr_start)
+
+ lr_mult = (lr_end / lr_start) ** (1 / steps)
+ lr = lr_start
+
+ lrs: List[float] = []
+ losses: List[float] = []
+ data_iter = iter(dataloader)
+
+ for step in range(steps):
+ for pg in optimizer.param_groups:
+ pg["lr"] = lr
+
+ try:
+ batch = next(data_iter)
+ except StopIteration:
+ data_iter = iter(dataloader)
+ batch = next(data_iter)
+
+ input_ids = batch["input_ids"].to(device)
+ targets_t = batch["targets"].to(device)
+
+ optimizer.zero_grad()
+ with torch.amp.autocast(device_type=device.type, dtype=dtype, enabled=(dtype != torch.float32)):
+ _, loss = test_model(input_ids, targets_t)
+ loss.backward()
+ optimizer.step()
+
+ loss_val = loss.item()
+ lrs.append(lr)
+ losses.append(loss_val)
+
+ if (step + 1) % 50 == 0:
+ print(f" Step {step + 1}: LR = {lr:.2e}, Loss = {loss_val:.4f}")
+
+ # Stop if loss explodes
+ if len(losses) > 1 and loss_val > losses[0] * 4:
+ print(f" Loss exploded at LR = {lr:.2e}, stopping.")
+ break
+
+ lr *= lr_mult
+
+ del test_model, optimizer
+ if torch.cuda.is_available():
+ torch.cuda.empty_cache()
+
+ # Find steepest descent
+ best_lr = lr_start
+ if len(losses) > 10:
+ # Smooth losses and find steepest negative slope
+ window = 5
+ smoothed = []
+ for i in range(len(losses) - window):
+ smoothed.append(sum(losses[i:i + window]) / window)
+
+ min_slope = 0
+ min_idx = 0
+ for i in range(1, len(smoothed)):
+ slope = smoothed[i] - smoothed[i - 1]
+ if slope < min_slope:
+ min_slope = slope
+ min_idx = i
+
+ best_lr = lrs[min_idx]
+ suggested_lr = best_lr / 3 # conservative choice
+
+ print(f"\n Steepest descent at LR = {best_lr:.2e}")
+ print(f" Suggested peak LR: {suggested_lr:.2e} (÷3 for stability)")
+ print(f" Conservative range: [{best_lr / 10:.2e}, {best_lr / 3:.2e}]")
+ else:
+ suggested_lr = 3e-4
+ print(f"\n Not enough data points. Using default LR = {suggested_lr:.2e}")
+
+ return {
+ "lrs": lrs,
+ "losses": losses,
+ "best_lr": best_lr,
+ "suggested_lr": suggested_lr,
+ }
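The post-processing step of the range test (moving-average smoothing, then picking the steepest negative slope) can be exercised in isolation on synthetic data. `pick_lr` below is an illustrative extraction of that logic, not a method of the class:

```python
def pick_lr(lrs, losses, window=5):
    """Return the LR at the steepest decrease of the smoothed loss curve."""
    # Moving-average smoothing, same as in lr_range_test
    smoothed = [sum(losses[i:i + window]) / window
                for i in range(len(losses) - window)]
    best_i, best_slope = 0, 0.0
    for i in range(1, len(smoothed)):
        slope = smoothed[i] - smoothed[i - 1]
        if slope < best_slope:
            best_slope, best_i = slope, i
    return lrs[best_i]

# Synthetic sweep: loss flat at first, drops fastest in the middle, then explodes.
lrs = [1e-5 * (10 ** (i / 10)) for i in range(40)]
losses = ([10.0] * 15
          + [10.0 - 0.5 * i for i in range(1, 11)]
          + [5.0 + 2.0 * i for i in range(15)])
print(pick_lr(lrs, losses))  # ~3.2e-4, where the smoothed loss starts falling fastest
```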
888
+
+ # ───────────────────────────────────────────────────────────────
+ # Level 4: Overfitting vs Underfitting Diagnosis
+ # ───────────────────────────────────────────────────────────────
+
+ @staticmethod
+ def diagnose_fitting(
+ metrics_history: Dict[str, list],
+ model_params: Optional[int] = None,
+ total_tokens: Optional[int] = None,
+ ) -> Dict[str, Any]:
+ """Diagnose overfitting vs underfitting from metrics (Level 4).
+
+ Cases:
+ 1. Both high, decreasing → Normal (still training)
+ 2. Both high, plateau → Underfitting
+ 3. Train↓ Val→ or Val↑ → Overfitting
+ 4. Both low, plateau → Converged (or at limit)
+ """
+ print(_header("Level 4: Overfitting vs Underfitting Diagnosis"))
+
+ train_losses = metrics_history.get("train_loss", [])
+ val_losses = [v for v in metrics_history.get("val_loss", []) if v is not None]
+
+ if len(train_losses) < 10 or len(val_losses) < 2:
+ print(" [!] Not enough data. Need more training steps with eval.")
+ return {"level": 4, "case": "insufficient_data", "recommendations": []}
+
+ # Recent train trend
+ recent_n = min(50, len(train_losses))
+ train_recent = train_losses[-recent_n:]
+ train_mid = len(train_recent) // 2
+ train_first = sum(train_recent[:train_mid]) / max(train_mid, 1)
+ train_second = sum(train_recent[train_mid:]) / max(len(train_recent) - train_mid, 1)
+ train_decreasing = train_second < train_first - 0.02
+
+ # Val trend
+ val_mid = len(val_losses) // 2
+ val_first = sum(val_losses[:max(val_mid, 1)]) / max(val_mid, 1)
+ val_second = sum(val_losses[val_mid:]) / max(len(val_losses) - val_mid, 1)
+ val_decreasing = val_second < val_first - 0.02
+ val_increasing = val_second > val_first + 0.05
+
+ # Train-Val gap
+ last_train = train_losses[-1]
+ last_val = val_losses[-1]
+ gap = last_train - last_val # negative means val > train (typical)
+
+ print(f" Train loss (recent): {train_first:.4f} → {train_second:.4f} "
+ f"({'↓' if train_decreasing else '→'})")
+ print(f" Val loss: {val_first:.4f} → {val_second:.4f} "
+ f"({'↓' if val_decreasing else '↑' if val_increasing else '→'})")
+ print(f" Train-Val gap: {abs(gap):.4f}")
+
+ # ── Classify ──
+ case = ""
+ recommendations: List[str] = []
+
+ if train_decreasing and val_decreasing:
+ case = "Case 1: Normal — both decreasing"
+ recommendations.append("Training is progressing normally. Continue.")
+ if model_params and total_tokens:
+ ratio = total_tokens / model_params
+ chinchilla = 20 # Chinchilla optimal: 20 tokens per param
+ if ratio < chinchilla:
+ recommendations.append(
+ f"Token/param ratio = {ratio:.1f}x "
+ f"(Chinchilla optimal ≈ {chinchilla}x). "
+ f"Model may benefit from more data."
+ )
+ print(f"\n 🟢 {case}")
+
+ elif not train_decreasing and not val_decreasing and last_train > _EXPECTED_TRAIN_LOSS[1]:
+ case = "Case 2: Underfitting — both plateaued at high loss"
+ recommendations = [
+ "Diagnosis priority (check in order):",
+ "1) Training insufficient? → check if loss curve still has downward slope",
+ " - Chinchilla: 1B model needs ~20B tokens minimum",
+ " - TinyLlama trains 1.1B on 3T tokens (inference-optimal)",
+ "2) LR too low? → try LR ×2, see if loss drops faster",
+ "3) Model capacity too small? → train 2x larger model on same data",
+ " - If larger model gets lower loss → capacity was the limit",
+ "4) Data quality? → sample and read training data manually",
+ " - Noisy/low-quality data raises the achievable loss floor",
+ ]
+ if model_params and total_tokens:
+ ratio = total_tokens / model_params
+ if ratio < 10:
+ recommendations.insert(0,
+ f"⚠ Token/param ratio = {ratio:.1f}x — "
+ f"very likely undertrained. Chinchilla recommends ≥20x."
+ )
+ elif ratio < 20:
+ recommendations.insert(0,
+ f"ℹ Token/param ratio = {ratio:.1f}x — "
+ f"below Chinchilla optimal (20x). More tokens may help."
+ )
+ print(f"\n 🟡 {case}")
+
+ elif train_decreasing and (val_increasing or not val_decreasing):
+ case = "Case 3: Overfitting — train↓ but val→/↑"
+ recommendations = [
+ "Diagnosis priority (check in order):",
+ "1) Data repetition? (most common cause in pretraining)",
+ " - Check: total tokens vs unique tokens",
+ " - Epoch > 1 dramatically increases overfitting risk",
+ " - Solution: add more data, stay within 1 epoch",
+ "2) Weight decay too low?",
+ " - Check: weight_decay value (standard: 0.1)",
+ " - LLaMA, TinyLlama, OLMo, GPT-3 all use 0.1",
+ " - Experiment: 0.01 / 0.05 / 0.1 / 0.3",
+ "3) Data diversity?",
+ " - Single-domain data overfits faster",
+ " - Mix: web, books, code, wiki, etc.",
+ "",
+ "Note on Dropout in LLM pretraining:",
+ " - Modern LLMs do NOT use dropout in pretraining",
+ " (Pythia, TinyLlama, OLMo, LLaMA all use dropout=0)",
+ " - Sufficient data is the best regularization",
+ " - Dropout is useful for fine-tuning on small datasets",
+ ]
+ print(f"\n 🟡 {case}")
+
+ else:
+ case = "Case 4: Converged — loss is low and stable"
+ recommendations = [
+ "Training has converged (or reached the data/model limit).",
+ "To push further: add more data or increase model size.",
+ ]
+ print(f"\n 🟢 {case}")
+
+ for rec in recommendations:
+ print(f" {rec}")
+
+ return {
+ "level": 4,
+ "case": case,
+ "train_trend": "decreasing" if train_decreasing else "flat",
+ "val_trend": "decreasing" if val_decreasing else ("increasing" if val_increasing else "flat"),
+ "gap": abs(gap),
+ "recommendations": recommendations,
+ }
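The token/param ratio used in Cases 1 and 2 reduces to a one-liner. `token_param_ratio` below is a hypothetical helper mirroring the 10x/20x thresholds applied above (the "20 tokens per parameter" Chinchilla rule of thumb):

```python
def token_param_ratio(total_tokens: int, model_params: int) -> tuple:
    """Classify training-data sufficiency against the Chinchilla 20x rule."""
    ratio = total_tokens / model_params
    if ratio < 10:
        verdict = "very likely undertrained"
    elif ratio < 20:
        verdict = "below Chinchilla optimal"
    else:
        verdict = "at or beyond Chinchilla optimal"
    return ratio, verdict

# A 1.1B-parameter model trained on 3T tokens (TinyLlama's setup):
print(token_param_ratio(3_000_000_000_000, 1_100_000_000))
# A 1B model on only 5B tokens:
print(token_param_ratio(5_000_000_000, 1_000_000_000))
```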
1030
+
+ # ───────────────────────────────────────────────────────────────
+ # Level 5: Architecture Checks
+ # ───────────────────────────────────────────────────────────────
+
+ @staticmethod
+ def check_architecture(
+ model: nn.Module,
+ dataloader: DataLoader,
+ device: torch.device,
+ ) -> Dict[str, Any]:
+ """Check weight initialization and per-layer activation health (Level 5).
+
+ Healthy initialization:
+ - All layers: std ≈ 1.0, mean ≈ 0.0
+ Problems:
+ - std increasing per layer → activation explosion (init scale too large)
+ - std decreasing per layer → activation vanishing (init scale too small)
+ - Sudden change at specific layer → implementation bug in that layer
+ """
+ print(_header("Level 5: Architecture / Initialization Check"))
+
+ batch = next(iter(dataloader))
+ sample_input = batch["input_ids"][:1].to(device)
+
+ model.eval()
+ layer_stats: List[Dict[str, Any]] = []
+
+ with torch.no_grad():
+ h = model.token_embedding(sample_input)
+ emb_std = h.float().std().item()
+ print(f"\n Embedding: std={emb_std:.4f}")
+
+ for i, layer in enumerate(model.layers):
+ h = layer(h, mask=None, position_offset=0)
+ h_f = h.float()
+ stats = {
+ "layer": i,
+ "mean": h_f.mean().item(),
+ "std": h_f.std().item(),
+ "max": h_f.abs().max().item(),
+ }
+ layer_stats.append(stats)
+
+ # Print stats
+ print(f"\n Layer-by-layer activation statistics:")
+ print(f" {'Layer':<8} {'Mean':>10} {'Std':>10} {'Max':>10}")
+ print(f" {'-' * 38}")
+ for s in layer_stats:
+ print(f" {s['layer']:<8} {s['mean']:>10.4f} {s['std']:>10.4f} {s['max']:>10.4f}")
+
+ # ── Weight initialization distribution check ──
+ print(f"\n Weight Initialization Distribution:")
+ print(f" {'Parameter':<40} {'Mean':>10} {'Std':>10} {'Shape'}")
+ print(f" {'-' * 75}")
+ weight_issues = []
+ for name, param in model.named_parameters():
+ if param.ndim < 2:
+ continue # skip biases, norm weights
+ p_f = param.float()
+ p_mean = p_f.mean().item()
+ p_std = p_f.std().item()
+ # Expected: std ≈ 0.02 for most layers, smaller for residual projections
+ shape_str = str(list(param.shape))
+ is_residual = "o_proj" in name or "down_proj" in name
+ expected_std = 0.02 # GPT-2 style
+ if p_std > expected_std * 5:
+ weight_issues.append(f"Large std: {name} (std={p_std:.4f})")
+ print(f" 🟡 {name:<38} {p_mean:>10.4f} {p_std:>10.4f} {shape_str}")
+ elif p_std < expected_std * 0.1:
+ weight_issues.append(f"Tiny std: {name} (std={p_std:.6f})")
+ print(f" 🟡 {name:<38} {p_mean:>10.4f} {p_std:>10.6f} {shape_str}")
+ else:
+ print(f" {name:<38} {p_mean:>10.4f} {p_std:>10.4f} {shape_str}")
+
+ if weight_issues:
+ print(f"\n ⚠ {len(weight_issues)} weight distribution issue(s) found")
+ for issue in weight_issues[:5]:
+ print(f" • {issue}")
+ else:
+ print(f"\n ✅ All weight distributions look normal (std ≈ 0.02)")
+
+ print(f"\n Expected init pattern:")
+ print(f" • General Linear: N(0, 0.02)")
+ print(f" • Residual proj (o_proj, down_proj): N(0, 0.02/√(2×layers))")
+ print(f" • Embedding: N(0, 0.02)")
+
+ # ── Ablation study guidance ──
+ print(f"\n Component Ablation Reference:")
+ print(" ┌──────────────────────┬──────────────────────────────────┐")
+ print(" │ Experiment           │ Expected Outcome                 │")
+ print(" ├──────────────────────┼──────────────────────────────────┤")
+ print(" │ RMSNorm → LayerNorm  │ Minimal loss diff → OK           │")
+ print(" │ RoPE → Absolute PE   │ Similar on short seq (<512)      │")
+ print(" │ SwiGLU → ReLU FFN    │ Loss +0.05~0.15 → SwiGLU working │")
+ print(" │ GQA → MHA            │ Same loss, less memory → OK      │")
+ print(" └──────────────────────┴──────────────────────────────────┘")
+ print(" If any replacement shows unexpected results, check that component.")
+
+ # Analyze trends
+ stds = [s["std"] for s in layer_stats]
+ diagnosis = "healthy"
+ detail = ""
+
+ if len(stds) >= 3:
+ # Check for monotonic increase/decrease
+ first_third = sum(stds[:len(stds) // 3]) / (len(stds) // 3)
+ last_third = sum(stds[-(len(stds) // 3):]) / (len(stds) // 3)
+ ratio = last_third / max(first_third, 1e-8)
+
+ if ratio > 5:
+ diagnosis = "exploding"
+ detail = (
+ f"Activation std grows {ratio:.1f}x from early to late layers. "
+ f"Init scale may be too large."
+ )
+ elif ratio < 0.2:
+ diagnosis = "vanishing"
+ detail = (
+ f"Activation std shrinks to {ratio:.2f}x from early to late layers. "
+ f"Init scale may be too small."
+ )
+ else:
+ detail = f"Std ratio (last/first third) = {ratio:.2f} — within normal range."
+
+ # Check for sudden jumps
+ for i in range(1, len(stds)):
+ jump = stds[i] / max(stds[i - 1], 1e-8)
+ if jump > 10 or jump < 0.1:
+ diagnosis = "anomaly"
+ detail = (
+ f"Sudden activation change at layer {i}: "
+ f"std {stds[i - 1]:.4f} → {stds[i]:.4f}. "
+ f"Possible implementation bug in that layer."
+ )
+ break
+
+ icon = {"healthy": "✅", "exploding": "🔴", "vanishing": "🟡", "anomaly": "🔴"}
+ print(f"\n {icon.get(diagnosis, '⚪')} Diagnosis: {diagnosis}")
+ print(f" {detail}")
+
+ return {
+ "level": 5,
+ "diagnosis": diagnosis,
+ "detail": detail,
+ "layer_stats": layer_stats,
+ "weight_issues": weight_issues,
+ }
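The expected init pattern printed above, under the GPT-2-style assumption stated in the code (base std 0.02, residual projections scaled down by 1/√(2·layers)), works out to a simple formula. `expected_init_std` is an illustrative helper, not part of the class:

```python
import math

def expected_init_std(n_layers: int, residual_proj: bool = False) -> float:
    """GPT-2-style init: N(0, 0.02); residual projections scaled by 1/sqrt(2*L)."""
    base = 0.02
    return base / math.sqrt(2 * n_layers) if residual_proj else base

# For a 24-layer model:
print(expected_init_std(24))                  # 0.02 for a general Linear
print(round(expected_init_std(24, True), 5))  # 0.00289 for o_proj / down_proj
```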
1178
+
+ # ───────────────────────────────────────────────────────────────
+ # Scenario Auto-Detection
+ # ───────────────────────────────────────────────────────────────
+
+ @staticmethod
+ def detect_scenario(
+ metrics_history: Dict[str, list],
+ vocab_size: int = 32000,
+ ) -> Dict[str, Any]:
+ """Auto-detect which debugging scenario applies.
+
+ Scenarios (from the guide):
+ A: Loss stuck at ~10.37 (doesn't decrease at all)
+ B: Loss was decreasing then suddenly NaN
+ C: Loss decreased to X then started increasing
+ D: Loss stuck at high value (e.g. 4.0 for 1B model)
+ """
+ print(_header("Scenario Auto-Detection"))
+
+ train_losses = metrics_history.get("train_loss", [])
+ val_losses = [v for v in metrics_history.get("val_loss", []) if v is not None]
+ expected_initial = math.log(vocab_size)
+
+ if len(train_losses) < 5:
+ print(" [!] Not enough data to detect scenario.")
+ return {"scenario": "unknown", "steps": []}
+
+ first_loss = train_losses[0]
+ last_loss = train_losses[-1]
+ has_nan = any(math.isnan(l) for l in train_losses)
+ # Guard against an all-NaN history (a bare min() would raise ValueError)
+ min_loss = min((l for l in train_losses if not math.isnan(l)), default=float("inf"))
+ min_loss_idx = next((i for i, l in enumerate(train_losses) if l == min_loss), 0)
+ loss_recovered = last_loss > min_loss + 0.3 and min_loss_idx < len(train_losses) * 0.8
+
+ scenario = "unknown"
+ steps: List[str] = []
+
+ # Scenario A: Loss stuck near initial value
+ if abs(last_loss - expected_initial) < 1.5 and abs(first_loss - last_loss) < 0.5:
+ scenario = "A"
+ steps = [
+ "1. Run single-batch overfit test → if it fails, model/loss has a bug",
+ "2. Check if gradients are zero → optimizer.step() may be missing",
+ "3. Verify input_ids/targets shift → data pipeline bug",
+ "4. Check LR → is it set to 0?",
+ "5. Check model.train() → eval mode changes norm/dropout behavior",
+ ]
+
+ # Scenario B: NaN appeared
+ elif has_nan:
+ nan_idx = next(i for i, l in enumerate(train_losses) if math.isnan(l))
+ scenario = "B"
+ steps = [
+ f"1. NaN appeared at step ~{nan_idx}. Check that batch's data for bad tokens",
+ "2. Check gradient norm just before NaN → was there a spike?",
+ "3. Check LR schedule → does NaN coincide with warmup end?",
+ "4. Check specific layer weights for Inf values",
+ "5. Try switching to fp32 to see if it's a mixed precision issue",
+ " (Pythia-1B had irrecoverable fp16 loss spikes → switched to bf16,",
+ " Biderman et al. 2023)",
+ ]
+
+ # Scenario C: Loss decreased then increased
+ elif loss_recovered:
+ scenario = "C"
+ steps = [
+ "1. Check Train and Val loss simultaneously:",
+ " - Both increasing → LR too high (check cosine decay)",
+ " - Only train increasing → data quality changed (streaming order)",
+ " - Only val increasing → overfitting started",
+ "2. Verify LR schedule is decaying as intended",
+ "3. Check data shuffling → same data repeating?",
+ ]
+
+ # Scenario D: Loss stuck at high value
+ elif last_loss > _EXPECTED_TRAIN_LOSS[1] and abs(last_loss - min_loss) < 0.3:
+ scenario = "D"
+ total_tokens = len(train_losses) * 262144 # approximate
+ steps = [
+ f"1. Check total tokens trained: ~{total_tokens / 1e9:.1f}B "
+ f"(need 5-10B for 1B model)",
+ "2. Compare with smaller model (100M) at same step → "
+ "if 100M is lower, 1B may have a bug",
+ "3. Run LR range test → current LR may not be optimal",
+ "4. Sample training data → check for noise, duplicates, low quality",
+ "5. Try different effective batch size (64 vs 128 vs 256)",
+ ]
+
+ else:
+ scenario = "none"
+ steps = ["Training appears normal. No specific scenario detected."]
+
+ label = {
+ "A": "Loss stuck at initial value (~10.37)",
+ "B": "Loss was decreasing, then NaN",
+ "C": "Loss decreased then started increasing",
+ "D": f"Loss stuck at high value (>{_EXPECTED_TRAIN_LOSS[1]})",
+ "none": "No problematic scenario detected",
+ "unknown": "Cannot determine",
+ }
+
+ print(f"\n Detected: Scenario {scenario} — {label.get(scenario, 'Unknown')}")
+ print(f"\n Recommended debugging steps:")
+ for step in steps:
+ print(f" {step}")
+
+ return {"scenario": scenario, "label": label.get(scenario, ""), "steps": steps}
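Scenario A hinges on the loss sitting at the uniform-distribution baseline ln(vocab_size), which is where the ~10.37 figure comes from for a 32k vocabulary. A minimal standalone check (the helper name is illustrative):

```python
import math

def is_stuck_at_init(last_loss: float, vocab_size: int, tol: float = 1.5) -> bool:
    """True if the loss is still near the random-guessing baseline ln(V)."""
    return abs(last_loss - math.log(vocab_size)) < tol

print(round(math.log(32000), 2))      # 10.37, the value quoted in Scenario A
print(is_stuck_at_init(10.4, 32000))  # True
print(is_stuck_at_init(4.0, 32000))   # False
```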
1286
+
+ # ───────────────────────────────────────────────────────────────
+ # Main Entry Point
+ # ───────────────────────────────────────────────────────────────
+
+ @staticmethod
+ def run_diagnostics(
+ model: nn.Module,
+ dataloader: DataLoader,
+ tokenizer: Any,
+ train_config: TrainConfig,
+ metrics_history: Dict[str, list],
+ device: torch.device,
+ dtype: torch.dtype = torch.bfloat16,
+ vocab_size: int = 32000,
+ levels: Optional[List[int]] = None,
+ ) -> Dict[str, Any]:
+ """Run the full 5-level debugging framework.
+
+ Args:
+ model: the LLM model
+ dataloader: training dataloader
+ tokenizer: tokenizer with encode/decode methods
+ train_config: TrainConfig instance
+ metrics_history: dict from MetricsTracker.history
+ device: torch device
+ dtype: mixed precision dtype
+ vocab_size: model vocabulary size
+ levels: which levels to run (default: all [0,1,2,3,4,5])
+
+ Returns:
+ Full diagnostic report dict.
+ """
+ if levels is None:
+ levels = [0, 1, 2, 3, 4, 5]
+
+ print("\n" + "═" * 60)
+ print(" LLM Loss Debugging Framework")
+ print(" Levels to run: " + ", ".join(str(l) for l in levels))
+ print("═" * 60)
+
+ report: Dict[str, Any] = {}
+
+ if 0 in levels:
+ report["level_0"] = LossDebugger.diagnose_status(vocab_size, metrics_history)
+ # If status is normal and only level 0 was explicitly requested, skip rest
+ if (
+ report["level_0"]["status"] == STATUS_NORMAL
+ and levels == [0]
+ ):
+ print("\n Training is healthy — no further debugging needed.")
+ return report
+
+ if 1 in levels:
+ report["level_1"] = LossDebugger.check_data_pipeline(
+ model, dataloader, tokenizer, vocab_size, device, dtype,
+ )
+
+ if 2 in levels:
+ report["level_2"] = LossDebugger.check_numerical_stability(
+ model, dataloader, device, dtype,
+ )
+
+ if 3 in levels:
+ report["level_3"] = LossDebugger.diagnose_hyperparameters(
+ metrics_history, train_config,
+ )
+
+ if 4 in levels:
+ model_params = sum(p.numel() for p in model.parameters())
+ total_tokens = len(metrics_history.get("train_loss", [])) * train_config.tokens_per_step
+ report["level_4"] = LossDebugger.diagnose_fitting(
+ metrics_history, model_params, total_tokens,
+ )
+
+ if 5 in levels:
+ report["level_5"] = LossDebugger.check_architecture(
+ model, dataloader, device,
+ )
+
+ # Auto-detect scenario
+ report["scenario"] = LossDebugger.detect_scenario(metrics_history, vocab_size)
+
+ # Final summary
+ print("\n" + "═" * 60)
+ print(" Diagnostics Complete")
+ print("═" * 60)
+
+ return report
1375
+
+    # ─────────────────────────────────────────────────────────────────
+    # Study Roadmap
+    # ─────────────────────────────────────────────────────────────────
+
+    @staticmethod
+    def print_study_roadmap() -> None:
+        """Print the recommended study roadmap for LLM training optimization."""
+        print(_header("Study Roadmap — LLM Training Optimization"))
+
+        print("""
+    ⭐⭐⭐ Top Priority: Optimization Fundamentals
+    ─────────────────────────────────────────────
+    1. SGD → Momentum → Adam → AdamW progression
+       - Why Adam > SGD? Why decouple weight decay in AdamW?
+       - β₁, β₂ intuition (1st / 2nd moment estimates)
+       - Ref: Loshchilov & Hutter 2019 (AdamW)
+       - Ref: Karpathy "A Recipe for Training Neural Networks"
+
+    2. Loss Landscape
+       - Why large LR diverges, small LR stalls
+       - Batch size effect on landscape exploration
+       - Ref: Li et al. 2018 "Visualizing the Loss Landscape"
+       - Ref: McCandlish et al. 2018 "Large-Batch Training"
+
+    3. Chinchilla Scaling Law
+       - Loss = f(N, D) relationship
+       - Compute-optimal model size vs data allocation
+       - Ref: Hoffmann et al. 2022 (original)
+       - Ref: Kaplan et al. 2020 (predecessor)
+       - Ref: Besiroglu et al. 2024 (replication/verification)
+
+    ⭐⭐ Important: Training Stability
+    ──────────────────────────────────
+    4. Gradient Flow: vanishing/exploding, residual as gradient highway
+    5. Weight Init: Xavier / Kaiming / GPT-2 style
+    6. Normalization: BatchNorm → LayerNorm → RMSNorm
+    7. Weight Decay: L2 vs decoupled, why exclude embed/norm
+
+    ⭐ Advanced: Optimization Techniques
+    ─────────────────────────────────────
+    8. LR Schedules: cosine vs linear vs step, warmup/cooldown
+    9. Gradient Accumulation & Large Batch Training
+    10. μP (Maximal Update Parameterization): transfer HP across scales
+
+    Recommended Experiments (in order):
+    ───────────────────────────────────
+    1. Single-batch overfit (30 min)        → basic sanity
+    2. LR Range Test (1 hour)               → optimal LR range
+    3. 10M model quick train (2-3 hrs)      → pipeline validation
+    4. Ablation (remove components) (1 day) → component contribution
+    5. 100M model + 5B tokens (1-2 days)    → mid-scale dynamics
+    6. 1B model full training (2-3 days)    → scaling law verification
+    7. LR / batch size comparison (1 day)   → HP sensitivity
+
+    Key References:
+    ───────────────
+    ⭐⭐⭐ Karpathy "Recipe for Training NNs"  — debugging mindset
+    ⭐⭐⭐ Hoffmann et al. 2022 (Chinchilla)   — scaling law
+    ⭐⭐  Touvron et al. 2023 (LLaMA)          — 1B+ training details
+    ⭐⭐  Biderman et al. 2023 (Pythia)        — open training logs
+    ⭐⭐  Zhang et al. 2024 (TinyLlama)        — 1.1B on 3T tokens
+    ⭐⭐  Groeneveld et al. 2024 (OLMo)        — fully open LLM
+    ⭐⭐  Li et al. 2018 (Loss Landscape)      — loss terrain intuition
+    ⭐⭐  Loshchilov & Hutter 2019 (AdamW)     — optimizer basics
+    ⭐   Yang et al. 2022 (μP)                — HP transfer
+    ⭐   McCandlish et al. 2018 (Batch size)  — critical batch size
+    """)
notebooks/05_debugging.ipynb ADDED
@@ -0,0 +1,384 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 4,
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10.0"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 05. Loss Debugging (5-Level Diagnostic Framework)\n",
+    "\n",
+    "A framework for systematically diagnosing the cause when training loss does not decrease as expected.\n",
+    "\n",
+    "**Always check from the lowest level up** — catch data bugs before tuning hyperparameters.\n",
+    "\n",
+    "```\n",
+    "Level 0: Status Diagnosis       ← classify current training health (6 categories)\n",
+    "Level 1: Data / Implementation  ← the most common cause (70%)\n",
+    "Level 2: Numerical Stability    ← NaN/Inf, activation explosion\n",
+    "Level 3: Hyperparameters        ← LR, batch size, warmup\n",
+    "Level 4: Fitting Diagnosis      ← overfitting vs underfitting\n",
+    "Level 5: Architecture           ← initialization, per-layer activations\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# No extra packages to install\n",
+    "# LossDebugger uses only torch and the built-in llm_lab modules"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "sys.path.insert(0, '..')\n",
+    "\n",
+    "import math\n",
+    "import torch\n",
+    "\n",
+    "from llm_lab.config import ModelConfig, DataConfig, TrainConfig\n",
+    "from llm_lab.model import LLMModel\n",
+    "from llm_lab.data import setup_data_pipeline\n",
+    "from llm_lab.training import LossDebugger\n",
+    "from llm_lab.utils import auto_configure, get_device"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 0. Setup\n",
+    "\n",
+    "Uses the `debug_10m` preset so everything runs quickly even on CPU."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# --- Config ---\n",
+    "model_config = ModelConfig.debug_10m()\n",
+    "data_config = DataConfig(\n",
+    "    max_seq_len=model_config.max_seq_len,\n",
+    "    batch_size=4,\n",
+    ")\n",
+    "train_config = TrainConfig.debug_10m()\n",
+    "\n",
+    "# --- Device / dtype ---\n",
+    "train_config = auto_configure(train_config)\n",
+    "device = get_device()\n",
+    "dtype = train_config.torch_dtype\n",
+    "\n",
+    "vocab_size = data_config.vocab_size\n",
+    "print(f\"Device: {device}, dtype: {dtype}\")\n",
+    "print(f\"Vocab size: {vocab_size:,}\")\n",
+    "print(f\"Expected initial loss: ln({vocab_size}) = {math.log(vocab_size):.2f}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# --- Model ---\n",
+    "model = LLMModel(model_config).to(device)\n",
+    "print(f\"Model parameters: {model.count_parameters():,}\")\n",
+    "\n",
+    "# --- Data pipeline ---\n",
+    "tokenizer, train_dl, val_dl = setup_data_pipeline(\n",
+    "    tokenizer_mode=\"pretrained\",\n",
+    "    config=data_config,\n",
+    ")"
+   ]
+  },
+ },
116
+ {
117
+ "cell_type": "markdown",
118
+ "metadata": {},
119
+ "source": [
120
+ "### 0.1 ํ•™์Šต ์ด๋ ฅ (Mock Metrics History)\n",
121
+ "\n",
122
+ "์‹ค์ œ ํ•™์Šต ์—†์ด๋„ Level 0 / 3 / 4 / ์‹œ๋‚˜๋ฆฌ์˜ค ๊ฐ์ง€๋ฅผ ํ…Œ์ŠคํŠธํ•  ์ˆ˜ ์žˆ๋„๋ก\n",
123
+ "mock `metrics_history`๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.\n",
124
+ "\n",
125
+ "์‹ค์ œ ํ•™์Šต ํ›„์—๋Š” `trainer.metrics.history`๋ฅผ ๋Œ€์‹  ์‚ฌ์šฉํ•˜์„ธ์š”:\n",
126
+ "```python\n",
127
+ "# metrics_history = trainer.metrics.history\n",
128
+ "```"
129
+ ]
130
+ },
131
+ {
132
+ "cell_type": "code",
133
+ "execution_count": null,
134
+ "metadata": {},
135
+ "outputs": [],
136
+ "source": "import random\nrandom.seed(42)\n\nexpected_initial = math.log(vocab_size) # ~10.37\n\n# --- Scenario A: loss stuck near ln(vocab_size) ---\n# loss๊ฐ€ ๊ฑฐ์˜ ์ค„์–ด๋“ค์ง€ ์•Š๋Š” ์ƒํ™ฉ (๋ฐ์ดํ„ฐ/๊ตฌํ˜„ ๋ฒ„๊ทธ ์˜์‹ฌ)\nn_steps_a = 200\nmock_history_a = {\n \"step\": list(range(1, n_steps_a + 1)),\n \"train_loss\": [expected_initial - 0.01 * i + random.uniform(-0.05, 0.05)\n for i in range(n_steps_a)],\n \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_a)],\n \"grad_norm\": [random.uniform(0.3, 0.8) for _ in range(n_steps_a)],\n \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_a)],\n \"gpu_mem_gb\": [random.uniform(1.0, 2.0) for _ in range(n_steps_a)],\n \"val_loss\": [expected_initial + random.uniform(-0.1, 0.1)\n for _ in range(0, n_steps_a, 50)],\n \"val_ppl\": [math.exp(expected_initial) + random.uniform(-50, 50)\n for _ in range(0, n_steps_a, 50)],\n}\nprint(f\"Mock A โ€” train_loss: {mock_history_a['train_loss'][0]:.2f} -> {mock_history_a['train_loss'][-1]:.2f}\")\nprint(f\" Expected: NO_DECREASE (loss barely changes)\")\n\n# --- Scenario B: loss decreasing then NaN ---\nn_steps_b = 200\nmock_history_b = {\n \"step\": list(range(1, n_steps_b + 1)),\n \"train_loss\": [expected_initial - 0.03 * i + random.uniform(-0.05, 0.05)\n for i in range(150)]\n + [float('nan')] * 50,\n \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_b)],\n \"grad_norm\": [random.uniform(0.3, 0.8) for _ in range(145)]\n + [random.uniform(5.0, 50.0) for _ in range(5)]\n + [float('nan')] * 50,\n \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_b)],\n \"gpu_mem_gb\": [random.uniform(1.0, 2.0) for _ in range(n_steps_b)],\n \"val_loss\": [expected_initial - 0.03 * i + random.uniform(-0.1, 0.1)\n for i in range(0, n_steps_b, 50)],\n \"val_ppl\": [math.exp(expected_initial - 0.03 * i)\n for i in range(0, n_steps_b, 50)],\n}\nprint(f\"\\nMock B โ€” train_loss starts normal, then NaN at step 
~150\")\nprint(f\" Expected: Scenario B (NaN detected)\")\n\n# --- Scenario C: loss decreased then increased ---\nn_steps_c = 200\nmock_history_c = {\n \"step\": list(range(1, n_steps_c + 1)),\n \"train_loss\": [expected_initial - 0.04 * i + random.uniform(-0.05, 0.05)\n for i in range(120)]\n + [expected_initial - 0.04 * 120 + 0.02 * (i - 120) + random.uniform(-0.05, 0.05)\n for i in range(120, n_steps_c)],\n \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_c)],\n \"grad_norm\": [random.uniform(0.3, 0.8) for _ in range(n_steps_c)],\n \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_c)],\n \"gpu_mem_gb\": [random.uniform(1.0, 2.0) for _ in range(n_steps_c)],\n \"val_loss\": [expected_initial - 0.02 * i + random.uniform(-0.1, 0.1)\n for i in range(0, 120, 50)]\n + [expected_initial - 0.02 * 120 + 0.03 * (i - 120) + random.uniform(-0.1, 0.1)\n for i in range(120, n_steps_c, 50)],\n \"val_ppl\": [math.exp(expected_initial - 0.02 * i)\n for i in range(0, n_steps_c, 50)],\n}\nprint(f\"\\nMock C โ€” train_loss: decrease โ†’ increase (bounce)\")\nprint(f\" Expected: Scenario C (loss recovery)\")\n\n# --- Scenario D: loss stuck at high value (4.0) ---\nn_steps_d = 200\nmock_history_d = {\n \"step\": list(range(1, n_steps_d + 1)),\n \"train_loss\": [4.0 + random.uniform(-0.1, 0.1) for _ in range(n_steps_d)],\n \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_d)],\n \"grad_norm\": [random.uniform(0.1, 0.3) for _ in range(n_steps_d)],\n \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_d)],\n \"gpu_mem_gb\": [random.uniform(1.0, 2.0) for _ in range(n_steps_d)],\n \"val_loss\": [4.2 + random.uniform(-0.1, 0.1)\n for _ in range(0, n_steps_d, 50)],\n \"val_ppl\": [math.exp(4.2) + random.uniform(-5, 5)\n for _ in range(0, n_steps_d, 50)],\n}\nprint(f\"\\nMock D โ€” train_loss stuck at ~4.0\")\nprint(f\" Expected: Scenario D (plateau at high value)\")\n\n# --- Normal: loss decreasing normally 
---\nn_steps_n = 200\nmock_history_normal = {\n \"step\": list(range(1, n_steps_n + 1)),\n \"train_loss\": [expected_initial - 0.03 * i + random.uniform(-0.05, 0.05)\n for i in range(n_steps_n)],\n \"learning_rate\": [1e-3 * min(i / 200, 1.0) for i in range(n_steps_n)],\n \"grad_norm\": [random.uniform(0.3, 0.8) for _ in range(n_steps_n)],\n \"tokens_per_sec\":[random.uniform(8000, 12000) for _ in range(n_steps_n)],\n \"gpu_mem_gb\": [random.uniform(1.0, 2.0) for _ in range(n_steps_n)],\n \"val_loss\": [expected_initial - 0.03 * i + random.uniform(-0.1, 0.1)\n for i in range(0, n_steps_n, 50)],\n \"val_ppl\": [math.exp(expected_initial - 0.03 * i)\n for i in range(0, n_steps_n, 50)],\n}\nprint(f\"\\nMock Normal โ€” train_loss: {mock_history_normal['train_loss'][0]:.2f} -> {mock_history_normal['train_loss'][-1]:.2f}\")\nprint(f\" Expected: NORMAL (loss decreasing steadily)\")"
137
+ },
138
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Level 0 — Status Diagnosis\n",
+    "\n",
+    "Classifies the current training state into one of six categories using only `metrics_history`:\n",
+    "\n",
+    "| Status | Meaning | Severity |\n",
+    "|--------|---------|----------|\n",
+    "| `NORMAL` | training normally | green |\n",
+    "| `NO_DECREASE` | loss is not decreasing | red |\n",
+    "| `DIVERGING` | loss is diverging | red |\n",
+    "| `PLATEAU` | loss stuck at a high value | yellow |\n",
+    "| `OVERFITTING` | train↓ val↑ | yellow |\n",
+    "| `UNSTABLE` | loss fluctuates heavily | yellow |"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Scenario A (problem case)\n",
+    "status_a = LossDebugger.diagnose_status(vocab_size, mock_history_a)\n",
+    "print(f\"\\n>>> Result: {status_a['status']}\")\n",
+    "\n",
+    "print(\"\\n\" + \"-\" * 40)\n",
+    "\n",
+    "# Normal (healthy case)\n",
+    "status_n = LossDebugger.diagnose_status(vocab_size, mock_history_normal)\n",
+    "print(f\"\\n>>> Result: {status_n['status']}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Level 1 — Data / Implementation Checks\n",
+    "\n",
+    "**~70% of loss problems are data pipeline bugs.** Six checks:\n",
+    "\n",
+    "1. **Shift relation** — verify `targets[t] == input_ids[t+1]`\n",
+    "2. **Token range** — verify `0 <= ids < vocab_size`\n",
+    "3. **Initial loss** — verify `≈ ln(vocab_size)` with random weights\n",
+    "4. **Single-batch overfit** — repeat one batch for 200 steps → loss should reach ≈ 0\n",
+    "5. **Tokenizer round-trip** — verify encode → decode preserves the text\n",
+    "6. **Data quality** — eyeball sample text"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "level1 = LossDebugger.check_data_pipeline(\n",
+    "    model=model,\n",
+    "    dataloader=train_dl,\n",
+    "    tokenizer=tokenizer,\n",
+    "    vocab_size=vocab_size,\n",
+    "    device=device,\n",
+    "    dtype=dtype,\n",
+    ")\n",
+    "\n",
+    "print(f\"\\nPassed: {len(level1['passed'])}, Failed: {len(level1['failed'])}\")\n",
+    "for f in level1['failed']:\n",
+    "    print(f\"  FAILED: {f['name']} — {f['detail']}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## 3. Level 2 — Numerical Stability\n\nRuns one forward + backward pass and checks:\n\n- **Mixed precision config** — whether RMSNorm computes in fp32; the loss dtype\n- **Gradients** — NaN/Inf/unusually large gradients\n- **Activations** — mean/std/max of each transformer layer\n- **Logit scale** — output logits in a reasonable range (< 1000)\n\n### Mixed Precision Best Practices\n\n```python\n# ❌ Risky: adding a large and a small value in bf16 loses the small one\n# ✅ Fix: compute the RMSNorm internals in float32\nx_float = x.float()  # bf16 → fp32\nrms = torch.rsqrt(x_float.pow(2).mean(-1, keepdim=True) + eps)\nreturn (x_float * rms).to(x.dtype) * self.weight\n\n# ✅ Compute the loss in float32 as well:\nlogits_fp32 = logits.float()\nloss = F.cross_entropy(logits_fp32.view(-1, V), targets.view(-1))\n```\n\n### Common Numerical Issues\n\n| Symptom | Cause | Fix |\n|---------|-------|-----|\n| Loss → NaN | very large values fed into softmax | check logit scale, review initialization |\n| Loss → Inf | log of zero | add eps, set ignore_index |\n| Loss oscillates heavily | gradient underflow in fp16 | switch to bf16 or use GradScaler |\n| NaN late in training | activations grow gradually | verify RMSNorm works, verify weight decay is applied |"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "level2 = LossDebugger.check_numerical_stability(\n",
+    "    model=model,\n",
+    "    dataloader=train_dl,\n",
+    "    device=device,\n",
+    "    dtype=dtype,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## 4. Level 3 — Hyperparameter Diagnosis\n\nAnalyzes `metrics_history` and the `TrainConfig` to check:\n\n- **LR**: grad norm frequently hitting the clip limit → LR too high\n- **Batch size**: large loss variance → batch too small\n- **Warmup**: many early loss spikes → warmup too short\n- **β₂**: whether the LLM standard (0.95) is used instead of the PyTorch default (0.999)\n- **Weight decay**: 0 risks overfitting; the standard is 0.1\n\n### Tuning Priority (by impact)\n\n| Rank | Parameter | Impact | Notes |\n|------|-----------|--------|-------|\n| 1 | **Learning Rate** | 10x | always tune first |\n| 2 | **Batch Size** | high | adjust together with LR |\n| 3 | **Warmup Steps** | medium | early-phase stability |\n| 4 | **Weight Decay** | medium | adjust when overfitting (usually fixed at 0.1) |\n| 5 | **β₁, β₂** | low | β₂=0.95 recommended (LLaMA, TinyLlama, OLMo) |\n| 6 | **Gradient Clip** | low | usually fixed at 1.0 |\n\n### Choosing β₂\n\n- **PyTorch default** β₂=0.999 → averages gradient statistics over the last ~1000 steps\n- **LLM standard** β₂=0.95 → averages over the last ~20 steps\n- LLM training sees a constantly shifting data distribution, so faster adaptation (a lower β₂) is advantageous\n- A lower β₂ also helps dampen loss spikes (Cattaneo & Shigida 2025)\n\n### Batch-LR Scaling\n\n- Batch ×2 → LR ×√2 (square-root scaling, safer)\n- Batch ×2 → LR ×2 (linear scaling, used by GPT-3 and others)\n- For a 1B model, an effective batch of 64-512 is typical"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "level3 = LossDebugger.diagnose_hyperparameters(\n",
+    "    metrics_history=mock_history_a,\n",
+    "    config=train_config,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 4.1 LR Range Test\n",
+    "\n",
+    "Increases the LR exponentially from `1e-7` to `1e-1` while recording the loss.\n",
+    "The recommended peak LR is the LR at the steepest loss decrease, divided by 3.\n",
+    "\n",
+    "> This can take a while (~1 minute with debug_10m).\n",
+    "> Run it once before real training."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "lr_result = LossDebugger.lr_range_test(\n",
+    "    model=model,\n",
+    "    dataloader=train_dl,\n",
+    "    device=device,\n",
+    "    dtype=dtype,\n",
+    "    steps=100,  # debug_10m: 100 steps for speed\n",
+    ")\n",
+    "\n",
+    "print(f\"\\nSuggested peak LR: {lr_result['suggested_lr']:.2e}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## 5. Level 4 — Fitting Diagnosis\n\nCompares the train/val loss trends to distinguish four cases:\n\n| Case | Train | Val | Diagnosis |\n|------|-------|-----|-----------|\n| 1 | ↓ | ↓ | Normal (still learning) |\n| 2 | → | → | Underfitting (insufficient model/data) |\n| 3 | ↓ | ↑ | Overfitting (suspect data repetition) |\n| 4 | → | → (low) | Converged |\n\n### Analyzing underfitting, in order\n\n1. **Insufficient model capacity?** → train a 2x larger model on the same data → if loss drops, capacity was the problem\n2. **Not trained enough?** → check whether the loss curve is still trending down (Chinchilla: 1B → ~20B tokens)\n3. **LR too small, convergence slow?** → experiment with LR ×2\n4. **Data quality problem?** → sample the data and read it yourself\n\n### Analyzing overfitting, in order\n\n1. **Too little / repeated data?** → epochs > 1 sharply raises the overfitting risk\n2. **Too little weight decay?** → 0.1 is the standard (LLaMA, TinyLlama, GPT-3, OLMo)\n3. **Too little data diversity?** → mix in diverse domains\n\n### Key facts about overfitting in LLM pretraining\n\n- Within **a single epoch**, overfitting is very rare\n- When overfitting appears, the cause is almost always **data repetition** (epochs > 1)\n- **Dropout** is **almost never used** in modern LLM pretraining\n  - Pythia, TinyLlama, OLMo, and LLaMA all use dropout=0\n  - With enough data, the data itself is the best regularizer\n  - Dropout is useful when fine-tuning on small datasets"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model_params = sum(p.numel() for p in model.parameters())\n",
+    "total_tokens = len(mock_history_normal[\"train_loss\"]) * train_config.tokens_per_step\n",
+    "\n",
+    "level4 = LossDebugger.diagnose_fitting(\n",
+    "    metrics_history=mock_history_normal,\n",
+    "    model_params=model_params,\n",
+    "    total_tokens=total_tokens,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## 6. Level 5 — Architecture / Initialization Check\n\nPasses one input through the model and collects **per-layer activation statistics** and the **weight distribution**.\n\n### Activation diagnosis\n\n- **healthy**: std stays stable across layers\n- **exploding**: std grows sharply toward later layers → init scale too large\n- **vanishing**: std shrinks sharply toward later layers → init scale too small\n- **anomaly**: abrupt change at one specific layer → implementation bug in that layer\n\n### Weight initialization diagnosis\n\nGPT-2 style initialization:\n- **Regular Linear layers**: `N(0, 0.02)`\n- **Residual projections** (o_proj, down_proj): `N(0, 0.02/√(2×layers))`\n  → shrinks each residual contribution to keep deep models stable\n\n### Ablation Study (per-component impact)\n\n| Experiment | Expected result | If abnormal |\n|------------|-----------------|-------------|\n| RMSNorm → LayerNorm | negligible loss difference | normalization bug |\n| RoPE → absolute position embeddings | small difference on short sequences | check RoPE implementation |\n| SwiGLU → ReLU FFN | loss +0.05-0.15 | check SwiGLU implementation |\n| GQA → MHA | nearly identical loss (only memory differs) | KV repeat bug |"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "level5 = LossDebugger.check_architecture(\n",
+    "    model=model,\n",
+    "    dataloader=train_dl,\n",
+    "    device=device,\n",
+    ")\n",
+    "\n",
+    "print(f\"\\n>>> Diagnosis: {level5['diagnosis']}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## 7. Scenario Auto-Detection\n\nAnalyzes `metrics_history` to decide which of four scenarios applies:\n\n| Scenario | Symptom | Likely cause | Key diagnostic |\n|----------|---------|--------------|----------------|\n| **A** | loss stuck near ~10.37 | data/implementation bug | single-batch overfit test |\n| **B** | loss decreasing, then sudden NaN | numerical instability | grad norm right before the NaN |\n| **C** | loss decreases, then rises again | LR/data problem | check train and val together |\n| **D** | loss plateaus at a high value | undertrained / LR problem | check token count, run LR Range Test |\n\n### Per-scenario diagnostic checklist\n\n**Scenario A**: \"Loss sits at 10.37 and never drops\"\n1. Single-batch overfit → if it fails, there is a model/loss bug\n2. Are the gradients zero? → missing `optimizer.step()`?\n3. Check the `input_ids/targets` shift → data pipeline bug\n4. Check the LR → make sure it is not set to 0\n5. Verify `model.train()` is called\n\n**Scenario B**: \"Loss decreases, then suddenly NaN\"\n1. Inspect the batch at that step\n2. Look for a gradient-norm spike right before the NaN\n3. Compare the LR schedule against the NaN step\n4. Mixed precision problem → try to reproduce in fp32\n   - (Pythia-1B switched fp16 → bf16; Biderman et al. 2023)\n\n**Scenario C**: \"Loss fell to 3.5, then climbed back up\"\n1. Check train and val together:\n   - both rising → LR too high\n   - only train rising → data quality shift (streaming order)\n   - only val rising → overfitting is starting\n2. Check the LR schedule; check data shuffling\n\n**Scenario D**: \"Loss is stuck at 4.0\"\n1. Check tokens trained (under 5B suggests undertraining)\n2. Compare against a 100M model\n3. Run the LR Range Test\n4. Sample the data quality"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# --- Scenario A ---\nprint(\"=\" * 50)\nprint(\"Testing Scenario A (loss stuck at initial value)\")\nprint(\"=\" * 50)\nscenario_a = LossDebugger.detect_scenario(\n    metrics_history=mock_history_a,\n    vocab_size=vocab_size,\n)\nprint(f\"\\n>>> Detected: Scenario {scenario_a['scenario']}\")\n\n# --- Scenario B ---\nprint(\"\\n\" + \"=\" * 50)\nprint(\"Testing Scenario B (NaN appeared)\")\nprint(\"=\" * 50)\nscenario_b = LossDebugger.detect_scenario(\n    metrics_history=mock_history_b,\n    vocab_size=vocab_size,\n)\nprint(f\"\\n>>> Detected: Scenario {scenario_b['scenario']}\")\n\n# --- Scenario C ---\nprint(\"\\n\" + \"=\" * 50)\nprint(\"Testing Scenario C (loss bounce)\")\nprint(\"=\" * 50)\nscenario_c = LossDebugger.detect_scenario(\n    metrics_history=mock_history_c,\n    vocab_size=vocab_size,\n)\nprint(f\"\\n>>> Detected: Scenario {scenario_c['scenario']}\")\n\n# --- Scenario D ---\nprint(\"\\n\" + \"=\" * 50)\nprint(\"Testing Scenario D (loss plateau)\")\nprint(\"=\" * 50)\nscenario_d = LossDebugger.detect_scenario(\n    metrics_history=mock_history_d,\n    vocab_size=vocab_size,\n)\nprint(f\"\\n>>> Detected: Scenario {scenario_d['scenario']}\")\n\n# --- Normal ---\nprint(\"\\n\" + \"=\" * 50)\nprint(\"Testing Normal scenario\")\nprint(\"=\" * 50)\nscenario_n = LossDebugger.detect_scenario(\n    metrics_history=mock_history_normal,\n    vocab_size=vocab_size,\n)\nprint(f\"\\n>>> Detected: Scenario {scenario_n['scenario']}\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. Full Diagnostics (run_diagnostics)\n",
+    "\n",
+    "Runs all of the levels above in one call. The `levels` parameter selects which levels to run.\n",
+    "\n",
+    "```python\n",
+    "# Example usage after real training:\n",
+    "# report = LossDebugger.run_diagnostics(\n",
+    "#     model=model, dataloader=train_dl, tokenizer=tokenizer,\n",
+    "#     train_config=train_config,\n",
+    "#     metrics_history=trainer.metrics.history,\n",
+    "#     device=device, dtype=dtype,\n",
+    "# )\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "report = LossDebugger.run_diagnostics(\n",
+    "    model=model,\n",
+    "    dataloader=train_dl,\n",
+    "    tokenizer=tokenizer,\n",
+    "    train_config=train_config,\n",
+    "    metrics_history=mock_history_a,\n",
+    "    device=device,\n",
+    "    dtype=dtype,\n",
+    "    vocab_size=vocab_size,\n",
+    "    levels=[0, 1, 2, 3, 4, 5],\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": "## 9. Study Roadmap\n\nA structured study path for LLM training optimization.",
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "source": "LossDebugger.print_study_roadmap()",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## 10. Debugging Tips\n\n**Recommended order:**\n1. Level 0 for status → tells you which level to inspect\n2. Level 1 (data) → first! 70% of problems are found here\n3. Level 2 (numerical stability) → resolve NaN/Inf\n4. Level 3 (hyperparameters) → tune the LR first (10x impact)\n5. Level 4 (fitting diagnosis) → check only after training long enough\n6. Level 5 (architecture) → when the levels above find nothing\n\n**Scenario Detection** → classifies the scenario automatically from the symptoms\n\n### Key References\n\n| Reference | Topic | Priority |\n|-----------|-------|----------|\n| Karpathy \"Recipe for Training NNs\" | hands-on debugging mindset | ⭐⭐⭐ |\n| Hoffmann et al. 2022 (Chinchilla) | core scaling law | ⭐⭐⭐ |\n| Kaplan et al. 2020 (Scaling Laws) | loss prediction formula | ⭐⭐⭐ |\n| Touvron et al. 2023 (LLaMA) | 1B-scale training details | ⭐⭐ |\n| Biderman et al. 2023 (Pythia) | open training logs, reproducibility | ⭐⭐ |\n| Zhang et al. 2024 (TinyLlama) | 1.1B model on 3T tokens | ⭐⭐ |\n| Groeneveld et al. 2024 (OLMo) | fully open LLM framework | ⭐⭐ |\n| Li et al. 2018 (Loss Landscape) | loss-terrain intuition | ⭐⭐ |\n| Loshchilov & Hutter 2019 (AdamW) | optimizer fundamentals | ⭐⭐ |\n| Yang et al. 2022 (μP) | hyperparameter transfer | ⭐ |\n\n---\n**Previous step:** train the model in `03_training.ipynb`.  \n**Next step:** evaluate the trained model in `04_evaluation.ipynb`."
+  }
+ ]
+}
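The LR Range Test heuristic the notebook describes (exponential LR sweep, then peak LR = LR at the steepest loss drop ÷ 3) can be sketched independently of `llm_lab`. The helper names below are illustrative, not `LossDebugger`'s actual API:

```python
import math

def lr_at(step: int, steps: int, lr_min: float = 1e-7, lr_max: float = 1e-1) -> float:
    """Exponential sweep: lr_min at step 0, lr_max at the final step."""
    t = step / max(steps - 1, 1)
    return lr_min * (lr_max / lr_min) ** t

def suggest_peak_lr(lrs: list, losses: list) -> float:
    """Peak-LR heuristic: LR at the steepest single-step loss drop, divided by 3."""
    drops = [losses[i - 1] - losses[i] for i in range(1, len(losses))]
    steepest = max(range(len(drops)), key=drops.__getitem__)
    return lrs[steepest + 1] / 3

# Synthetic sanity check: loss falls fastest mid-sweep, then diverges at high LR
steps = 8
lrs = [lr_at(i, steps) for i in range(steps)]
losses = [10.4, 10.3, 10.0, 9.0, 7.0, 6.5, 8.0, 12.0]  # steepest drop into index 4
print(f"suggested peak LR: {suggest_peak_lr(lrs, losses):.2e}")
```

In a real sweep the `losses` list would come from stepping the model once per LR value; the divide-by-3 margin keeps the chosen peak LR safely below the edge of divergence.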