docs: adding detailed docs for agent_prompt, grade and scenarios

- docs/agent_prompt.md +260 -0
- docs/grader_and_judge.md +327 -0
- docs/scenarios.md +340 -0

docs/agent_prompt.md
ADDED
# Agent Prompt - Design Reference

The agent prompt is the single most important lever in the inference pipeline. It determines whether the agent inspects the right sources, picks the right label, includes a valid fix, and submits at the right time. This document explains every section of the system prompt, what it does, and why it was written that way.

The prompt lives in `inference.py` as `SYSTEM_PROMPT`. It is sent as the `system` role message on every API call.

---

## Overview

The prompt has seven sections, each targeting a different failure mode in agent behaviour:
| Section | What it controls |
|---|---|
| Role + Action Space | Frames the task; lists valid actions |
| Output Format | Forces JSON-only output |
| Diagnosis Process | Step-by-step investigation procedure |
| Label Decision Rules | Exact rules for picking the correct label |
| Null Data Rule | Handles missing gradient data gracefully |
| Stop Rules | When to stop inspecting and submit |
| Rules | Final constraints and exact label vocabulary |

---
## Section 1 - Role and Action Space

```
You are a machine learning engineer diagnosing a failed training run.
Each turn you receive data and must decide what to investigate next.

Available actions:
  inspect_logs - examine training loss/accuracy curves
  inspect_config - examine hyperparameter config (lr, optimizer, etc.)
  inspect_gradients - examine gradient norm statistics
  submit_diagnosis - submit your final diagnosis (ends the episode)
```

**What it does:**
Sets the persona and enumerates the action space.

**Why it's written this way:**
LLMs perform better when given a concrete professional role rather than an abstract task description. "You are an ML engineer" anchors the model in a domain where it has relevant training data. Listing the four actions explicitly prevents the model from hallucinating action names or attempting actions that don't exist in the environment.

The one-line description after each action (`examine training loss/accuracy curves`) is there so the model knows what data each action returns before it takes the action. Without this, the model may inspect sources in the wrong order or inspect the same source twice looking for something it already saw.

---
## Section 2 - Output Format

```
OUTPUT FORMAT - STRICT:
Output ONLY a raw JSON object. No markdown, no code fences, no backticks, no explanation.
Start with { and end with }. One line only.

Examples:
{"action_type": "inspect_logs"}
{"action_type": "submit_diagnosis", "diagnosis": "overfitting", "suggested_fix": "add dropout=0.3 and weight_decay=0.01", "reasoning": "train_loss fell to 0.03 by epoch 20 while val_loss rose to 2.34; train_acc=0.99 vs val_acc=0.54 → clear generalization gap. Config shows dropout=0.0 and weight_decay=0.0."}
```

**What it does:**
Constrains the model to produce parseable JSON with no surrounding text.

**Why it's written this way:**
The inference loop calls `json.loads()` on the model's output directly. Any markdown, preamble, or explanation causes a parse error and falls back to a default action. The "STRICT" label and the explicit instructions about `{` / `}` and "one line only" are there because frontier models default to markdown code blocks when asked to output JSON in a conversational context.

The two examples serve a second purpose: they show the model the difference between an inspection action (just `action_type`) and a diagnosis action (all four fields). Without an example, models sometimes include all fields on every action or omit fields on diagnosis. The `response_format={"type": "json_object"}` in the API call enforces JSON at the tokeniser level, but the examples set expectations about structure.

The `reasoning` example is intentionally long and specific. It models the desired output: quote exact numbers, name the metric, state the conclusion. This shapes the judge score - the LLM judge rewards reasoning that cites specific values.
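Since the loop parses the model's output with `json.loads()` and falls back on failure, the parse step can be sketched as follows. This is a minimal illustration, not the actual `inference.py` code; the function name and the default action are assumptions:

```python
import json

# Hypothetical default used when the model's output is unparseable.
DEFAULT_ACTION = {"action_type": "inspect_logs"}

def parse_action(raw: str) -> dict:
    """Parse the model's raw output; fall back to a default action on failure."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        # Markdown fences, preambles, or explanations all land here.
        return DEFAULT_ACTION
    if not isinstance(action, dict) or "action_type" not in action:
        return DEFAULT_ACTION
    return action
```

Anything that is valid JSON but not an action object (a list, a bare string) also falls back, so one malformed turn never crashes the episode.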
---

## Section 3 - Diagnosis Process

```
DIAGNOSIS PROCESS - follow this every episode:
1. Call inspect_logs first - always.
2. Read the Data field carefully. Note the exact numeric values (loss, acc, lr, gradient norms, model).
3. If Feedback says "Next required action: inspect_X" → call that action next, no exceptions.
4. When no required actions remain, form your diagnosis based ONLY on values you actually saw in Data.
5. Your reasoning MUST quote specific numbers from the Data you received (e.g. "val_loss=2.34 at epoch 20, train_acc=0.99"). If you cannot quote a specific number from the Data, you have not read it - do not submit yet.
```

**What it does:**
Provides a deterministic five-step procedure the agent should follow every episode.

**Why each step is there:**

**Step 1 - `inspect_logs first - always`:**
Logs are the primary diagnostic signal in every scenario across all difficulty tiers. Starting elsewhere wastes a step and costs efficiency points. Making this an unconditional rule removes any ambiguity about where to begin.

**Step 2 - `Note the exact numeric values`:**
Models tend to read observations superficially and reason from priors ("loss was high, must be underfitting") rather than from the actual numbers in the data. Explicitly instructing the model to note specific values primes it to treat the data as ground truth rather than confirmation of a hypothesis it already had.

**Step 3 - `If Feedback says "Next required action: inspect_X" → call that action next, no exceptions`:**
The environment's feedback field gives explicit guidance on what to inspect next when required sources remain. This rule ensures the agent follows that guidance rather than submitting early. The phrase "no exceptions" is intentional - without a strong constraint here, models sometimes skip to `submit_diagnosis` after seeing one confident pattern in the logs, missing required config or gradient evidence.

**Step 4 - `based ONLY on values you actually saw`:**
This is an anti-hallucination constraint. Without it, models sometimes cite values from their training data (e.g. "the learning rate of 1e-3 is too low for Adam") rather than values they observed in the episode. The diagnosis must be grounded in what was returned by the environment.

**Step 5 - the `quote specific numbers` requirement:**
This does two things. First, it forces the model to verify it actually read the data before submitting. Second, it directly improves the LLM judge score - the judge's `evidence_grounding` criterion rewards reasoning that cites specific observed values. Without this instruction, models often submit correct diagnoses with vague reasoning like "the loss was high" rather than "val_loss=2.74 at epoch 20".

---
## Section 4 - Label Decision Rules

```
LABEL DECISION RULES - use these to pick the exact diagnosis label:
- train_loss is NaN from epoch 1 AND config shows extreme weight_init (e.g. std=100) AND gradient norms are massive (>10000) → "bad weight initialization". Check config FIRST before applying the NaN rule below.
- train_loss is NaN or inf AFTER at least one finite epoch → "exploding gradients". ABSOLUTE RULE. No other label applies.
- loss oscillates wildly epoch-to-epoch but stays finite (no NaN) AND config shows batch_size ≤ 4 → "batch size too small" (NOT "learning rate too high"). PRIORITY RULE: check batch_size in config before applying the oscillation → lr rule.
- loss oscillates wildly epoch-to-epoch but stays finite (no NaN) AND config shows batch_size > 4 → "learning rate too high"
- both train_loss AND val_loss stay high with no gap (train_acc ≈ val_acc, both near random baseline ~10%) AND config shows SGD optimizer with momentum=0.0 → "optimizer misconfiguration" (NOT "underfitting"). Check config for SGD momentum before applying the underfitting rule.
- both train_loss AND val_loss stay high with no gap (train_acc ≈ val_acc, both near random baseline ~10%) AND config does NOT show SGD with momentum=0.0 → "underfitting". ABSOLUTE RULE. Do NOT wait for gradients. Submit immediately after seeing the logs.
- train_loss low, val_loss rising AND config shows weight_decay=0.0 exactly AND dropout=0.0 exactly → "missing regularization" (NOT "overfitting")
- train_loss low, val_loss rising AND config shows ANY non-zero weight_decay OR ANY non-zero dropout → "overfitting" (NOT "missing regularization")
- gradient norm = 0.0 exactly in hidden layers AND config shows ReLU activation → "dying relu"
- gradient norm tiny but nonzero (e.g. 1e-5, 1e-8) AND config EXPLICITLY shows activation=sigmoid or activation=tanh → "vanishing gradients". Do NOT assume activation - it must be stated in the config data you actually received.
- config shows lr_scheduler with gamma > 1.0 → "lr scheduler misconfiguration"
- config shows weight_init with extreme std AND gradient norms >10000 → "bad weight initialization"
- config shows SGD optimizer with momentum=0.0 → "optimizer misconfiguration"
```

**What it does:**
Provides explicit conditional logic for every ambiguous pair of scenarios. This is the most critical section of the prompt.

**Why it exists:**
LLMs have strong priors from their training data. "Loss oscillates wildly" → the model's prior is "learning rate too high". "Both losses high" → prior is "underfitting". These priors are reasonable but wrong for some scenarios. The rules override the prior with conditional logic based on config values.

**Why each rule is structured the way it is:**

**`bad_weight_initialization` rule comes first:**
Both `bad_weight_initialization` and `exploding_gradients` produce NaN. The distinguishing condition is timing (NaN from epoch 1 vs after epoch 1) and config (extreme weight_init). By checking `bad_weight_initialization` first and making `exploding_gradients` an "AFTER at least one finite epoch" rule, the two cases are unambiguously separated. If the order were reversed, a model might apply the NaN→exploding rule and never check the config.

**`batch_size_too_small` priority rule:**
Oscillating loss is caused by both high LR and tiny batch sizes. The PRIORITY label means: before concluding "learning rate too high" from oscillation, check `batch_size` in the config. If `batch_size ≤ 4`, that overrides. This rule was added after observing the model consistently labelling `batch_size_too_small` as "learning rate too high" - the oscillation pattern was overriding the config evidence.

**`optimizer_misconfiguration` before `underfitting`:**
Flat losses near baseline look exactly like underfitting. But `optimizer_misconfiguration` (SGD, momentum=0.0) produces the same log pattern for a different reason. The rule requires checking config for SGD + momentum=0.0 before committing to "underfitting". This was the other medium-tier failure: the model saw flat losses and submitted "underfitting" without checking the optimizer config.

**"ABSOLUTE RULE" labels:**
`exploding_gradients` and `underfitting` are marked as absolute rules because they were the two cases where the model still incorrectly applied a different label after seeing unambiguous evidence. The "ABSOLUTE RULE" phrasing raises the salience of the constraint - it signals that this rule has no exceptions and no competing rule can override it.

**`overfitting` vs `missing_regularization`:**
Both show train-val divergence. The split is on exact config values: `weight_decay=0.0 exactly AND dropout=0.0 exactly` → missing regularization. Any non-zero value → overfitting. The words "exactly" and "ANY non-zero" are deliberate - the model previously classified "weight_decay=0.001" as "weight_decay≈0" and got the label wrong.

**`vanishing_gradients` - "Do NOT assume activation":**
Vanishing gradients are caused by saturating activations (sigmoid, tanh). Without this constraint, the model sometimes infers the activation from the gradient decay pattern rather than from the config data it actually received. The rule requires explicit evidence from the config.
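For illustration only, the precedence of these rules can be restated as code. The agent itself is an LLM following the prompt text, so this is not real system logic; the observation flags (`oscillating`, `flat_near_baseline`, `train_val_gap`) and config keys are invented stand-ins for the log patterns described above:

```python
import math

def pick_label(logs: dict, config: dict) -> str:
    """Sketch of the rule precedence: config-first checks override log-pattern priors."""
    losses = logs["train_loss"]
    nan_from_start = math.isnan(losses[0])
    nan_later = (not nan_from_start) and any(math.isnan(x) for x in losses[1:])
    if nan_from_start and config.get("weight_init_std", 1.0) > 10:
        return "bad weight initialization"   # check config FIRST
    if nan_later:
        return "exploding gradients"         # ABSOLUTE RULE
    if logs.get("oscillating"):
        # PRIORITY RULE: check batch_size before blaming the learning rate.
        if config.get("batch_size", 32) <= 4:
            return "batch size too small"
        return "learning rate too high"
    if logs.get("flat_near_baseline"):
        if config.get("optimizer") == "sgd" and config.get("momentum") == 0.0:
            return "optimizer misconfiguration"
        return "underfitting"                # ABSOLUTE RULE
    if logs.get("train_val_gap"):
        if config.get("weight_decay") == 0.0 and config.get("dropout") == 0.0:
            return "missing regularization"
        return "overfitting"
    return "underfitting"
```

The ordering of the `if` branches mirrors the ordering of the rules in the prompt: the config-dependent special cases sit above the generic pattern rules they override.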
---

## Section 5 - Null Data Rule

```
NULL DATA RULE:
- If Data shows {"gradient_norms": null}, gradient data was NOT collected for this run. This is normal for some scenarios - it is NOT a data pipeline error.
- "missing data", "missing gradients", "insufficient data" are NEVER valid diagnoses. NEVER submit these. Always diagnose the ML failure mode from what you have seen.
```

**What it does:**
Handles the case where gradient data is null (which happens in easy and medium scenarios, where gradients are present in the scenario dict but not required).

**Why it exists:**
Early testing showed the model sometimes submitting "missing gradients" or "insufficient data" as the diagnosis when it saw `{"gradient_norms": null}`. This is a valid observation (the data is null) but a catastrophically wrong response (it's not a diagnosis). The null data rule explicitly blocks this failure mode and redirects the model toward diagnosing the ML failure from whatever evidence it does have.

The phrase "NEVER valid diagnoses. NEVER submit these" uses double emphasis because this failure mode produced a score of 0.0 - no partial credit, no keywords matched.
---

## Section 6 - Stop Rules

```
STOP RULES - mandatory:
- "This source is not required for this failure mode." means STOP IMMEDIATELY. Submit your diagnosis on the very next action. Do NOT call any more inspect actions - not even one.
- "Relevant clue found" with no "Next required action" → all sources covered. Submit on the next action.
- CRITICAL: If Feedback contains "Next required action: inspect_X", you MUST call that action before submitting.
```

**What it does:**
Controls exactly when the agent submits its diagnosis - not too early and not too late.

**Why each rule is there:**

**"STOP IMMEDIATELY" rule:**
When the environment returns "This source is not required for this failure mode", it means the agent has wandered into an irrelevant source. Every additional inspection costs -0.02 (evidence score) and adds to the step count (efficiency score). The "not even one" phrasing prevents the model from making "just one more" check after being told to stop.

**"No Next required action" rule:**
The feedback field explicitly tells the agent when all required sources have been inspected. At that point, the agent has all the evidence it needs - inspecting further only wastes steps. This rule makes the trigger explicit: no "Next required action" in the feedback = submit now.

**"CRITICAL: If Feedback contains Next required action" rule:**
This is the counterbalancing rule. The previous two rules push the agent toward early submission. This one prevents premature submission when required sources are still outstanding. The CRITICAL label flags it as the most important stop rule - submitting too early on a hard scenario loses both efficiency points and evidence score.
---

## Section 7 - Rules (Label Vocabulary and Final Constraints)

```
RULES:
- submit_diagnosis MUST include all three fields: diagnosis, suggested_fix, reasoning.
- diagnosis is the short failure mode label - it is REQUIRED, never omit it.
- Use exact failure mode phrasing for diagnosis: "exploding gradients", "overfitting", "underfitting",
  "learning rate too high", "learning rate too low", "vanishing gradients",
  "dying relu", "missing regularization", "batch size too small",
  "optimizer misconfiguration", "bad weight initialization", "lr scheduler misconfiguration".
- Never inspect the same source twice.
```

**What it does:**
Enforces structural requirements on the submit action and provides the exact vocabulary for labels.

**Why each rule is there:**

**All three fields required:**
The fix scorer deducts -0.05 if `suggested_fix` is absent. The LLM judge returns `None` (skipping its contribution) if `reasoning` is absent. The `diagnosis` field is the grader's primary input. Omitting any of them costs points in measurable ways.

**Exact failure mode phrasing:**
The grader's exact keyword map contains specific phrases like `"exploding gradients"`, `"dying relu"`, `"lr scheduler misconfiguration"`. A diagnosis that says "exploding gradient" (singular) still gets partial credit from the category keywords, but misses the +0.40 exact match. Providing the canonical list gives the model the exact strings to use - no paraphrasing, no guessing.

**Never inspect the same source twice:**
Re-inspecting a source returns the same data and costs -0.05 (step reward). It also wastes a step toward the efficiency score. This rule is explicit because models occasionally re-inspect logs "to double-check" before submitting.
---

## User Prompt

Each step, a user message accompanies the system prompt. It has three parts:

```
Step {step}

Observation:
{obs_summary}

Recent history:
{history_block}

Before responding: read the Data above carefully. What specific numeric values do you see?
Quote at least one value from the Data in your reasoning before submitting a diagnosis.
Respond with a JSON action.
```

**`obs_summary`** - formatted string with `Task`, `Feedback`, and `Data` (the visible_data JSON). This is the agent's primary input.

**`history_block`** - the last 4 step summaries: `Step N: {action} → reward={R} | {feedback}\n  Data: {data}`. Four steps is enough to cover the full hard-tier trajectory (3 inspections + 1 submit) without exceeding the context budget.

**"Quote at least one value"** - repeated from the system prompt in the user turn. The repetition is intentional. Models attend more strongly to instructions that appear close to the end of the prompt. Placing the evidence-citation reminder in the user message, adjacent to the actual data, increases compliance.
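Assembling the user message from these three parts might look like the following sketch. The helper name and exact string handling are assumptions; the real template lives in `inference.py`:

```python
def build_user_message(step: int, obs_summary: str, history: list) -> str:
    """Assemble the per-step user message: step header, observation, last-4 history, reminder."""
    history_block = "\n".join(history[-4:]) if history else "(none)"
    return (
        f"Step {step}\n\n"
        f"Observation:\n{obs_summary}\n\n"
        f"Recent history:\n{history_block}\n\n"
        "Before responding: read the Data above carefully. "
        "What specific numeric values do you see?\n"
        "Quote at least one value from the Data in your reasoning "
        "before submitting a diagnosis.\n"
        "Respond with a JSON action."
    )
```

Note that the citation reminder is appended last, so it sits closest to the model's generation point.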
---

## Summary - What Each Section Guards Against

| Section | Failure it prevents |
|---|---|
| Role + Action Space | Hallucinated actions, wrong action names |
| Output Format | Markdown wrapping, parse errors, missing fields |
| Diagnosis Process | Inspecting in wrong order, not reading numbers, submitting without evidence |
| Label Decision Rules | Wrong label on ambiguous patterns (oscillation→lr vs batch, flat→underfitting vs optimizer) |
| Null Data Rule | Diagnosing "missing gradients" instead of the actual ML failure |
| Stop Rules | Inspecting too many sources (efficiency loss) or too few (evidence loss) |
| Rules | Wrong label phrasing, missing fix/reasoning, redundant inspections |
docs/grader_and_judge.md
ADDED
# Grader & LLM Judge - Internal Design

This document explains how WhyDidItFail scores an agent's performance after each episode.
Scoring has two layers: a **programmatic grader** that runs keyword matching, and an **LLM judge** that evaluates reasoning quality. The two scores are combined into a single final number.

---

## Overview

```
Agent submits diagnosis
        │
        ├──► Programmatic Grader (server/graders.py)
        │        six sub-scores → keyword_score [0.0-1.0]
        │
        └──► LLM Judge (server/llm_judge.py)
                 three criteria → judge_score [0.0-1.0]

Final Score = 0.85 × keyword_score + 0.15 × judge_score
```

The grader is fast and deterministic - it runs synchronously inside the environment's `step()` call. The judge is a separate LLM call that runs **after** the episode ends (the WebSocket is already closed by then), so it never adds latency to the agent's action loop.
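The combination step is a one-liner; the `None` branch below is an assumption based on the judge "skipping its contribution" when it cannot run:

```python
from typing import Optional

def final_score(keyword_score: float, judge_score: Optional[float]) -> float:
    """Blend grader and judge scores; fall back to the grader alone if the judge was skipped."""
    if judge_score is None:
        return keyword_score
    return 0.85 * keyword_score + 0.15 * judge_score
```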
---

## Part 1 - Programmatic Grader

**File:** `server/graders.py`
**Entry point:** `grade(diagnosis, suggested_fix, scenario, steps_taken, inspection_order)`

The grader produces six sub-scores and sums them. The result is clamped to `[0.0, 1.0]`.

```
Total = diagnosis_score + evidence_diagnosis_penalty
      + evidence_score + efficiency_score + fix_score + ordering_bonus
```
### 1.1 Diagnosis Score - up to +0.70

This is the biggest sub-score. It checks whether the agent named the correct failure mode.

Every scenario has a `correct_diagnosis` field (e.g. `"exploding_gradients"`). There are two keyword maps in the grader:

**Exact keywords** - strong signal, worth +0.40 each:
```
"exploding_gradients" → ["exploding gradients", "exploding"]
"overfitting" → ["overfitting", "overfit"]
"dying_relu" → ["dying relu", "dead relu"]
...
```

**Category keywords** - weaker signal, worth +0.10 each:
```
"exploding_gradients" → ["nan", "gradient", "overflow", "diverge"]
"overfitting" → ["generalization", "val loss", "memoriz"]
...
```

The agent's diagnosis string is lowercased and scanned for both sets. Matches accumulate. The result is capped at 0.70.

**Vague answer penalty:** If no exact keyword matched and the diagnosis is fewer than 3 words, -0.10 is applied. This discourages one-word guesses.

**Practical meaning:** A correct exact label ("exploding gradients" or "exploding") earns +0.40 right away. Hitting a few category keywords on top of that can bring it up toward the 0.70 cap. A wrong diagnosis with some related vocabulary still earns partial credit, but cannot exceed 0.70.
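A minimal sketch of this matching logic (the real maps in `server/graders.py` cover all twelve labels; only one entry per map is shown here):

```python
# One illustrative entry per map; keys are correct_diagnosis values.
EXACT_KEYWORDS = {"exploding_gradients": ["exploding gradients", "exploding"]}
CATEGORY_KEYWORDS = {"exploding_gradients": ["nan", "gradient", "overflow", "diverge"]}

def diagnosis_score(diagnosis: str, correct: str) -> float:
    """Accumulate +0.40 per exact match and +0.10 per category match, capped at 0.70."""
    text = diagnosis.lower()
    exact = sum(0.40 for kw in EXACT_KEYWORDS.get(correct, []) if kw in text)
    category = sum(0.10 for kw in CATEGORY_KEYWORDS.get(correct, []) if kw in text)
    score = exact + category
    if exact == 0.0 and len(text.split()) < 3:
        score -= 0.10   # vague-answer penalty for short, non-exact guesses
    return min(score, 0.70)
```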
---

### 1.2 Evidence-Diagnosis Penalty - 0.00 to -0.10

This penalty fires when the agent had access to the right data but still got the diagnosis wrong. It distinguishes between an agent that guessed blind and one that inspected evidence but reasoned poorly.

| Situation | Penalty |
|---|---|
| All required sources inspected, diagnosis wrong | -0.10 |
| Some required sources inspected, diagnosis wrong | -0.05 |
| No required sources inspected, diagnosis wrong | 0.00 |
| Diagnosis correct (any case) | 0.00 |

The evidence score (next section) already penalises skipping required sources. This penalty stacks on top when the agent collected the evidence but drew the wrong conclusion - a clear reasoning failure.
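The table above reduces to a few set operations (a sketch; the real signature in `server/graders.py` may differ):

```python
def evidence_diagnosis_penalty(diagnosis_correct: bool,
                               required: set, inspected: set) -> float:
    """Penalty applied only when the diagnosis is wrong despite available evidence."""
    if diagnosis_correct or not (required & inspected):
        return 0.0   # correct answer, or a blind guess: no extra penalty
    # All required sources seen -> clear reasoning failure; some seen -> milder.
    return -0.10 if required <= inspected else -0.05
```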
---

### 1.3 Evidence Score - up to +0.25 (floored at -0.15)

This sub-score measures whether the agent inspected the right data sources before submitting.

Each scenario defines which sources are required. Easy scenarios require only `logs`. Medium requires `logs + config`. Hard requires `logs + config + gradients`.

```
+0.08 per required source inspected
-0.10 per required source NOT inspected
-0.02 per irrelevant source inspected
```

The raw sum is clamped to `[-0.15, +0.25]`.

**Example - hard scenario, agent inspected logs + config but skipped gradients:**
- logs required and inspected: +0.08
- config required and inspected: +0.08
- gradients required but missing: -0.10
- Total: +0.06

**Example - easy scenario, agent inspected logs + gradients (gradients not required):**
- logs required and inspected: +0.08
- gradients not required: -0.02
- Total: +0.06

The -0.02 for irrelevant sources is intentionally mild - some exploration is acceptable. Missing a required source is much more costly.
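The arithmetic above, including the clamp, fits in a few lines (a sketch with assumed set-based inputs):

```python
def evidence_score(required: set, inspected: set) -> float:
    """+0.08 per required hit, -0.10 per required miss, -0.02 per irrelevant inspection."""
    raw = (0.08 * len(required & inspected)
           - 0.10 * len(required - inspected)
           - 0.02 * len(inspected - required))
    return max(-0.15, min(0.25, raw))   # clamp to [-0.15, +0.25]
```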
---

### 1.4 Efficiency Score - up to +0.15

This sub-score rewards acting without waste. The minimum number of steps for any episode is `len(required_sources) + 1` (inspect all required sources, then submit).

```
If steps_taken == min_steps: score = 0.15 (full reward)
If steps_taken > min_steps:  score = max(0.0, 0.15 - 0.02 × extra_steps^1.2)
If steps_taken < min_steps:  score = max(0.0, 0.15 - 0.05 × missing_steps)
```

The exponent `1.2` makes the penalty accelerate as the agent wastes more steps - each additional wasted step costs slightly more than the previous one.

Early submission (fewer steps than the minimum) is also penalised because the agent almost certainly skipped a required source - but that case is already caught harder by the evidence score.

**If `steps_taken > max_steps`** (which is `len(required_sources) × 3 + 2`), the grader immediately returns `0.0` and skips all sub-scores. This is the hard ceiling.
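The three branches can be sketched directly (the max-steps short-circuit happens earlier in the grader, so it is omitted here):

```python
def efficiency_score(steps_taken: int, num_required: int) -> float:
    """Full reward at the minimum step count; accelerating penalty for wasted steps."""
    min_steps = num_required + 1   # inspect each required source, then submit
    if steps_taken == min_steps:
        return 0.15
    if steps_taken > min_steps:
        extra = steps_taken - min_steps
        return max(0.0, 0.15 - 0.02 * extra ** 1.2)   # superlinear waste penalty
    missing = min_steps - steps_taken
    return max(0.0, 0.15 - 0.05 * missing)            # early-submission penalty
```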
---

### 1.5 Fix Score - up to +0.15 (or -0.05)

This sub-score checks whether the agent's suggested fix is actionable and correct.

Each scenario has a `correct_fix` string (e.g. `"enable gradient clipping (clip_grad_norm=1.0)"`). The grader tokenises that string, strips stop words (`to`, `a`, `the`, `and`, `or`, `use`, `set`, `by`), and keeps content words longer than 2 characters.

It then counts how many of those content words appear anywhere in the agent's `suggested_fix` string.

| Match ratio | Score |
|---|---|
| 100% of keywords matched | +0.15 |
| ≥ 60% matched | +0.10 |
| ≥ 30% matched | +0.05 |
| < 30% matched | 0.00 |
| No fix provided | -0.05 |

Omitting the fix always costs points. Even a partially correct fix earns more than silence.
|
| 146 |
+
|
| 147 |
+
---
|
| 148 |
+
|
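A minimal sketch of this keyword matcher, following the tokenisation rules described above (stop-word list, >2-character filter, substring matching). The function name and the exact punctuation handling are assumptions; the real implementation lives in `graders.py`.

```python
from typing import Optional

STOP_WORDS = {"to", "a", "the", "and", "or", "use", "set", "by"}

def fix_score(correct_fix: str, suggested_fix: Optional[str]) -> float:
    if not suggested_fix:
        return -0.05  # omitting the fix always costs points
    # Tokenise the reference fix, dropping stop words and short tokens.
    cleaned = correct_fix.lower().replace("(", " ").replace(")", " ")
    keywords = [w for w in cleaned.split() if w not in STOP_WORDS and len(w) > 2]
    # Count keywords appearing anywhere in the agent's fix.
    matched = sum(1 for w in keywords if w in suggested_fix.lower())
    ratio = matched / len(keywords) if keywords else 0.0
    if ratio == 1.0:
        return 0.15
    if ratio >= 0.6:
        return 0.10
    if ratio >= 0.3:
        return 0.05
    return 0.0
```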
### 1.6 Ordering Bonus – +0.05

A small bonus for inspecting required sources in the canonical order: `logs → config → gradients`.

The grader extracts the subsequence of inspected sources that are required (ignoring any irrelevant ones inspected in between), and checks whether that subsequence matches the canonical order for the scenario.

For an easy scenario (only `logs` required) the bonus is trivially earned. For hard scenarios (all three required) the agent must visit them in order.

This rewards structured investigation: the same order a human engineer would follow.

---
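The subsequence check can be sketched as follows. This is an illustration of the rule described above (first visits only, irrelevant sources ignored); the function name is an assumption.

```python
CANONICAL_ORDER = ["logs", "config", "gradients"]

def ordering_bonus(inspection_order: list[str], required_sources: list[str]) -> float:
    required = set(required_sources)
    # Subsequence of required sources, first visit only.
    seen: list[str] = []
    for source in inspection_order:
        if source in required and source not in seen:
            seen.append(source)
    # Canonical order restricted to this scenario's required sources.
    canonical = [s for s in CANONICAL_ORDER if s in required]
    return 0.05 if seen == canonical else 0.0
```

An agent that inspects `config` before `logs` on a hard scenario forfeits the bonus, even if it eventually visits everything.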
### 1.7 Maximum Achievable Scores

| Sub-score | Max | Min |
|---|---|---|
| Diagnosis | +0.70 | 0.00 |
| Evidence-Diagnosis Penalty | 0.00 | -0.10 |
| Evidence | +0.25 | -0.15 |
| Efficiency | +0.15 | 0.00 |
| Fix | +0.15 | -0.05 |
| Ordering Bonus | +0.05 | 0.00 |
| **Total (clamped)** | **1.00** | **0.00** |

The theoretical maximum without the fix is `0.70 + 0.25 + 0.15 + 0.05 = 1.15`, which is clamped to `1.00`. The fix and ordering bonus are therefore "free" once the other scores are high: they can push a near-perfect run to a clean `1.00`.

---
## Part 2 – Step-Level Rewards

The grader described above runs only at `submit_diagnosis`. But the environment also emits a small reward on every inspection step, giving the agent an in-episode signal.

These step rewards come from `_inspect_reward()` in the environment, not from `graders.py`.

| Situation | Step Reward |
|---|---|
| First required source discovered | +0.10 |
| Second required source discovered | +0.07 |
| Third required source discovered | +0.05 |
| Irrelevant source inspected | -0.03 |
| Re-inspecting a source already seen | -0.05 |

The decaying reward schedule (+0.10 → +0.07 → +0.05) reflects that each additional required source is slightly less surprising than the first. It also gives a larger signal for discovering the first clue, which is usually the most diagnostic piece of evidence.

These step rewards are reported in `[STEP]` lines during inference but they are **not** included in the final episode score: `grade()` computes the terminal score independently.

---
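The table above can be sketched as a stateless helper: the caller passes what has already been inspected. The signature is an assumption; the real `_inspect_reward()` in the environment may track this state internally.

```python
# Decaying schedule per newly discovered required source.
DISCOVERY_REWARDS = [0.10, 0.07, 0.05]

def inspect_reward(source: str, required: set[str], already_seen: list[str]) -> float:
    if source in already_seen:
        return -0.05  # re-inspecting a source already seen
    if source not in required:
        return -0.03  # irrelevant source
    # Reward decays with each required source already discovered.
    n_required_seen = sum(1 for s in already_seen if s in required)
    return DISCOVERY_REWARDS[min(n_required_seen, len(DISCOVERY_REWARDS) - 1)]
```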
## Part 3 – LLM Judge

**File:** `server/llm_judge.py`
**Entry point:** `judge(client, model, diagnosis, reasoning, suggested_fix, scenario, inspection_order)`

### 3.1 What It Evaluates

The programmatic grader is blind to *how* the agent arrived at its answer. An agent could get lucky with the right keyword and still have incoherent reasoning. The LLM judge evaluates the *quality of the agent's reasoning* across three criteria:

| Criterion | Question it asks | Max |
|---|---|---|
| `evidence_grounding` | Does the reasoning cite specific values from the data the agent actually saw? | 5 |
| `causal_chain` | Does it logically connect that evidence to the diagnosed failure mode? | 5 |
| `fix_rationale` | Is the fix directly justified by the evidence and diagnosis? | 5 |

Raw score range: 0-15. Normalised to 0.0-1.0 by dividing by 15.

---
### 3.2 When It Runs

The judge runs **after** the episode ends: specifically, after `submit_diagnosis` returns and the WebSocket connection is closed. This is intentional: the WebSocket session is single-use, so the judge call can't interfere with the agent's action loop.

```
Agent calls submit_diagnosis
  │
  ├─► environment returns final_reward, done=True
  │
  ├─► WebSocket closes
  │
  ├─► inference.py calls judge()
  │
  ├─► one LLM call → score
  │
  └─► Final Score = 0.85 × keyword + 0.15 × judge
```

---
### 3.3 What the Judge Sees

The judge prompt is built from:
1. The agent's `diagnosis`, `suggested_fix`, and `reasoning` strings
2. The **data the agent actually inspected**, reconstructed from `inspection_order` and the scenario's source data

The judge is given only what the agent had access to, not the full scenario. This prevents the judge from penalising an agent for not citing data it never saw.

```python
seen = {}
if "logs" in inspection_order:
    seen["training_logs"] = scenario["logs"]
if "config" in inspection_order:
    seen["config"] = scenario["config"]
if "gradients" in inspection_order:
    seen["gradient_norms"] = scenario["gradient_norms"]
```

---
### 3.4 The Judge Prompt

The judge receives a single-turn user message (no system prompt). It is asked to return a JSON object with integer scores:

```
{"evidence_grounding": <int>, "causal_chain": <int>, "fix_rationale": <int>}
```

Temperature is set to `0.0` for determinism. `max_tokens=64` is enough for the JSON response and prevents runaway output.

---
### 3.5 Fallback Behaviour

If the agent omits the `reasoning` field, the judge returns `None` immediately (no API call made). The caller treats `None` as "judge unavailable" and uses the keyword score at full weight:

```
judge_score is None  → Final Score = 1.0 × keyword_score
judge_score is float → Final Score = 0.85 × keyword_score + 0.15 × judge_score
```

If the LLM call itself fails (network error, parse error, etc.), the judge also returns `None` and logs the exception to stderr. The episode score is never blocked waiting on the judge.

---
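The fallback logic and the weighting can be sketched together. The function name is illustrative; the 0.85/0.15 weights and the `None` handling follow the rules above.

```python
from typing import Optional

def final_score(keyword_score: float, judge_score: Optional[float]) -> float:
    if judge_score is None:
        # Judge unavailable: keyword score at full weight.
        return min(max(keyword_score, 0.0), 1.0)
    combined = 0.85 * keyword_score + 0.15 * judge_score
    return min(max(combined, 0.0), 1.0)  # clamp is a safeguard
```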
### 3.6 Weight Rationale

The 85/15 split keeps the judge in a supporting role. The programmatic grader is:
- **Fast**: no extra API call during the episode
- **Deterministic**: the same input always produces the same score
- **Directly tied to the correct answer**: keyword matching is unambiguous

The judge adds 15% for reasoning quality. This is enough to meaningfully separate agents that cite evidence from those that guess correctly without explaining why, but not enough to override a solid keyword score with a harsh reasoning judgment.

---
## Part 4 – Combined Final Score

```
Final Score = clamp(0.85 × keyword_score + 0.15 × judge_score, 0.0, 1.0)
```

Both the keyword score and the judge score are already in `[0.0, 1.0]`, so the final score is also in that range without needing clamping in practice. The clamp is a safeguard.

**Example – perfect run:**
- keyword_score = 1.00 (correct label, all sources, minimum steps, full fix, correct order)
- judge_score = 1.00 (cites specific numbers, clean causal chain, fix matches evidence)
- Final = 0.85 × 1.00 + 0.15 × 1.00 = **1.00**

**Example – correct label, poor reasoning:**
- keyword_score = 0.90 (correct label, good evidence, slightly wasteful)
- judge_score = 0.40 (vague reasoning, no specific numbers cited)
- Final = 0.85 × 0.90 + 0.15 × 0.40 = 0.765 + 0.060 = **0.825**

**Example – reasoning provided but LLM call fails:**
- keyword_score = 0.90
- judge_score = None
- Final = 1.00 × 0.90 = **0.90**

---
## Summary

| Component | File | Weight | What it measures |
|---|---|---|---|
| Diagnosis Score | graders.py | Up to 0.70 of keyword_score | Correct failure mode label |
| Evidence-Diagnosis Penalty | graders.py | Up to -0.10 of keyword_score | Reasoning failure despite having evidence |
| Evidence Score | graders.py | Up to 0.25 of keyword_score | Right data sources inspected |
| Efficiency Score | graders.py | Up to 0.15 of keyword_score | Steps taken vs minimum |
| Fix Score | graders.py | Up to 0.15 of keyword_score | Actionable and correct fix |
| Ordering Bonus | graders.py | +0.05 of keyword_score | Canonical inspection order |
| LLM Judge | llm_judge.py | 15% of final score | Reasoning quality (evidence citation, causal logic, fix rationale) |
docs/scenarios.md
ADDED
|
@@ -0,0 +1,340 @@
# Scenarios – Full Reference

WhyDidItFail has 12 scenarios across three difficulty tiers. Each one plants a specific, realistic ML failure mode into a synthetic training run. The agent must read the available evidence and name the failure.

This document explains every scenario in detail: what the data looks like, why the failure happens, what the agent needs to see, and what a correct diagnosis and fix look like.

---
## How Scenarios Work

Every scenario is a Python dictionary with the same structure:

| Field | What it is |
|---|---|
| `difficulty` | `easy`, `medium`, or `hard` |
| `required_sources` | Which data sources the agent must inspect before submitting |
| `logs` | Per-epoch training and validation metrics |
| `config` | Hyperparameter configuration |
| `gradient_norms` | Per-layer or per-epoch gradient magnitudes |
| `correct_diagnosis` | The exact internal label the grader checks against |
| `correct_fix` | The expected remediation (used by the fix scorer) |

The `required_sources` field is the contract between the scenario and the grader. Inspecting a required source earns the agent reward. Submitting without inspecting one costs points. Inspecting a source not in the list is penalised mildly.

---
## Easy Tier – Logs Only

Easy scenarios are diagnosable from the training logs alone. The config and gradients exist in the data but are **not required**: inspecting them costs the agent -0.02 each.

The four easy scenarios exercise four distinct log patterns: catastrophic divergence (NaN), chaotic oscillation, train-val split, and flatness.

---
### Exploding Gradients

**Correct diagnosis:** `exploding_gradients`
**Required sources:** logs
**Correct fix:** `enable gradient clipping (clip_grad_norm=1.0)`

**What happens:**
The model starts normally at epoch 1 (`train_loss=2.31`), then catastrophically diverges. By epoch 2, loss has jumped to `847.2`. By epoch 3, both train and validation loss are `NaN`. The gradient norms tell the same story: `0.43` at epoch 1, `6821.4` at epoch 2, then `NaN`.

The config shows a reasonable learning rate (`lr=0.001`, Adam optimizer) with `clip_grad=False`. The learning rate is not the culprit: the problem is that gradients are not being clipped and grow uncontrollably during backpropagation.

**What the agent needs to see:**
The NaN in the logs is the definitive signal. Loss going from 2.31 → 847.2 → NaN in three epochs is an unambiguous divergence pattern. No config or gradient inspection is required to make this call.

**Why it's easy:**
NaN loss is one of the most recognisable failure patterns in ML. There's no ambiguity and no competing hypothesis: any agent that reads the logs and sees NaN should label this correctly.

**Disambiguation note:**
The config shows `lr=0.001` (not high). This rules out "learning rate too high". Once training has already produced a finite loss at epoch 1, the failure at epochs 2-3 is exploding gradients, not bad weight initialization (which would be NaN from epoch 1).

---
### Learning Rate Too High

**Correct diagnosis:** `learning_rate_too_high`
**Required sources:** logs
**Correct fix:** `reduce learning_rate to 0.01`

**What happens:**
The loss oscillates wildly with no convergence trend. It goes `2.30 → 15.21 → 0.82 → 9.73 → 3.15` across five epochs. There's no NaN, so the model is technically alive, but it can't settle. Accuracy fluctuates similarly: `0.11 → 0.47 → 0.31 → 0.44 → 0.29`. The gradient norms swing between 0.19 and 11.2.

The config shows `lr=1.0` with SGD. A learning rate of 1.0 is far too large for SGD on CIFAR-10 with a VGG16. Each gradient step overshoots the loss minimum, sending the model into a different part of the landscape every epoch.

**What the agent needs to see:**
The oscillation pattern is visible from the logs alone. The lr (`"lr": 1.0`) also appears inline with each epoch's log line, so a sharp agent can confirm the culprit without even inspecting the config separately.

**Why it's easy:**
Wildly oscillating loss with no NaN is a classic high-LR symptom. The `lr=1.0` value in the log lines is a strong additional signal.

**Disambiguation note:**
This scenario uses `batch_size=64`, which rules out `batch_size_too_small` (which has `batch_size=2`). The oscillation here is caused by the step size, not by gradient variance from small batches.

---
### Overfitting

**Correct diagnosis:** `overfitting`
**Required sources:** logs
**Correct fix:** `increase dropout to 0.3 and weight_decay to 0.01`

**What happens:**
Train loss falls steadily while validation loss climbs: `train_loss` goes from `2.10 → 0.06` by epoch 20, while `val_loss` rises from `2.16 → 2.74`. Train accuracy reaches `0.99`; val accuracy falls from `0.55` to `0.49`. The model has memorised the training data and fails to generalise.

The config shows `weight_decay=0.001` and `dropout=0.1`: regularisation is present but insufficient for a ResNet50 on CIFAR-10. The model is powerful enough to overfit even with light regularisation.

**What the agent needs to see:**
The train-val divergence pattern in the logs is the definitive signal. `train_acc=0.99` and `val_acc=0.49` by epoch 20 is a textbook generalisation gap.

**Why it's easy:**
Train-val divergence is one of the most studied phenomena in ML. The logs make it immediately visible.

**Disambiguation note:**
The config shows **non-zero** `weight_decay=0.001` and `dropout=0.1`. This distinguishes it from `missing_regularization`, which has `weight_decay=0.0` and `dropout=0.0`. An agent that reads the log pattern correctly and then inspects the config will see regularisation present; the correct label remains "overfitting" because the regularisation is insufficient, not absent.

---
### Underfitting

**Correct diagnosis:** `underfitting`
**Required sources:** logs
**Correct fix:** `increase model capacity or use a deeper architecture`

**What happens:**
Both train and validation loss stay high with almost no improvement over 20 epochs. `train_loss` goes from `2.29 → 2.21`. `train_acc` barely moves: `0.11 → 0.14`. Val metrics track train metrics almost exactly; there is no train-val gap. The model is stuck near the random baseline (~10% on 10-class CIFAR-10).

The config reveals the cause: `architecture=LinearClassifier`. A linear model simply doesn't have the representational capacity to learn CIFAR-10. It's the wrong tool for the job.

**What the agent needs to see:**
Both losses are high and similar, with no generalisation gap. This is the defining characteristic of underfitting. The architecture field in the config confirms it, but the log pattern alone is sufficient.

**Why it's easy:**
Flat losses near the random baseline are unambiguous. There's no NaN, no oscillation, no divergence, just failure to learn.

**Disambiguation note:**
The config shows `optimizer=adam`, so `momentum` is not applicable. This rules out `optimizer_misconfiguration` (which requires SGD with momentum=0.0). A careful agent also notes that `weight_decay=0.0`, but that's irrelevant here: both losses are high and similar, which points to a capacity problem, not regularisation.

---
## Medium Tier – Logs + Config

Medium scenarios cannot be diagnosed from logs alone. The log pattern is ambiguous: it looks like something, but the config is needed to confirm or redirect the diagnosis. Both sources are required.

---
### Learning Rate Too Low

**Correct diagnosis:** `learning_rate_too_low`
**Required sources:** logs, config
**Correct fix:** `increase learning_rate to 0.001`

**What happens:**
Loss is decreasing, but imperceptibly. Over 20 epochs: `2.302 → 2.298 → 2.290 → 2.275`. The model is technically learning (loss goes down) but at a rate that would take thousands of epochs to reach convergence. Gradient norms are tiny (`0.0031 → 0.0021`) but non-zero.

The config reveals the cause: `lr=0.000001` (1e-6). Adam is being used, but with a learning rate 1000× smaller than typical. Each update moves the weights by almost nothing.

**What the agent needs to see:**
The logs alone show slow convergence, but "slow convergence" could also be underfitting (wrong architecture) or a hard dataset. The config's `lr=0.000001` is the confirmation: this tiny value is the explicit reason for the glacial pace.

**Why it requires both sources:**
From the logs alone, the agent sees that loss is decreasing but slowly. That's ambiguous. Only when the config shows `lr=0.000001` does the diagnosis become unambiguous: the optimizer is working correctly, just with too small a step size.

**Disambiguation note:**
Gradient norms are small but non-zero (in the 0.003 range). This rules out vanishing gradients, where norms would be exponentially decaying toward zero across layers. The norms are uniformly small here because the LR makes every update tiny, not because gradients are being crushed during backprop.

---
### Missing Regularization

**Correct diagnosis:** `missing_regularization`
**Required sources:** logs, config
**Correct fix:** `add weight_decay=0.01 and dropout=0.3`

**What happens:**
Train loss drops fast and val loss rises: `train_loss=0.01`, `val_loss=2.10` by epoch 30. `train_acc=1.00`, `val_acc=0.56`. The log pattern looks exactly like overfitting.

But the config shows `weight_decay=0.0` and `dropout=0.0`. There is no regularisation at all. A ResNet101 on CIFAR-10 with zero regularisation will memorise the training set completely, not because the regularisation is insufficient, but because it was never added.

**What the agent needs to see:**
The log pattern alone is indistinguishable from `overfitting`. Only the config confirms the distinction: `overfitting` has `weight_decay=0.001, dropout=0.1` (regularisation present but insufficient), while `missing_regularization` has both at exactly `0.0`.

**Why it requires both sources:**
Logs → "train-val divergence, probable overfitting". Config → "weight_decay=0.0 AND dropout=0.0" → "missing regularization, not overfitting". The config is the discriminating evidence.

**Disambiguation note:**
The label distinction matters for the fix. "Overfitting" → increase existing regularisation. "Missing regularization" → add regularisation from scratch. An agent that labels this "overfitting" has diagnosed the symptom correctly but missed the root cause.

---
### Batch Size Too Small

**Correct diagnosis:** `batch_size_too_small`
**Required sources:** logs, config
**Correct fix:** `increase batch_size to at least 32`

**What happens:**
Loss oscillates wildly across 8 epochs: `4.12 → 3.87 → 4.45 → 3.21 → 4.78 → 3.44 → 4.93 → 3.67`. There's a downward trend on average, but every epoch spikes or drops dramatically. Gradient norms alternate: `1.21 → 0.38 → 2.04 → 0.29 → 1.87 → 0.41 → 2.31 → 0.55`.

The config reveals the cause: `batch_size=2`. With only 2 samples per gradient update, each update is a noisy estimate of the true gradient. The model steps in a random direction every epoch, sometimes helpful, sometimes not.

**What the agent needs to see:**
The oscillation pattern in the logs looks like "learning rate too high", and the agent might guess that first. The config's `batch_size=2` is the discriminating signal. The learning rate is reasonable (`lr=0.001` with SGD + momentum=0.9).

**Why it requires both sources:**
Oscillating loss is caused by both high LR and tiny batch sizes. The logs alone cannot disambiguate. Only after seeing `batch_size=2` in the config can the agent correctly attribute the noise to gradient variance rather than step size.

**Disambiguation note:**
The learning rate (`lr=0.001`) and optimizer (`sgd`, `momentum=0.9`) are standard; this is not a high-LR situation. The batch size is the anomaly. An agent that stops at the logs and labels this "learning rate too high" has attached the symptom to the wrong cause.

---
### Optimizer Misconfiguration

**Correct diagnosis:** `optimizer_misconfiguration`
**Required sources:** logs, config
**Correct fix:** `set momentum=0.9 for SGD optimizer`

**What happens:**
Both train and validation loss decline extremely slowly and plateau: `2.30 → 2.25 → 2.25 → 2.23 → 2.22` over 20 epochs. The losses are similar (no train-val gap), and accuracy never improves meaningfully. Gradient norms are normal (`0.42 → 0.36`), ruling out gradient-level pathologies.

The config shows the cause: `optimizer=sgd, momentum=0.0`. SGD with no momentum has no gradient averaging. On a complex loss landscape with saddle points and flat regions, momentum is what allows SGD to keep moving. Without it, the optimizer stalls on flat plateaus and saddle points.

**What the agent needs to see:**
The logs show flat losses similar to underfitting. The config reveals `momentum=0.0` as the specific misconfiguration: this is not an architecture capacity problem, it's a missing optimizer hyperparameter.

**Why it requires both sources:**
From logs alone, flat losses near baseline look identical to underfitting (wrong architecture). Only the config, specifically `optimizer=sgd, momentum=0.0`, reveals the true cause.

**Disambiguation note:**
The architecture is `ResNet18`, not a capacity-limited model like the `LinearClassifier` in the underfitting scenario. The learning rate (`lr=0.01`) is also reasonable. The only anomaly is `momentum=0.0`. An agent that sees flat losses and labels this "underfitting" has missed the config evidence.

---
## Hard Tier – Logs + Config + Gradients

Hard scenarios require all three data sources. The logs alone are ambiguous, the config narrows it down, and the gradient norms provide the final confirming evidence.

---
### Vanishing Gradients

**Correct diagnosis:** `vanishing_gradients`
**Required sources:** logs, config, gradients
**Correct fix:** `switch activation to relu and add batch normalization`

**What happens:**
Training on MNIST with a 20-layer MLP using sigmoid activations. Loss barely moves over 20 epochs: `2.30 → 2.27`, with accuracy stuck at `0.11-0.12`. The logs suggest the model is barely learning.

The config reveals the activation: `activation=sigmoid`. The gradients tell the full story: the output layer has norm `0.21`, but by layer 1 (the input end) the norm has decayed to `0.00000001`, eight orders of magnitude smaller. The gradient is effectively zero at the input layers, so those weights never update.

**What the agent needs to see:**
Logs → slow/stuck learning (ambiguous). Config → sigmoid activation in a deep network (suspicious). Gradients → exponential decay from output to input (`0.21 → 0.0031 → 0.000042 → 0.0000003 → 0.00000001`) confirms vanishing gradients. All three are required.

**The mechanism:**
Sigmoid squashes its output to (0, 1) and its gradient is at most 0.25. In a 20-layer network, the gradient is multiplied by this factor at every layer during backpropagation: `0.25^20 ≈ 10^-12`, essentially zero by the time it reaches the input.

**Disambiguation note:**
The gradient decay is exponential across layers, and each layer's norm is non-zero (unlike dying ReLU, where norms are exactly 0.0). The config explicitly states `activation=sigmoid`, the known culprit for this failure mode.

---
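The decay arithmetic above is easy to check directly: a gradient multiplied by the sigmoid's maximum derivative (0.25) at each of 20 layers shrinks by roughly twelve orders of magnitude.

```python
# Worst-case gradient attenuation through 20 sigmoid layers.
decay = 0.25 ** 20
print(f"{decay:.2e}")  # on the order of 1e-12
```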
### Dying ReLU

**Correct diagnosis:** `dying_relu`
**Required sources:** logs, config, gradients
**Correct fix:** `reduce learning_rate to 0.001 or switch to leaky_relu activation`

**What happens:**
Training starts at `train_loss=2.31` in epoch 1, then partially improves to `1.95` by epoch 2. But then it freezes: epochs 2, 3, and 5 all show identical numbers (`train_loss=1.95, val_loss=2.01, train_acc=0.28`). The model has stopped learning completely.

The config shows `lr=0.1` (high for SGD) and `activation=relu`. The gradients explain the freeze: the output layer has norm `0.15`, but every hidden layer (`layer_8`, `layer_6`, `layer_4`, `layer_2`) shows a gradient norm of exactly `0.0`. Not small, but exactly zero.

**What the agent needs to see:**
Logs → partial improvement, then a sudden freeze (an unusual pattern). Config → high LR with ReLU (the combination that causes dying ReLU). Gradients → exact zeros in all hidden layers confirm dead neurons.

**The mechanism:**
ReLU neurons output 0 when their input is negative. With a high learning rate, a large gradient update can push a neuron's weights so negative that its input is always negative, making it always output 0. Once dead, the gradient through a dead ReLU is also 0, so the neuron can never recover. At `lr=0.1`, many neurons die in the first few updates, causing the training loss to freeze.

**Disambiguation note:**
The key signal is gradient norms of exactly `0.0` in hidden layers. Vanishing gradients produce tiny but nonzero norms (e.g. `1e-8`). Dying ReLU produces norms that are precisely zero because the ReLU derivative is a hard 0 for dead neurons, not an approximation.

---
| 266 |
+
|
| 267 |
+
### Bad Weight Initialization
|
| 268 |
+
|
| 269 |
+
**Correct diagnosis:** `bad_weight_initialization`
|
| 270 |
+
**Required sources:** logs, config, gradients
|
| 271 |
+
**Correct fix:** `use kaiming or xavier weight initialization`

**What happens:**
Loss is `NaN` from epoch 1. There is no "first good epoch": the model fails immediately. The gradient norms are astronomically large: `98432.1`, `74219.8`, `55103.4` across the first three layers. These values are orders of magnitude higher than in any normal run.

The config shows `weight_init=normal_std_100`. Initialising weights from a normal distribution with standard deviation 100 makes the initial weights enormous. The first forward pass immediately overflows: large weights → enormous pre-activations → NaN in the loss.
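To see the scale involved, here is a small back-of-the-envelope simulation (pure Python; the fan-in and sample counts are assumptions for illustration, not values from the harness). It compares the pre-activation of one linear unit under `std=100` against a Kaiming-style `std = sqrt(2 / fan_in)`:

```python
import math
import random

random.seed(0)
fan_in = 512
x = [random.gauss(0, 1) for _ in range(fan_in)]  # one standard-normal input vector

def preactivation(std):
    """Pre-activation of a single linear unit with weights ~ N(0, std^2)."""
    w = [random.gauss(0, std) for _ in range(fan_in)]
    return sum(wi * xi for wi, xi in zip(w, x))

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

bad = rms([preactivation(100.0) for _ in range(50)])                   # std=100
good = rms([preactivation(math.sqrt(2 / fan_in)) for _ in range(50)])  # Kaiming-style
print(bad / good > 100)  # std=100 pre-activations are orders of magnitude larger
```

Each successive layer multiplies this blow-up again, so after a few layers the values overflow float range and the loss comes out NaN on the very first forward pass.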

**What the agent needs to see:**
Logs → NaN from epoch 1 (ambiguous on its own: NaN could also suggest exploding gradients). Config → `weight_init=normal_std_100` (the explicit anomaly). Gradients → norms in the tens of thousands, confirming the extreme magnitude.

**Why it's different from exploding gradients:**
`exploding_gradients` shows a finite loss at epoch 1 that then diverges. `bad_weight_initialization` is NaN from the very first epoch, before training has meaningfully begun. The config's `weight_init=normal_std_100` is the discriminating evidence. The correct rule: if NaN starts at epoch 1 AND the config shows an extreme weight_init → bad weight initialization. If NaN starts after epoch 1 → exploding gradients.

**Disambiguation note:**
An agent that labels this "exploding gradients" based on the NaN alone has not used the config or gradient evidence. The gradient norms (`98432`) and weight init (`std=100`) are specific, confirming evidence that changes the label.
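The NaN-timing rule is simple enough to write down directly. A sketch (the function name and the `>= 10.0` "extreme std" cutoff are illustrative assumptions, not part of the harness):

```python
# Illustrative encoding of the NaN-timing decision rule described above.
def diagnose_nan_run(first_nan_epoch, weight_init_std):
    if first_nan_epoch == 1 and weight_init_std >= 10.0:  # extreme init, e.g. normal_std_100
        return "bad_weight_initialization"
    if first_nan_epoch > 1:
        return "exploding_gradients"
    return "needs_more_evidence"

print(diagnose_nan_run(first_nan_epoch=1, weight_init_std=100.0))  # bad_weight_initialization
print(diagnose_nan_run(first_nan_epoch=4, weight_init_std=0.02))   # exploding_gradients
```

The third branch matters: NaN at epoch 1 with a normal-looking init is exactly the ambiguous case where the agent should keep gathering evidence rather than guess.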

---

### LR Scheduler Misconfiguration

**Correct diagnosis:** `lr_scheduler_misconfiguration`
**Required sources:** logs, config, gradients
**Correct fix:** `set gamma to 0.1 so the scheduler decreases lr instead of increasing it`

**What happens:**
Training progresses normally for the first few epochs, then suddenly explodes at epoch 6: `train_loss` jumps from `1.20 → 9.87`. The model recovers and trains again, then explodes again at epoch 11: `train_loss=87.3`. The log shows the lr inline: `lr=0.001` at epoch 5, `lr=0.01` at epoch 6, `lr=0.1` at epoch 11; the learning rate is growing 10× every 5 epochs. Gradient norms follow: `0.42` at epoch 5, `18.73` at epoch 6, `156.2` at epoch 11.

The config reveals the cause: `lr_scheduler=StepLR, step_size=5, gamma=10.0`. StepLR multiplies the learning rate by `gamma` every `step_size` epochs. With `gamma=10.0` (a value greater than 1.0), the scheduler is **increasing** the learning rate at each step instead of decreasing it. This is a configuration error: the intended behaviour is `gamma < 1.0` (e.g. 0.1), which decays the learning rate over time.

**What the agent needs to see:**
Logs → periodic spikes at epochs 6 and 11, correlated with lr jumps. Config → `StepLR, gamma=10.0` (the `> 1.0` gamma is the bug). Gradients → norm spikes at the same intervals, confirming the lr jump as the trigger.

**The mechanism:**
StepLR is a common scheduler used to reduce the learning rate as training progresses (e.g. every 5 epochs, multiply by 0.1). Setting `gamma=10.0` inverts this: every 5 epochs the lr grows by 10×. By epoch 11 the lr has gone from 0.001 to 0.1, which at that point is high enough to cause divergence.

**Disambiguation note:**
The loss spikes are **periodic and predictable**, arriving exactly every 5 epochs. This is the distinguishing characteristic of a scheduler bug. Random exploding gradients don't follow a fixed interval. An agent that sees the spike at epoch 6 and doesn't look for another one at epoch 11 may miss the periodic pattern.
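StepLR's schedule is simple enough to replay by hand. A sketch of the buggy versus intended schedule, using the values from this scenario (the standalone `steplr` function is a simplification of the real scheduler, which mutates optimizer state):

```python
# StepLR closed form: lr(epoch) = base_lr * gamma ** (epoch // step_size)
def steplr(base_lr, step_size, gamma, epoch):
    return base_lr * gamma ** (epoch // step_size)

buggy = [round(steplr(0.001, 5, 10.0, e), 6) for e in (1, 6, 11)]
fixed = [round(steplr(0.001, 5, 0.1, e), 8) for e in (1, 6, 11)]
print(buggy)  # [0.001, 0.01, 0.1] -- lr grows 10x at each step boundary
print(fixed)  # lr decays at each step boundary instead
```

The buggy sequence reproduces exactly the inline lr values the agent sees in the logs at epochs 5/6/11, which is the correlation it needs to make.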

---

## Scenario Matrix

| Scenario | Tier | Required Sources | Key Signal | Discriminator |
|---|---|---|---|---|
| exploding_gradients | Easy | logs | loss → NaN after epoch 1 | NaN after finite start |
| learning_rate_too_high | Easy | logs | loss oscillates wildly, stays finite | lr=1.0 visible in logs |
| overfitting | Easy | logs | train-val gap widens | val_loss rising while train_loss falls |
| underfitting | Easy | logs | both losses flat near baseline | no train-val gap |
| learning_rate_too_low | Medium | logs + config | imperceptibly slow loss decrease | config: lr=1e-6 |
| missing_regularization | Medium | logs + config | train-val divergence (same as overfitting) | config: weight_decay=0.0 AND dropout=0.0 |
| batch_size_too_small | Medium | logs + config | oscillating loss | config: batch_size=2 |
| optimizer_misconfiguration | Medium | logs + config | flat losses (same as underfitting) | config: sgd, momentum=0.0 |
| vanishing_gradients | Hard | logs + config + gradients | slow learning | gradients decay exponentially by layer |
| dying_relu | Hard | logs + config + gradients | partial improvement then frozen | gradients exactly 0.0 in hidden layers |
| bad_weight_initialization | Hard | logs + config + gradients | NaN from epoch 1 | config: weight_init=normal_std_100 |
| lr_scheduler_misconfiguration | Hard | logs + config + gradients | periodic loss spikes | config: gamma=10.0 (> 1.0) |

---

## Design Principles

**Paired ambiguity:** Several scenarios are deliberately paired to test fine-grained discrimination:
- `overfitting` vs `missing_regularization`: same log pattern, different config
- `underfitting` vs `optimizer_misconfiguration`: same log pattern, different config
- `exploding_gradients` vs `bad_weight_initialization`: both NaN, different timing
- `vanishing_gradients` vs `dying_relu`: both near-zero gradients, different kind (exponential decay vs exact zero)

**Progressive evidence:** Each tier adds one more required source. Easy = logs only. Medium = logs must be correlated with the config. Hard = all three sources must align. An agent that inspects only the logs on a hard scenario is left with ambiguous evidence and is likely to mis-diagnose.
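That evidence requirement can be captured as a small completeness check. A sketch, where the tier-to-sources mapping and the function name are assumptions for illustration rather than harness code:

```python
# Assumed tier -> required evidence sources, per the tiers described above.
TIER_SOURCES = {
    "easy": {"logs"},
    "medium": {"logs", "config"},
    "hard": {"logs", "config", "gradients"},
}

def evidence_complete(tier, inspected):
    """True if the agent has inspected every source the tier requires."""
    return TIER_SOURCES[tier] <= set(inspected)

print(evidence_complete("hard", ["logs"]))                           # False
print(evidence_complete("hard", ["logs", "config", "gradients"]))    # True
```

A grader could use a check like this to flag submissions made before the required sources were consulted.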

**Realistic data:** The numeric values (loss curves, gradient magnitudes, learning rates) are calibrated to match what a practitioner would actually see in a real training run, not toy examples.