Torchflow1/Multi-Agent-Incident-Command-Center

SwapnilPatil28 committed 26 days ago

Commit 540b82c (verified) · 1 parent: 3c61da6

Upgrade 1 - Dashboard Update and new Incidents

Files changed (6)

README.md +3 -3
docs/BLOG_POST.md +2 -2
docs/SUBMISSION_CHECKLIST.md +1 -1
server/app.py +328 -1
server/domain/incidents.py +876 -2
server/llm_remote.py +193 -0

README.md CHANGED

@@ -43,7 +43,7 @@ A **virtual war room** where three specialist agents resolve a live queue of rea
 | 🧪 **Investigator** | Apply a fix · roll back a deploy | Escalate or file a post-mortem |
 | 👷 **Ops Manager** | Escalate · file post-mortem · **close the ticket** | Apply a code fix |
-**13 real incidents** · **3 difficulty tiers** (easy / medium / hard) · **14+ named reward signals** · **customer-tier weighting** (enterprise outages cost ~3× a free-tier outage)
 > Wrong actor → **−0.08**. Wrong root-cause on an enterprise ticket → **−1.98**. Correct closure on an enterprise ticket → **+1.44**. The rules matter — and every step tells you *why* it was scored.
@@ -661,7 +661,7 @@ Two scripts judges (or you) can run without a local IDE:
 │   ├── Dockerfile                     # Production image (HEALTHCHECK included)
 │   └── domain/
 │       ├── __init__.py
-│       ├── incidents.py               # 13 enterprise incident templates + factory
 │       ├── reward.py                  # Composable rubric engine (20+ components)
 │       ├── roles.py                   # Role-based permission policy
 │       └── rng.py                     # Deterministic per-episode RNG
@@ -697,7 +697,7 @@ ENV_LOG_LEVEL: "INFO"
 Full checklist with pre-submission smoke tests → [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md).
 - [x] **OpenEnv latest runtime** and `openenv validate` passing — [Space live](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)
-- [x] **Multi-agent, long-horizon environment** with role-gated action space (3 roles × 9 actions, 13 incidents)
 - [x] **Composable, transparent, anti-gaming reward rubric** (14+ named components, tier-scaled)
 - [x] **Business-impact-aware scoring** (customer tier, revenue impact, SLA countdown)
 - [x] **End-to-end TRL SFT pipeline** that saves a checkpoint and re-evaluates it in the environment ([`train_trl.py`](./train_trl.py))

 | 🧪 **Investigator** | Apply a fix · roll back a deploy | Escalate or file a post-mortem |
 | 👷 **Ops Manager** | Escalate · file post-mortem · **close the ticket** | Apply a code fix |
+**30 unique incident templates** · **3 difficulty tiers** (8 easy / 11 medium / 11 hard) · **14+ named reward signals** · **customer-tier weighting** (enterprise outages cost ~3× a free-tier outage)
 > Wrong actor → **−0.08**. Wrong root-cause on an enterprise ticket → **−1.98**. Correct closure on an enterprise ticket → **+1.44**. The rules matter — and every step tells you *why* it was scored.
 │   ├── Dockerfile                     # Production image (HEALTHCHECK included)
 │   └── domain/
 │       ├── __init__.py
+│       ├── incidents.py               # 30 enterprise incident templates + factory
 │       ├── reward.py                  # Composable rubric engine (20+ components)
 │       ├── roles.py                   # Role-based permission policy
 │       └── rng.py                     # Deterministic per-episode RNG
 Full checklist with pre-submission smoke tests → [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md).
 - [x] **OpenEnv latest runtime** and `openenv validate` passing — [Space live](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)
+- [x] **Multi-agent, long-horizon environment** with role-gated action space (3 roles × 9 actions, **30 unique incident templates**)
 - [x] **Composable, transparent, anti-gaming reward rubric** (14+ named components, tier-scaled)
 - [x] **Business-impact-aware scoring** (customer tier, revenue impact, SLA countdown)
 - [x] **End-to-end TRL SFT pipeline** that saves a checkpoint and re-evaluates it in the environment ([`train_trl.py`](./train_trl.py))
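
The role table in this README hunk implies a permission check in front of every action. A minimal sketch of such a gate follows. It assumes hypothetical action names (`apply_fix`, `rollback_deploy`, `escalate`, `file_postmortem`, `close_ticket`) and a hypothetical helper `gate_action`. Only the three actor names, `inspect_logs`, and the −0.08 wrong-actor penalty appear in this commit. The real policy lives in `server/domain/roles.py`, which this diff does not show.

```python
# Illustrative only: this permission map is an assumption, not the
# contents of server/domain/roles.py.
ALLOWED_ACTIONS = {
    "triage_agent": {"inspect_logs"},
    "investigator_agent": {"apply_fix", "rollback_deploy"},
    "ops_manager_agent": {"escalate", "file_postmortem", "close_ticket"},
}

# Penalty quoted in the README for an action taken by the wrong actor.
WRONG_ACTOR_PENALTY = -0.08


def gate_action(actor: str, action_type: str) -> float:
    """Return 0.0 if the actor may perform the action, else the penalty."""
    if action_type in ALLOWED_ACTIONS.get(actor, set()):
        return 0.0
    return WRONG_ACTOR_PENALTY


if __name__ == "__main__":
    print(gate_action("investigator_agent", "apply_fix"))
    print(gate_action("ops_manager_agent", "apply_fix"))
```

Keeping the map as plain data means the same table can drive both the penalty and the "allowed actors per action" field that the observation schema exposes.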

docs/BLOG_POST.md CHANGED

@@ -24,7 +24,7 @@
 Each role has **different permissions**, **different information needs**, and a **different clock to beat**. Get it wrong and you bleed budget, bust the SLA, and — if the customer is on an enterprise contract — lose serious money (~3× what a free-tier outage costs).
-I built a simulator of that war room — an **OpenEnv-compatible** environment with 13 realistic incidents, 3 specialist roles, and 14+ named reward signals — and fine-tuned an LLM to run it.
 | Role | Can do | Cannot do |
 |---|---|---|
@@ -242,7 +242,7 @@ I ran the exact same pipeline with the smaller **Qwen2.5-0.5B-Instruct** backbon
 ## 8. What's next
 - **Replace SFT with GRPO or PPO** using the environment's native reward signal — no heuristic teacher, let the rubric itself shape the policy and push past the imitation ceiling.
-- **Scale the incident catalog** from 13 templates to 50+ (drop in JSON-defined scenarios).
 - **Add a second "adversarial" agent** that injects misleading signals to test robustness.
 - **Record the 2-minute walkthrough** from [`docs/VIDEO_SCRIPT.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/VIDEO_SCRIPT.md) as a bonus companion to this writeup.

 Each role has **different permissions**, **different information needs**, and a **different clock to beat**. Get it wrong and you bleed budget, bust the SLA, and — if the customer is on an enterprise contract — lose serious money (~3× what a free-tier outage costs).
+I built a simulator of that war room — an **OpenEnv-compatible** environment with **30 realistic incident templates**, 3 specialist roles, and 14+ named reward signals — and fine-tuned an LLM to run it.
 | Role | Can do | Cannot do |
 |---|---|---|
 ## 8. What's next
 - **Replace SFT with GRPO or PPO** using the environment's native reward signal — no heuristic teacher, let the rubric itself shape the policy and push past the imitation ceiling.
+- **Grow the incident catalog further** (now at 30 templates — next stop 50+ via JSON-defined scenarios).
 - **Add a second "adversarial" agent** that injects misleading signals to test robustness.
 - **Record the 2-minute walkthrough** from [`docs/VIDEO_SCRIPT.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/VIDEO_SCRIPT.md) as a bonus companion to this writeup.

docs/SUBMISSION_CHECKLIST.md CHANGED

@@ -26,7 +26,7 @@ Status against every hard gate in the official judging rules, plus every polish
 - [x] Multi-role, multi-agent — `triage_agent`, `investigator_agent`, `ops_manager_agent` with **non-overlapping permissions** (`server/domain/roles.py`).
 - [x] Long-horizon — 3–5 sequential incidents per episode, 20–60 steps each, shared SLA + budget counters.
 - [x] Professional / enterprise task simulation — realistic logs, metrics, KB articles, customer-tier revenue impact, SLA timers.
-- [x] 13 unique incident templates across easy / medium / hard (`server/domain/incidents.py`).
 - [x] Rich observation schema — customer tier, revenue impact, allowed actors per action, investigation targets grouped by tool, playbook hints, `reward_components`, `last_action_notes`.
 - [x] Composable reward rubric with **14+ named components** and anti-gaming safeguards (`server/domain/reward.py`).
 - [x] Tier-weighted business impact (`free ×0.6 · standard ×1.0 · premium ×1.4 · enterprise ×1.8`).

 - [x] Multi-role, multi-agent — `triage_agent`, `investigator_agent`, `ops_manager_agent` with **non-overlapping permissions** (`server/domain/roles.py`).
 - [x] Long-horizon — 3–5 sequential incidents per episode, 20–60 steps each, shared SLA + budget counters.
 - [x] Professional / enterprise task simulation — realistic logs, metrics, KB articles, customer-tier revenue impact, SLA timers.
+- [x] **30 unique incident templates** across easy / medium / hard (`server/domain/incidents.py`) — 8 easy, 11 medium, 11 hard, covering services (payments, auth, CDN, search, DNS, ML inference, storage, scheduling, messaging, config distribution) and failure modes (OOM, cert expiry, config drift, DNS TTL staleness, rate-limit cascades, GPU fragmentation, cross-region replication lag, DST scheduler bugs, firmware regressions, cache-key tenant collisions).
 - [x] Rich observation schema — customer tier, revenue impact, allowed actors per action, investigation targets grouped by tool, playbook hints, `reward_components`, `last_action_notes`.
 - [x] Composable reward rubric with **14+ named components** and anti-gaming safeguards (`server/domain/reward.py`).
 - [x] Tier-weighted business impact (`free ×0.6 · standard ×1.0 · premium ×1.4 · enterprise ×1.8`).
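
The multipliers on the last checklist line account for the "~3×" figure quoted in the README and blog post: 1.8 / 0.6 = 3. A short sketch of that arithmetic follows. The base values 0.8 and −1.1 are back-derived here from the README's enterprise numbers (+1.44 and −1.98). They are assumptions, not values read from `server/domain/reward.py`, and the dictionary is a local copy of the multipliers rather than the imported `TIER_MULTIPLIER`.

```python
# Local copy of the multipliers listed in the checklist.
TIER_MULTIPLIER = {
    "free": 0.6,
    "standard": 1.0,
    "premium": 1.4,
    "enterprise": 1.8,
}


def scaled(base: float, tier: str) -> float:
    """Scale a base reward component by the customer-tier multiplier."""
    return round(base * TIER_MULTIPLIER[tier], 4)


# Enterprise vs. free-tier weighting.
ratio = round(TIER_MULTIPLIER["enterprise"] / TIER_MULTIPLIER["free"], 2)

# Assumed base values, inferred from the README's enterprise figures.
ASSUMED_CORRECT_CLOSURE_BASE = 0.8
ASSUMED_WRONG_ROOT_CAUSE_BASE = -1.1

if __name__ == "__main__":
    print("enterprise / free ratio:", ratio)
    print("enterprise closure:", scaled(ASSUMED_CORRECT_CLOSURE_BASE, "enterprise"))
    print("enterprise wrong root cause:", scaled(ASSUMED_WRONG_ROOT_CAUSE_BASE, "enterprise"))
    print("free-tier closure:", scaled(ASSUMED_CORRECT_CLOSURE_BASE, "free"))
```

Under these assumed bases, the same correct closure is worth 0.48 on a free-tier ticket and 1.44 on an enterprise ticket.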

server/app.py CHANGED

@@ -38,8 +38,13 @@ from server.domain.reward import (
     TIER_MULTIPLIER,
 )
 from server.environment import IncidentCommandCenterEnvironment
 from server.logging_utils import configure_logging
 _LOG = logging.getLogger("icc.app")
 _CONFIG = EnvConfig.from_env()
 configure_logging(level=_CONFIG.log_level, structured=_CONFIG.structured_logging)
@@ -173,6 +178,154 @@ async def env_info() -> JSONResponse:
     return JSONResponse(_metadata_payload())
 @app.get("/metrics", response_class=PlainTextResponse)
 async def metrics() -> PlainTextResponse:
     env = _resolve_environment()
@@ -326,6 +479,81 @@ def _dashboard_html() -> str:
     # so the existing `{themes_html}` slot renders to nothing (no duplication).
     themes_html = ""
     # --- Reward-rubric details ----------------------------------------------
     reward_rubric_rows = "".join(
         f"<tr><td><code>{name}</code></td><td>{value}</td></tr>"
@@ -402,6 +630,40 @@ def _dashboard_html() -> str:
     td.delta.good {{ color: var(--good); }}
     .links {{ display:flex; flex-wrap:wrap; gap:0.5rem; }}
     /* "Story in 2 minutes" hero panel — plain-English summary for judges. */
     .hero-card {{
       background: linear-gradient(135deg, #0f2647 0%, #172a4a 60%, #1f2a44 100%);
@@ -477,7 +739,8 @@ def _dashboard_html() -> str:
       <h3 style='margin-top:1.25rem'>What is the environment?</h3>
       <p class='sub' style='margin:0 0 0.75rem'>
         Three specialist agents with <strong>different permissions</strong> resolve
-        a live queue of 13 realistic tech incidents across 3 difficulty tiers.
       </p>
       <div class='table-wrap'>
         <table>
@@ -684,6 +947,8 @@ def _dashboard_html() -> str:
     {ablation_html}
     {themes_html}
     <h2>Endpoints</h2>
@@ -763,6 +1028,68 @@ def _dashboard_html() -> str:
       const total = Object.values(data.incidents_per_task || {{}}).reduce((a,b)=>a+b,0);
       document.getElementById('kpi-inc').textContent = total;
     }} catch (e) {{}}
   </script>
 </body>
 </html>

     TIER_MULTIPLIER,
 )
 from server.environment import IncidentCommandCenterEnvironment
+from server import llm_remote
 from server.logging_utils import configure_logging
+import re as _re
+_JSON_RE = _re.compile(r"\{[\s\S]*\}")
 _LOG = logging.getLogger("icc.app")
 _CONFIG = EnvConfig.from_env()
 configure_logging(level=_CONFIG.log_level, structured=_CONFIG.structured_logging)
     return JSONResponse(_metadata_payload())
+# ---------------------------------------------------------------------------
+# Live LLM inference demo (optional — only enabled when HF credentials set)
+# ---------------------------------------------------------------------------
+def _build_demo_prompt(obs: IncidentObservation) -> str:
+    """Same prompt format the SFT model was fine-tuned on (train_trl.obs_to_prompt)."""
+    targets = obs.investigation_targets or {}
+    return (
+        "You are operating a multi-agent incident command center. "
+        "Pick the next action for the appropriate specialist role.\n\n"
+        f"Incident ID: {obs.incident_id}\n"
+        f"Title: {obs.incident_title}\n"
+        f"Description: {obs.incident_description}\n"
+        f"Customer tier: {obs.customer_tier} | "
+        f"Affected users: {obs.affected_users_estimate} | "
+        f"Revenue impact (USD/min): {obs.revenue_impact_usd_per_min}\n"
+        f"Postmortem required: {obs.postmortem_required}\n"
+        f"Visible signals: {', '.join(obs.visible_signals or [])}\n"
+        f"Available log targets: {', '.join(targets.get('logs', []) or [])}\n"
+        f"Available metric targets: {', '.join(targets.get('metrics', []) or [])}\n"
+        f"Available KB articles: {', '.join(targets.get('kb', []) or [])}\n"
+        f"Budget remaining: {obs.budget_remaining} actions | "
+        f"SLA remaining: {obs.sla_minutes_remaining} min | "
+        f"Clues found: {obs.clues_found} | "
+        f"Mitigation applied: {obs.mitigation_applied}\n"
+        f"Last terminal output: {obs.terminal_output}\n\n"
+        "Respond with a JSON object containing exactly these keys: "
+        "actor, action_type, target, root_cause, resolution_summary, "
+        "postmortem_note, confidence, reason."
+    )
+def _parse_llm_action(response_text: str) -> Dict[str, Any]:
+    """Extract the first balanced JSON object from a model response."""
+    match = _JSON_RE.search(response_text or "")
+    if not match:
+        return {}
+    raw = match.group(0)
+    last_close = raw.rfind("}")
+    if last_close != -1:
+        raw = raw[: last_close + 1]
+    try:
+        return json.loads(raw)
+    except (json.JSONDecodeError, TypeError):
+        return {}
+@app.get("/llm-demo-status", response_class=JSONResponse)
+async def llm_demo_status() -> JSONResponse:
+    """Report whether the live-inference panel is usable (credentials set)."""
+    return JSONResponse(llm_remote.status_summary())
+@app.post("/llm-demo", response_class=JSONResponse)
+async def llm_demo(payload: Dict[str, Any]) -> JSONResponse:
+    """Run one live step against the fine-tuned model behind an HF endpoint.
+    Spins up a fresh isolated ``IncidentCommandCenterEnvironment`` for each
+    call so the demo never disturbs the main environment instance that is
+    answering ``/reset`` and ``/step`` for training clients. Returns the full
+    trace (observation → prompt → raw LLM text → parsed action → reward) so
+    judges can see exactly what the model produced.
+    """
+    if not llm_remote.is_configured():
+        return JSONResponse(
+            {
+                "error": "Remote LLM not configured on this Space.",
+                "status": llm_remote.status_summary(),
+            },
+            status_code=503,
+        )
+    task_name = str(payload.get("task_name") or "easy").strip()
+    try:
+        seed = int(payload.get("seed") or _CONFIG.default_seed)
+    except (TypeError, ValueError):
+        seed = _CONFIG.default_seed
+    # Isolated env so the live demo never clobbers the shared state.
+    env = IncidentCommandCenterEnvironment()
+    obs = env.reset(task_name=task_name, seed=seed)
+    prompt = _build_demo_prompt(obs)
+    try:
+        raw_response = llm_remote.generate(prompt)
+    except Exception as exc:  # pragma: no cover - network-dependent
+        return JSONResponse(
+            {
+                "error": f"Remote LLM call failed: {exc}",
+                "status": llm_remote.status_summary(),
+            },
+            status_code=502,
+        )
+    parsed_action_dict = _parse_llm_action(raw_response)
+    try:
+        action = IncidentAction(**parsed_action_dict)
+        parsed_ok = True
+    except Exception:
+        logs = (obs.investigation_targets or {}).get("logs", []) or []
+        fallback_target = logs[0] if logs else "payments-api"
+        action = IncidentAction(
+            actor="triage_agent",
+            action_type="inspect_logs",
+            target=fallback_target,
+            reason="Fallback (LLM JSON invalid).",
+        )
+        parsed_ok = False
+    step_obs = env.step(action)
+    reward_components = dict(step_obs.reward_components or {})
+    reward_total = sum(reward_components.values()) if reward_components else 0.0
+    return JSONResponse(
+        {
+            "task_name": task_name,
+            "seed": seed,
+            "observation_before": {
+                "incident_id": obs.incident_id,
+                "incident_title": obs.incident_title,
+                "customer_tier": obs.customer_tier,
+                "affected_users_estimate": obs.affected_users_estimate,
+                "revenue_impact_usd_per_min": obs.revenue_impact_usd_per_min,
+                "visible_signals": obs.visible_signals,
+                "investigation_targets": obs.investigation_targets,
+                "budget_remaining": obs.budget_remaining,
+                "sla_minutes_remaining": obs.sla_minutes_remaining,
+            },
+            "prompt": prompt,
+            "raw_llm_response": raw_response,
+            "parsed_action": parsed_action_dict,
+            "validated_action": action.model_dump(exclude_none=True),
+            "fallback_used": not parsed_ok,
+            "step_result": {
+                "reward_total": round(reward_total, 4),
+                "reward_components": {
+                    k: round(v, 4) for k, v in reward_components.items()
+                },
+                "done": bool(step_obs.done),
+                "terminal_output": step_obs.terminal_output,
+                "last_action_notes": list(step_obs.last_action_notes or []),
+            },
+        }
+    )
 @app.get("/metrics", response_class=PlainTextResponse)
 async def metrics() -> PlainTextResponse:
     env = _resolve_environment()
     # so the existing `{themes_html}` slot renders to nothing (no duplication).
     themes_html = ""
+    # --- Live inference panel (only shown when HF credentials set) ----------
+    llm_status = llm_remote.status_summary()
+    if llm_status.get("configured"):
+        live_panel_html = f"""
+    <h2>Try the fine-tuned model live</h2>
+    <div class='card'>
+      <p class='sub'>
+        Spin up an isolated episode and watch the <strong>fine-tuned SFT model</strong>
+        pick the next action in real time. The prompt below is the exact format
+        used during training, so you can see how the model transforms a raw
+        observation into a typed <code>IncidentAction</code> — and the
+        environment's reward response.
+      </p>
+      <div class='live-controls'>
+        <label>Task
+          <select id='live-task'>
+            <option value='easy'>easy</option>
+            <option value='medium'>medium</option>
+            <option value='hard' selected>hard</option>
+          </select>
+        </label>
+        <label>Seed
+          <input id='live-seed' type='number' value='42' min='0' step='1' />
+        </label>
+        <button id='live-run' class='pill cta'>▶ Run one step</button>
+        <span id='live-status' class='sub'>Endpoint: {llm_status.get('host', '—')} · mode: {llm_status.get('mode', 'chat')}</span>
+      </div>
+      <div id='live-output' class='live-output' hidden>
+        <div class='live-grid'>
+          <div>
+            <h4>Observation (before)</h4>
+            <pre id='live-obs-before'></pre>
+          </div>
+          <div>
+            <h4>Prompt sent to model</h4>
+            <pre id='live-prompt'></pre>
+          </div>
+          <div>
+            <h4>Raw LLM response</h4>
+            <pre id='live-raw'></pre>
+          </div>
+          <div>
+            <h4>Parsed &amp; validated action</h4>
+            <pre id='live-action'></pre>
+          </div>
+          <div class='live-grid-full'>
+            <h4>Environment step result</h4>
+            <pre id='live-step'></pre>
+          </div>
+        </div>
+      </div>
+      <div id='live-error' class='live-error' hidden></div>
+    </div>
+"""
+    else:
+        live_panel_html = f"""
+    <h2>Try the fine-tuned model live</h2>
+    <div class='card'>
+      <p class='sub'>
+        <strong>Optional bonus panel.</strong> This Space can stream the
+        fine-tuned SFT model's decisions in real time when a Hugging Face
+        Inference Endpoint is attached. {llm_status.get('reason', '')}
+      </p>
+      <details>
+        <summary class='sub'>How the owner enables it</summary>
+        <ol>
+          <li>Upload the SFT checkpoint from <code>artifacts/sft_model/</code> to a model repo on the Hub.</li>
+          <li>Create a dedicated <a href='https://huggingface.co/inference-endpoints' target='_blank' rel='noopener'>Inference Endpoint</a> (T4 small is enough).</li>
+          <li>Set <code>LLM_ENDPOINT_URL</code> and <code>HF_TOKEN</code> as secrets on this Space.</li>
+          <li>Restart the Space — this panel turns on automatically.</li>
+        </ol>
+      </details>
+    </div>
+"""
     # --- Reward-rubric details ----------------------------------------------
     reward_rubric_rows = "".join(
         f"<tr><td><code>{name}</code></td><td>{value}</td></tr>"
     td.delta.good {{ color: var(--good); }}
     .links {{ display:flex; flex-wrap:wrap; gap:0.5rem; }}
+    /* Live-inference panel (fine-tuned SFT model behind HF Inference Endpoint). */
+    .live-controls {{
+      display:flex; flex-wrap:wrap; gap:1rem; align-items:center;
+      margin:0.75rem 0 1rem;
+    }}
+    .live-controls label {{
+      display:flex; flex-direction:column; gap:0.2rem;
+      font-size:0.8rem; color:var(--muted);
+    }}
+    .live-controls select, .live-controls input {{
+      background:#0b1225; border:1px solid #1f2a44; color:var(--text);
+      border-radius:8px; padding:0.35rem 0.55rem; font-size:0.9rem; min-width:110px;
+    }}
+    .live-controls button.pill.cta {{ cursor:pointer; border:0; }}
+    .live-controls button.pill.cta:disabled {{ opacity:0.6; cursor:wait; }}
+    .live-grid {{
+      display:grid; grid-template-columns: repeat(auto-fit, minmax(360px, 1fr));
+      gap:0.9rem; margin-top:0.5rem;
+    }}
+    .live-grid h4 {{
+      margin:0 0 0.3rem; font-size:0.85rem; color:#cbd5e1;
+      text-transform:uppercase; letter-spacing:0.04em;
+    }}
+    .live-grid .live-grid-full {{ grid-column: 1 / -1; }}
+    .live-grid pre {{
+      background:#0b1225; border:1px solid #1f2a44; border-radius:10px;
+      padding:0.75rem; margin:0; font-size:0.82rem; line-height:1.45;
+      max-height:320px; overflow:auto; white-space:pre-wrap; word-wrap:break-word;
+    }}
+    .live-error {{
+      background:#2a1418; border:1px solid #ef444455; color:#fca5a5;
+      border-radius:10px; padding:0.75rem; margin-top:0.75rem; font-size:0.9rem;
+    }}
     /* "Story in 2 minutes" hero panel — plain-English summary for judges. */
     .hero-card {{
       background: linear-gradient(135deg, #0f2647 0%, #172a4a 60%, #1f2a44 100%);
       <h3 style='margin-top:1.25rem'>What is the environment?</h3>
       <p class='sub' style='margin:0 0 0.75rem'>
         Three specialist agents with <strong>different permissions</strong> resolve
+        a live queue drawn from <strong>30 realistic tech incident templates</strong>
+        across 3 difficulty tiers.
       </p>
       <div class='table-wrap'>
         <table>
     {ablation_html}
+    {live_panel_html}
     {themes_html}
     <h2>Endpoints</h2>
       const total = Object.values(data.incidents_per_task || {{}}).reduce((a,b)=>a+b,0);
       document.getElementById('kpi-inc').textContent = total;
     }} catch (e) {{}}
+    // Live fine-tuned-model demo. Only runs if the panel is rendered.
+    (function() {{
+      const runBtn = document.getElementById('live-run');
+      if (!runBtn) return;
+      const taskSel = document.getElementById('live-task');
+      const seedInp = document.getElementById('live-seed');
+      const out     = document.getElementById('live-output');
+      const err     = document.getElementById('live-error');
+      const obsPre  = document.getElementById('live-obs-before');
+      const promptPre = document.getElementById('live-prompt');
+      const rawPre  = document.getElementById('live-raw');
+      const actPre  = document.getElementById('live-action');
+      const stepPre = document.getElementById('live-step');
+      function showError(msg) {{
+        err.textContent = msg;
+        err.hidden = false;
+        out.hidden = true;
+      }}
+      function renderOutput(data) {{
+        err.hidden = true;
+        obsPre.textContent = JSON.stringify(data.observation_before || {{}}, null, 2);
+        promptPre.textContent = data.prompt || '';
+        rawPre.textContent = data.raw_llm_response || '(empty response)';
+        const fallbackTag = data.fallback_used
+          ? '// NOTE: LLM JSON was invalid — safe fallback action was used instead.\\n'
+          : '';
+        actPre.textContent = fallbackTag + JSON.stringify(data.validated_action || {{}}, null, 2);
+        stepPre.textContent = JSON.stringify(data.step_result || {{}}, null, 2);
+        out.hidden = false;
+      }}
+      runBtn.addEventListener('click', async () => {{
+        runBtn.disabled = true;
+        const label = runBtn.textContent;
+        runBtn.textContent = '⏳ Calling model…';
+        try {{
+          const resp = await fetch('/llm-demo', {{
+            method: 'POST',
+            headers: {{'Content-Type': 'application/json'}},
+            body: JSON.stringify({{
+              task_name: taskSel.value,
+              seed: Number(seedInp.value) || 0
+            }})
+          }});
+          const data = await resp.json();
+          if (!resp.ok) {{
+            showError((data && data.error) ? data.error : ('HTTP ' + resp.status));
+          }} else {{
+            renderOutput(data);
+          }}
+        }} catch (e) {{
+          showError('Network error: ' + e.message);
+        }} finally {{
+          runBtn.disabled = false;
+          runBtn.textContent = label;
+        }}
+      }});
+    }})();
   </script>
 </body>
 </html>
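
`_parse_llm_action` in this diff matches greedily from the first `{` to the last `}` in the response. Any trailing text that contains a brace therefore makes `json.loads` fail and pushes the demo onto its fallback action. A minimal sketch of a stricter alternative follows. It assumes it is acceptable to scan for the first position where the standard library's `json.JSONDecoder.raw_decode` yields an object. The helper name `extract_first_json_object` is hypothetical and not part of the commit.

```python
import json
from typing import Any, Dict, Optional

_DECODER = json.JSONDecoder()


def extract_first_json_object(text: Optional[str]) -> Dict[str, Any]:
    """Return the first parseable JSON object in ``text``, or {} if none."""
    text = text or ""
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at ``start`` and
            # ignores whatever follows it.
            value, _end = _DECODER.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # Not valid JSON from this brace; try the next candidate.
        start = text.find("{", start + 1)
    return {}


if __name__ == "__main__":
    sample = (
        'Sure! {"actor": "triage_agent", "action_type": "inspect_logs"} '
        "Hope that helps {not json}"
    )
    print(extract_first_json_object(sample))
```

On the sample above, a first-to-last-brace match would include the trailing `{not json}` fragment and fail to parse, whereas this helper stops at the end of the first valid object.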

server/domain/incidents.py CHANGED

@@ -850,17 +850,885 @@ def _deadlock_database() -> IncidentTemplate:
     )
 def build_incident_library() -> IncidentLibrary:
-    """Return the built-in enterprise incident library."""
     return IncidentLibrary(
         templates_by_task={
-            "easy": [_redis_pool(), _jwt_clock_skew(), _email_spam_false_positive()],
             "medium": [
                 _cache_invalidation_lag(),
                 _tz_normalization(),
                 _invoice_idempotency(),
                 _tls_expiry(),
                 _feature_flag_rollout(),
             ],
             "hard": [
                 _promo_rate_cascade(),
@@ -868,6 +1736,12 @@ def build_incident_library() -> IncidentLibrary:
                 _alert_storm(),
                 _inventory_race(),
                 _deadlock_database(),
             ],
         }
     )

     )
+# ---------------------------------------------------------------------------
+# Extended catalog (round-2 polish)
+#
+# 17 additional templates balance the tier mix (free / standard / premium /
+# enterprise), add new service dimensions (DNS, CDN, ML inference, storage,
+# message queue, config distribution) and new failure modes (GPU memory leaks,
+# replication saturation, cache key collisions, firmware regressions, DST
+# bugs). Each template follows the same pattern as INC-E1..H5 so the reward
+# rubric, environment plumbing and training scripts require no changes.
+# ---------------------------------------------------------------------------
+def _dns_ttl_stale() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-E4",
+        title="Stale DNS routes free-tier API traffic to drained region",
+        description=(
+            "Free-tier API callers keep hitting a drained region even after "
+            "a planned failover because DNS TTLs have not expired."
+        ),
+        category="networking",
+        difficulty="easy",
+        root_cause="dns_ttl_stale_after_failover",
+        root_cause_synonyms=(
+            "dns ttl stale after failover",
+            "stale dns record",
+            "long ttl blocking failover",
+        ),
+        clue_keywords=("dns", "ttl", "failover", "drain"),
+        signals=(
+            "Traffic ratio to drained region stays above 30% 30 minutes post-failover",
+            "Only free-tier resolvers (no Anycast) are affected",
+        ),
+        logs={
+            "dns-edge": "A record TTL=3600s still cached at regional resolvers",
+            "traffic-router": "Residual traffic observed on drained region us-west-2b",
+        },
+        red_herring_logs={
+            "payments-api": "steady 2xx",
+        },
+        metrics={
+            "dash-dns": "ttl_expired_ratio 0.71 (expected >0.95)",
+            "dash-router": "drained_region_share 34%",
+        },
+        red_herring_metrics={
+            "dash-cdn": "hit_ratio 95%",
+        },
+        kb={
+            "kb-dns-ttl": "Pre-lower TTL to 60s at least 2 TTLs before planned failovers.",
+        },
+        good_handoff="triage_agent",
+        accepted_fix_keywords=(
+            ("shorten", "dns", "ttl"),
+            ("force", "resolver", "refresh"),
+            ("rollback", "region", "drain"),
+        ),
+        required_investigations=1,
+        customer_tier="free",
+        affected_users_estimate=2_500,
+        revenue_impact_usd_per_min=15,
+        requires_mitigation=True,
+    )
+def _cdn_purge_scope() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-E5",
+        title="CDN purge missed a hot asset after release",
+        description=(
+            "A marketing banner refresh missed a subset of CDN edges, so a "
+            "fraction of standard-tier users see the old creative."
+        ),
+        category="cdn",
+        difficulty="easy",
+        root_cause="cdn_purge_scope_mismatch",
+        root_cause_synonyms=(
+            "cdn purge scope mismatch",
+            "edge purge partial",
+            "shield purge missed",
+        ),
+        clue_keywords=("cdn", "purge", "edge", "shield"),
+        signals=(
+            "Small but persistent share of stale banner impressions",
+            "Affected edges cluster on a single PoP provider",
+        ),
+        logs={
+            "cdn-control-plane": "Purge job completed with 14 edges skipped (policy=legacy)",
+            "edge-pop-bom-1": "Serving banner_v12 while origin is on banner_v13",
+        },
+        metrics={
+            "dash-cdn": "stale_object_rate 1.4%, edge_sync_lag_s 312",
+        },
+        red_herring_metrics={
+            "dash-auth": "401_rate 0.2%",
+        },
+        kb={
+            "kb-cdn-purge": "Always use wildcard purge with full edge fanout for visual assets.",
+        },
+        good_handoff="investigator_agent",
+        accepted_fix_keywords=(
+            ("reissue", "cdn", "purge"),
+            ("fanout", "edge", "invalidation"),
+            ("rotate", "asset", "hash"),
+        ),
+        required_investigations=1,
+        customer_tier="standard",
+        affected_users_estimate=11_000,
+        revenue_impact_usd_per_min=60,
+        requires_mitigation=True,
+    )
+def _autocomplete_stale() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-E6",
+        title="Search autocomplete missing this week's products",
+        description=(
+            "Free-tier shoppers see a stale autocomplete list that does not "
+            "surface new SKUs released this Monday."
+        ),
+        category="search",
+        difficulty="easy",
+        root_cause="autocomplete_index_rebuild_skipped",
+        root_cause_synonyms=(
+            "autocomplete index rebuild skipped",
+            "suggestion index stale",
+            "nightly reindex missed",
+        ),
+        clue_keywords=("autocomplete", "index", "reindex", "suggestion"),
+        signals=(
+            "New SKUs launched Monday never appear in suggest responses",
+            "Full text search returns them correctly",
+        ),
+        logs={
+            "suggest-indexer": "Scheduled rebuild skipped (upstream lock held)",
+            "suggest-api": "Serving snapshot v88 (expected v91)",
+        },
+        red_herring_logs={
+            "payments-api": "steady 2xx",
+        },
+        metrics={
+            "dash-suggest": "index_version 88, target_version 91",
+            "dash-search": "full_text_recall 99%, autocomplete_recall 71%",
+        },
+        kb={
+            "kb-autocomplete": "Reindex lock must release on job exit and alert on missed window.",
+        },
+        good_handoff="ops_manager_agent",
+        accepted_fix_keywords=(
+            ("force", "index", "rebuild"),
+            ("release", "reindex", "lock"),
+            ("promote", "suggestion", "snapshot"),
+        ),
+        required_investigations=1,
+        customer_tier="free",
+        affected_users_estimate=18_000,
+        revenue_impact_usd_per_min=30,
+        requires_mitigation=True,
+    )
+def _webhook_retry_budget() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-E7",
+        title="Partner webhooks silently dropping",
+        description=(
+            "A handful of partner integrations stopped receiving webhook "
+            "deliveries after a downstream 429 spike."
+        ),
+        category="integrations",
+        difficulty="easy",
+        root_cause="webhook_retry_budget_exhausted",
+        root_cause_synonyms=(
+            "webhook retry budget exhausted",
+            "partner webhook giving up",
+            "429 retry exhaustion",
+        ),
+        clue_keywords=("webhook", "retry", "429", "budget"),
+        signals=(
+            "Deliveries succeed for some partners and silently fail for others",
+            "Affected partners all share a single rate-limit bucket",
+        ),
+        logs={
+            "webhook-dispatcher": "Retry budget exhausted for partner_bucket=bucket-7",
+            "partner-gateway": "HTTP 429 for 22 consecutive attempts on bucket-7",
+        },
+        red_herring_logs={
+            "catalog-api": "steady 2xx",
+        },
+        metrics={
+            "dash-webhooks": "delivery_success_bucket7 34%, retry_budget_remaining 0",
+        },
+        kb={
+            "kb-webhook-retry": "Split rate-limit buckets per partner and reset retry budgets on recovery.",
+        },
+        good_handoff="ops_manager_agent",
+        accepted_fix_keywords=(
+            ("split", "retry", "bucket"),
+            ("reset", "retry", "budget"),
+            ("pause", "partner", "bucket"),
+        ),
+        required_investigations=2,
+        customer_tier="standard",
+        affected_users_estimate=1_400,
+        revenue_impact_usd_per_min=80,
+        requires_mitigation=True,
+    )
+def _thumbnail_worker_oom() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-E8",
+        title="User profile thumbnails render blank on mobile",
+        description=(
+            "Free-tier mobile users see empty circles where their profile "
+            "photo should appear, intermittently."
+        ),
+        category="media",
+        difficulty="easy",
+        root_cause="thumbnail_worker_oom_killed",
+        root_cause_synonyms=(
+            "thumbnail worker oom killed",
+            "image worker out of memory",
+            "thumbnailer oom loop",
+        ),
+        clue_keywords=("thumbnail", "oom", "memory", "worker"),
+        signals=(
+            "Missing thumbnails correlate with HEIC uploads from newer devices",
+            "CPU is normal but worker restart count is spiking",
+        ),
+        logs={
+            "thumbnail-worker": "SIGKILL received (oom_score_adj=500)",
+            "image-pipeline": "HEIC decoder peak rss 1.9GB on large uploads",
+        },
+        metrics={
+            "dash-thumbnails": "render_success 82%, worker_restarts 240/hr",
+            "dash-k8s": "pod_oom_kill_count 42",
+        },
+        kb={
+            "kb-thumbnail": "Cap HEIC decode memory or reject above 30MP at the edge.",
+        },
+        good_handoff="triage_agent",
+        accepted_fix_keywords=(
+            ("raise", "memory", "limit"),
+            ("reject", "oversized", "heic"),
+            ("downscale", "before", "decode"),
+        ),
+        required_investigations=2,
+        customer_tier="free",
+        affected_users_estimate=55_000,
+        revenue_impact_usd_per_min=20,
+        requires_mitigation=True,
+    )
+def _recommender_heap_leak() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-M6",
+        title="Recommender latency drifts up after model swap",
+        description=(
+            "Homepage recommendation latency is drifting up over six hours "
+            "since this morning's model swap. p99 is now 2.1s."
+        ),
+        category="recommendations",
+        difficulty="medium",
+        root_cause="recommender_heap_leak_after_model_swap",
+        root_cause_synonyms=(
+            "recommender heap leak after model swap",
+            "embedding cache not released",
+            "old model tensors pinned",
+        ),
+        clue_keywords=("heap", "leak", "embedding", "model", "swap"),
+        signals=(
+            "Heap utilisation climbs 2% / hour since deploy",
+            "Full GC frequency doubled but does not recover memory",
+        ),
+        logs={
+            "recommender-service": "Loaded model v42; previous tensors not released",
+            "jvm-gc": "Old gen occupancy 88% after full GC",
+        },
+        red_herring_logs={
+            "catalog-api": "steady 2xx",
+        },
+        metrics={
+            "dash-recommender": "p99_latency_ms 2100, heap_used_pct 88",
+            "dash-jvm": "full_gc_per_min 4, reclaimed_bytes_low",
+        },
+        red_herring_metrics={
+            "dash-search": "ctr steady",
+        },
+        kb={
+            "kb-model-swap": "Release previous model tensors explicitly before binding the new one.",
+        },
+        good_handoff="investigator_agent",
+        accepted_fix_keywords=(
+            ("release", "previous", "model"),
+            ("unload", "embedding", "cache"),
+            ("rollback", "model", "swap"),
+        ),
+        required_investigations=2,
+        customer_tier="premium",
+        affected_users_estimate=95_000,
+        revenue_impact_usd_per_min=410,
+        requires_mitigation=True,
+    )
+def _consumer_group_rebalance() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-M7",
+        title="Order events stuck behind consumer rebalance storm",
+        description=(
+            "Order processing lag spiked after a rolling restart and has not "
+            "recovered; fresh orders are 90s behind real time."
+        ),
+        category="messaging",
+        difficulty="medium",
+        root_cause="consumer_group_rebalance_storm",
+        root_cause_synonyms=(
+            "consumer group rebalance storm",
+            "kafka consumer thrashing",
+            "repeated partition reassignment",
+        ),
+        clue_keywords=("kafka", "consumer", "rebalance", "partition"),
+        signals=(
+            "Consumer group rebalanced 11 times in 5 minutes",
+            "Lag stuck even though CPU is at 30%",
+        ),
+        logs={
+            "order-consumer": "Rebalance triggered: member id rotated, session timeout=10s",
+            "kafka-coordinator": "Generation 412 -> 423 in 5m, partitions churning",
+        },
+        red_herring_logs={
+            "auth-service": "normal 2xx",
+        },
+        metrics={
+            "dash-orders": "consumer_lag 90s, rebalance_count_5m 11",
+            "dash-kafka": "generation_rotations 2.2/min",
+        },
+        kb={
+            "kb-consumer-tuning": "Raise session.timeout.ms and heartbeat.interval.ms to avoid false expulsion.",
+        },
+        good_handoff="ops_manager_agent",
+        accepted_fix_keywords=(
+            ("raise", "session", "timeout"),
+            ("pin", "static", "membership"),
+            ("stabilise", "consumer", "group"),
+        ),
+        required_investigations=2,
+        customer_tier="premium",
+        affected_users_estimate=48_000,
+        revenue_impact_usd_per_min=520,
+        requires_mitigation=True,
+    )
+def _config_push_skipped_canary() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-M8",
+        title="Enterprise tenants hit TLS verify failures after config push",
+        description=(
+            "A global config change flipped a TLS verification flag in "
+            "production without going through canary."
+        ),
+        category="platform",
+        difficulty="medium",
+        root_cause="config_push_skipped_canary",
+        root_cause_synonyms=(
+            "config push skipped canary",
+            "global config bypassed stage",
+            "bulk config rollout regression",
+        ),
+        clue_keywords=("config", "canary", "push", "rollout"),
+        signals=(
+            "Enterprise tenants see TLS verify errors 3 minutes after deploy",
+            "Canary stage shows zero traffic for this change",
+        ),
+        logs={
+            "config-service": "Changeset CR-8812 applied globally (stages=[])",
+            "api-gateway": "TLS verify flag=strict caused downstream handshake failures",
+        },
+        red_herring_logs={
+            "email-service": "no anomalies",
+        },
+        metrics={
+            "dash-config": "canary_coverage 0%, rollout_surface 100%",
+            "dash-gateway": "tls_verify_failures 8.3%",
+        },
+        kb={
+            "kb-config-rollout": "Require canary + 15 minutes bake before promoting config changes.",
+        },
+        good_handoff="ops_manager_agent",
+        accepted_fix_keywords=(
+            ("rollback", "config", "change"),
+            ("re-enable", "canary", "stage"),
+            ("revert", "tls", "flag"),
+        ),
+        required_investigations=2,
+        customer_tier="enterprise",
+        affected_users_estimate=2_100,
+        revenue_impact_usd_per_min=640,
+        requires_mitigation=True,
+        postmortem_required=True,
+    )
+def _health_check_flapping() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-M9",
+        title="Autoscaler thrashing under brief latency blips",
+        description=(
+            "Autoscaler is adding and removing pods every 2 minutes in "
+            "response to very short latency blips."
+        ),
+        category="platform",
+        difficulty="medium",
+        root_cause="health_check_timeout_too_aggressive",
+        root_cause_synonyms=(
+            "health check timeout too aggressive",
+            "liveness probe too tight",
+            "autoscaler oscillating",
+        ),
+        clue_keywords=("health", "check", "liveness", "autoscaler"),
+        signals=(
+            "Pod churn 6x baseline with no underlying load change",
+            "Brief p99 blips align with scale events, not incidents",
+        ),
+        logs={
+            "kubelet": "Liveness probe failed: HTTP 500 after 800ms",
+            "autoscaler": "Scale up triggered; 3 pods added, 2 removed within 2m",
+        },
+        red_herring_logs={
+            "payments-api": "steady 2xx",
+        },
+        metrics={
+            "dash-k8s": "pod_churn_per_min 9, cpu_avg 42%",
+            "dash-slo": "p99_latency_ms spikes tied to scale events",
+        },
+        kb={
+            "kb-health-probe": "Raise liveness timeout and stagger readiness to avoid flap-driven scale events.",
+        },
+        good_handoff="triage_agent",
+        accepted_fix_keywords=(
+            ("raise", "probe", "timeout"),
+            ("dampen", "autoscaler", "cooldown"),
+            ("relax", "liveness", "threshold"),
+        ),
+        required_investigations=2,
+        customer_tier="standard",
+        affected_users_estimate=31_000,
+        revenue_impact_usd_per_min=210,
+        requires_mitigation=True,
+    )
+def _payment_webhook_dedupe() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-M10",
+        title="Payment confirmations delivered twice to enterprise partners",
+        description=(
+            "Two enterprise payment partners received the same confirmation "
+            "webhook twice for a subset of transactions."
+        ),
+        category="payments",
+        difficulty="medium",
+        root_cause="webhook_dedupe_window_too_narrow",
+        root_cause_synonyms=(
+            "webhook dedupe window too narrow",
+            "payment webhook duplicate delivery",
+            "idempotency window clock drift",
+        ),
+        clue_keywords=("webhook", "dedupe", "idempotency", "window"),
+        signals=(
+            "Duplicates concentrated on retries across failover boundary",
+            "Dedupe cache TTL is shorter than retry backoff",
+        ),
+        logs={
+            "payments-webhook": "Duplicate delivery for txn T-332a after dedupe cache eviction",
+            "scheduler": "Retry backoff 90s; dedupe ttl=60s",
+        },
+        red_herring_logs={
+            "email-service": "steady",
+        },
+        metrics={
+            "dash-payments": "duplicate_webhook_rate 0.9%, dedupe_hit_rate 88%",
+        },
+        kb={
+            "kb-webhook-dedupe": "Dedupe TTL must exceed the maximum retry backoff window.",
+        },
+        good_handoff="investigator_agent",
+        accepted_fix_keywords=(
+            ("extend", "dedupe", "ttl"),
+            ("shrink", "retry", "backoff"),
+            ("persist", "dedupe", "store"),
+        ),
+        required_investigations=2,
+        customer_tier="enterprise",
+        affected_users_estimate=620,
+        revenue_impact_usd_per_min=480,
+        requires_mitigation=True,
+        postmortem_required=True,
+    )
+def _origin_shield_bypass() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-M11",
+        title="Origin overloaded after CDN policy change",
+        description=(
+            "Origin servers are seeing 5x normal traffic because a CDN "
+            "policy change disabled origin shield for a large segment."
+        ),
+        category="cdn",
+        difficulty="medium",
+        root_cause="origin_shield_bypass_after_policy_change",
+        root_cause_synonyms=(
+            "origin shield bypass after policy change",
+            "shield disabled for segment",
+            "cache hierarchy collapsed",
+        ),
+        clue_keywords=("origin", "shield", "cdn", "policy"),
+        signals=(
+            "Origin 5xx rate climbs as CDN hit ratio collapses",
+            "New CDN policy rolled out exactly at fault onset",
+        ),
+        logs={
+            "cdn-policy": "Policy v5 removed shield targeting for premium segment",
+            "origin-lb": "Connection queue depth spiking 5x baseline",
+        },
+        red_herring_logs={
+            "dns-resolver": "no anomalies",
+        },
+        metrics={
+            "dash-cdn": "hit_ratio 67% (baseline 94%)",
+            "dash-origin": "rps 5.2x baseline, 5xx_rate 7.1%",
+        },
+        kb={
+            "kb-origin-shield": "Changes to shield routing must go through shadow traffic before promotion.",
+        },
+        good_handoff="investigator_agent",
+        accepted_fix_keywords=(
+            ("rollback", "cdn", "policy"),
+            ("re-enable", "origin", "shield"),
+            ("route", "through", "shield"),
+        ),
+        required_investigations=3,
+        customer_tier="premium",
+        affected_users_estimate=240_000,
+        revenue_impact_usd_per_min=1_300,
+        requires_mitigation=True,
+        postmortem_required=True,
+    )
+def _gpu_memory_fragmentation() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-H6",
+        title="LLM inference latency drifts up on production A100 pool",
+        description=(
+            "Enterprise API latency for the inference gateway has drifted "
+            "from 420ms to 1.4s over 36 hours, with OOMs on larger prompts."
+        ),
+        category="ml_inference",
+        difficulty="hard",
+        root_cause="gpu_memory_fragmentation_after_prompt_schema_change",
+        root_cause_synonyms=(
+            "gpu memory fragmentation after prompt schema change",
+            "kv cache fragmentation",
+            "inference pool memory fragmentation",
+        ),
+        clue_keywords=("gpu", "memory", "fragmentation", "kv", "cache"),
+        signals=(
+            "Free VRAM fragmented into small blocks even though total free > 18GB",
+            "OOM errors concentrate on prompts >2k tokens",
+        ),
+        logs={
+            "inference-gateway": "CUDA OOM despite torch reporting 18GB free; fragmentation detected",
+            "model-runner": "Prompt schema v3 increased variable sequence lengths",
+        },
+        red_herring_logs={
+            "auth-service": "steady",
+        },
+        metrics={
+            "dash-inference": "p99_latency_ms 1400, oom_rate 3.2%",
+            "dash-gpu": "vram_fragmentation_score 0.74",
+        },
+        kb={
+            "kb-vram": "Recycle inference workers daily and pad sequences to bucketed lengths.",
+        },
+        good_handoff="investigator_agent",
+        accepted_fix_keywords=(
+            ("recycle", "inference", "workers"),
+            ("bucket", "prompt", "lengths"),
+            ("rollback", "prompt", "schema"),
+        ),
+        required_investigations=3,
+        customer_tier="enterprise",
+        affected_users_estimate=5_200,
+        revenue_impact_usd_per_min=1_850,
+        requires_mitigation=True,
+        postmortem_required=True,
+    )
+def _replication_saturation() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-H7",
+        title="Cross-region replication lag blocks disaster-recovery RPO",
+        description=(
+            "Replication lag from the primary region to DR has exceeded "
+            "five minutes for the last hour, violating RPO=60s."
+        ),
+        category="data",
+        difficulty="hard",
+        root_cause="replication_saturation_during_backup_window",
+        root_cause_synonyms=(
+            "replication saturation during backup window",
+            "wal shipping backpressure",
+            "replica network saturation",
+        ),
+        clue_keywords=("replication", "lag", "wal", "rpo", "backup"),
+        signals=(
+            "Lag correlates exactly with nightly backup window",
+            "Network egress saturated on primary -> DR link",
+        ),
+        logs={
+            "db-primary": "WAL shipping backpressure; replica slot lagging 6.2m",
+            "backup-job": "Base backup in progress; 4.1 GB/s read rate",
+        },
+        red_herring_logs={
+            "notification-gateway": "steady delivery",
+        },
+        metrics={
+            "dash-replication": "lag_seconds 372 (rpo=60)",
+            "dash-network": "egress_primary_to_dr 9.8 Gbps (cap=10)",
+        },
+        kb={
+            "kb-replication-backup": "Throttle the backup or move it outside hours of peak replication traffic.",
+        },
+        good_handoff="ops_manager_agent",
+        accepted_fix_keywords=(
+            ("throttle", "backup", "rate"),
+            ("shift", "backup", "window"),
+            ("raise", "replication", "bandwidth"),
+        ),
+        required_investigations=3,
+        customer_tier="enterprise",
+        affected_users_estimate=8_900,
+        revenue_impact_usd_per_min=1_400,
+        requires_mitigation=True,
+        postmortem_required=True,
+    )
+def _cache_key_collision() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-H8",
+        title="Cross-tenant data bleed from cache key collision",
+        description=(
+            "A rare cache key collision is briefly returning one enterprise "
+            "tenant's data to another. This is a data-isolation incident."
+        ),
+        category="security",
+        difficulty="hard",
+        root_cause="cache_key_collision_across_tenants",
+        root_cause_synonyms=(
+            "cache key collision across tenants",
+            "shared cache tenant bleed",
+            "tenant id missing from cache key",
+        ),
+        clue_keywords=("cache", "key", "collision", "tenant"),
+        signals=(
+            "Two enterprise tenants report seeing each other's dashboard metadata",
+            "Cache key construction omits tenant-id under a specific code path",
+        ),
+        logs={
+            "api-gateway": "Cache HIT for key=/v2/workspace/42 served to tenant=91",
+            "cache-layer": "Collision detected between tenants 42 and 91 on key prefix /v2/workspace",
+        },
+        red_herring_logs={
+            "email-service": "steady",
+        },
+        metrics={
+            "dash-cache": "collision_count 14 in last 2h",
+            "dash-security": "isolation_violations 2",
+        },
+        kb={
+            "kb-cache-tenant": "Prefix every cache key with tenant_id and enforce via lint check.",
+        },
+        good_handoff="ops_manager_agent",
+        accepted_fix_keywords=(
+            ("prefix", "tenant", "cache"),
+            ("invalidate", "shared", "cache"),
+            ("quarantine", "cache", "segment"),
+        ),
+        required_investigations=3,
+        customer_tier="enterprise",
+        affected_users_estimate=320,
+        revenue_impact_usd_per_min=2_100,
+        requires_mitigation=True,
+        postmortem_required=True,
+    )
+def _cron_dst_double_trigger() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-H9",
+        title="Scheduled jobs fire twice at DST rollover",
+        description=(
+            "Key premium billing jobs executed twice at the daylight-saving "
+            "transition, causing duplicate charges for premium customers."
+        ),
+        category="scheduling",
+        difficulty="hard",
+        root_cause="cron_dst_transition_double_trigger",
+        root_cause_synonyms=(
+            "cron dst transition double trigger",
+            "scheduler timezone ambiguity",
+            "dst fallback replay",
+        ),
+        clue_keywords=("cron", "dst", "timezone", "scheduler"),
+        signals=(
+            "Job history shows two runs at 01:00 and 01:00 local time",
+            "Billing duplicates concentrate on a single geographic region",
+        ),
+        logs={
+            "scheduler": "Fired job billing.nightly at 2026-10-25 01:00 (GMT+1 and GMT+0)",
+            "billing-worker": "Second invocation completed 12 minutes after first",
+        },
+        red_herring_logs={
+            "catalog-api": "steady 2xx",
+        },
+        metrics={
+            "dash-scheduler": "double_fire_count 3 (expected 0)",
+            "dash-billing": "duplicate_charge_rate 2.1%",
+        },
+        kb={
+            "kb-dst-schedule": "Anchor scheduled jobs on UTC and convert to local time at display only.",
+        },
+        good_handoff="investigator_agent",
+        accepted_fix_keywords=(
+            ("anchor", "schedule", "utc"),
+            ("deduplicate", "scheduled", "runs"),
+            ("reconcile", "duplicate", "charges"),
+        ),
+        required_investigations=3,
+        customer_tier="premium",
+        affected_users_estimate=6_400,
+        revenue_impact_usd_per_min=1_100,
+        requires_mitigation=True,
+        postmortem_required=True,
+    )
+def _partial_publish_feed() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-H10",
+        title="Real-time feed gaps during partial publish",
+        description=(
+            "Premium trading-floor customers see gaps in the realtime price "
+            "feed after a publisher restart; some updates never arrived."
+        ),
+        category="realtime",
+        difficulty="hard",
+        root_cause="partial_publish_without_transaction_boundary",
+        root_cause_synonyms=(
+            "partial publish without transaction boundary",
+            "publisher crash mid batch",
+            "realtime feed gap",
+        ),
+        clue_keywords=("publish", "transaction", "feed", "partial"),
+        signals=(
+            "Sequence numbers skip in a bounded window around the publisher restart",
+            "Replay API can fill the gap but live subscribers missed it",
+        ),
+        logs={
+            "price-publisher": "Process restarted mid-batch, seq=88230 not flushed",
+            "realtime-bus": "Detected sequence gap 88230-88236 on channel=prices.us",
+        },
+        red_herring_logs={
+            "auth-service": "steady",
+        },
+        metrics={
+            "dash-realtime": "gap_count 6 in 30s, subscriber_reconcile_lag_s 48",
+        },
+        kb={
+            "kb-publish-txn": "Wrap each batch in a transactional publish so crashes never leave gaps.",
+        },
+        good_handoff="investigator_agent",
+        accepted_fix_keywords=(
+            ("enable", "transactional", "publish"),
+            ("replay", "sequence", "gap"),
+            ("force", "subscriber", "reconcile"),
+        ),
+        required_investigations=3,
+        customer_tier="premium",
+        affected_users_estimate=3_900,
+        revenue_impact_usd_per_min=1_750,
+        requires_mitigation=True,
+        postmortem_required=True,
+    )
+def _ssd_firmware_regression() -> IncidentTemplate:
+    return IncidentTemplate(
+        id="INC-H11",
+        title="Storage checksum failures on upgraded SSD fleet",
+        description=(
+            "Enterprise object storage is returning checksum-mismatch errors "
+            "on a subset of volumes after a firmware roll-forward."
+        ),
+        category="storage",
+        difficulty="hard",
+        root_cause="ssd_firmware_checksum_regression",
+        root_cause_synonyms=(
+            "ssd firmware checksum regression",
+            "storage firmware corruption",
+            "nvme firmware crc bug",
+        ),
+        clue_keywords=("firmware", "ssd", "checksum", "storage"),
+        signals=(
+            "Checksum failures concentrate on volumes upgraded in the last 72 hours",
+            "Vendor advisory mentions similar symptoms after firmware F2.14",
+        ),
+        logs={
+            "storage-agent": "CRC mismatch on volume vol-221 firmware=F2.14",
+            "fleet-manager": "Upgrade batch included F2.14 for 18 volumes",
+        },
+        red_herring_logs={
+            "email-service": "steady",
+        },
+        metrics={
+            "dash-storage": "checksum_error_rate 0.8%",
+            "dash-fleet": "volumes_on_F2.14 18, volumes_healthy 402",
+        },
+        kb={
+            "kb-ssd-firmware": "Quarantine affected firmware and roll back to the last known-good version.",
+        },
+        good_handoff="ops_manager_agent",
+        accepted_fix_keywords=(
+            ("rollback", "ssd", "firmware"),
+            ("quarantine", "affected", "volumes"),
+            ("reseed", "checksum", "index"),
+        ),
+        required_investigations=3,
+        customer_tier="enterprise",
+        affected_users_estimate=1_800,
+        revenue_impact_usd_per_min=1_950,
+        requires_mitigation=True,
+        postmortem_required=True,
+    )
 def build_incident_library() -> IncidentLibrary:
+    """Return the built-in enterprise incident library (30 templates)."""
     return IncidentLibrary(
         templates_by_task={
+            "easy": [
+                _redis_pool(),
+                _jwt_clock_skew(),
+                _email_spam_false_positive(),
+                _dns_ttl_stale(),
+                _cdn_purge_scope(),
+                _autocomplete_stale(),
+                _webhook_retry_budget(),
+                _thumbnail_worker_oom(),
+            ],
             "medium": [
                 _cache_invalidation_lag(),
                 _tz_normalization(),
                 _invoice_idempotency(),
                 _tls_expiry(),
                 _feature_flag_rollout(),
+                _recommender_heap_leak(),
+                _consumer_group_rebalance(),
+                _config_push_skipped_canary(),
+                _health_check_flapping(),
+                _payment_webhook_dedupe(),
+                _origin_shield_bypass(),
             ],
             "hard": [
                 _promo_rate_cascade(),
                 _alert_storm(),
                 _inventory_race(),
                 _deadlock_database(),
+                _gpu_memory_fragmentation(),
+                _replication_saturation(),
+                _cache_key_collision(),
+                _cron_dst_double_trigger(),
+                _partial_publish_feed(),
+                _ssd_firmware_regression(),
             ],
         }
     )
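
With three hand-maintained tier lists, it is easy for a factory to land in the wrong tier or for an ID to be reused. The following is a minimal sanity-check sketch, assuming each template exposes `id`, `difficulty`, `good_handoff`, and `accepted_fix_keywords` as in the factory calls above. `TemplateStub` and `check_library` are hypothetical names used only for illustration and are not part of the repository.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Role names as they appear in the good_handoff fields of this diff.
KNOWN_ROLES = {"triage_agent", "investigator_agent", "ops_manager_agent"}


@dataclass(frozen=True)
class TemplateStub:
    """Hypothetical stand-in carrying only the fields this check reads."""
    id: str
    difficulty: str
    good_handoff: str
    accepted_fix_keywords: Tuple[Tuple[str, ...], ...] = ()


def check_library(templates_by_task: Dict[str, List[TemplateStub]]) -> Dict[str, int]:
    """Return the template count per tier, raising ValueError on inconsistencies."""
    seen = Counter(t.id for group in templates_by_task.values() for t in group)
    duplicates = sorted(i for i, n in seen.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate incident ids: {duplicates}")
    for tier, group in templates_by_task.items():
        for t in group:
            # Assumption: the tier key always equals the template's difficulty.
            if t.difficulty != tier:
                raise ValueError(f"{t.id} is listed under {tier!r} but declares {t.difficulty!r}")
            if t.good_handoff not in KNOWN_ROLES:
                raise ValueError(f"{t.id} hands off to unknown role {t.good_handoff!r}")
            if not t.accepted_fix_keywords:
                raise ValueError(f"{t.id} has no accepted fix keyword groups")
    return {tier: len(group) for tier, group in templates_by_task.items()}


sample = {
    "easy": [TemplateStub("INC-E6", "easy", "ops_manager_agent", (("force", "index", "rebuild"),))],
    "hard": [TemplateStub("INC-H9", "hard", "investigator_agent", (("anchor", "schedule", "utc"),))],
}
counts = check_library(sample)
print(counts, sum(counts.values()))
```

Counting the lists as they appear in this hunk gives 8 easy, 11 medium, and 10 hard entries. That total is worth comparing with the number stated in the `build_incident_library` docstring.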

server/llm_remote.py ADDED

	@@ -0,0 +1,193 @@

+"""Thin client for calling a remote LLM from the FastAPI server.
+Used by the dashboard's "live inference" panel so a Hugging Face Space can
+delegate the expensive forward pass to a dedicated HF Inference Endpoint
+(GPU-backed) without loading the model inside the Space container.
+Two backends are supported:
+- ``chat`` (default) — OpenAI-compatible ``/v1/chat/completions`` endpoint.
+  Hugging Face TGI-based Inference Endpoints expose this path, as do most
+  vLLM deployments. This is the recommended setup.
+- ``generate`` — Raw TGI ``/generate`` endpoint. Useful when chat templating
+  is already baked into the prompt and you just want raw text completion.
+Configuration via environment variables (set them as HF Space secrets):
+- ``LLM_ENDPOINT_URL``  — **required** to enable the panel. E.g.
+  ``https://abc.us-east-1.aws.endpoints.huggingface.cloud``. Without this
+  env var, ``is_configured()`` returns ``False`` and the dashboard shows a
+  setup hint instead of the demo.
+- ``HF_TOKEN``          — **required**. A Hugging Face token with ``read``
+  scope over the model repo powering the endpoint.
+- ``LLM_ENDPOINT_MODE`` — optional, one of ``chat`` / ``generate``
+  (default: ``chat``).
+- ``LLM_MODEL_ID``      — optional display / routing hint the endpoint
+  sometimes cares about (default: ``"tgi"``).
+- ``LLM_MAX_NEW_TOKENS``— optional integer (default: ``160``).
+- ``LLM_TIMEOUT_S``     — optional integer (default: ``25``).
+The module uses only the Python stdlib (``urllib.request``) so it adds
+zero extra dependencies to the HF Space Docker image.
+"""
+from __future__ import annotations
+import json
+import logging
+import os
+import socket
+import urllib.error
+import urllib.request
+from dataclasses import dataclass
+from typing import Any, Dict, Optional
+_LOG = logging.getLogger("icc.llm_remote")
+@dataclass(frozen=True)
+class RemoteLLMConfig:
+    endpoint_url: str
+    token: str
+    mode: str = "chat"          # "chat" | "generate"
+    model_id: str = "tgi"
+    max_new_tokens: int = 160
+    timeout_s: int = 25
+    @classmethod
+    def from_env(cls) -> Optional["RemoteLLMConfig"]:
+        url = os.environ.get("LLM_ENDPOINT_URL", "").strip()
+        token = os.environ.get("HF_TOKEN", "").strip()
+        if not url or not token:
+            return None
+        return cls(
+            endpoint_url=url.rstrip("/"),
+            token=token,
+            mode=os.environ.get("LLM_ENDPOINT_MODE", "chat").strip().lower() or "chat",
+            model_id=os.environ.get("LLM_MODEL_ID", "tgi").strip() or "tgi",
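+            # Fall back to the default when the variable is set but empty,
+            # matching the ``or`` pattern used for mode and model_id above.
+            max_new_tokens=int(os.environ.get("LLM_MAX_NEW_TOKENS", "").strip() or "160"),
+            timeout_s=int(os.environ.get("LLM_TIMEOUT_S", "").strip() or "25"),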
+        )
+def is_configured() -> bool:
+    """Return True iff env vars required for remote inference are set."""
+    return RemoteLLMConfig.from_env() is not None
+def status_summary() -> Dict[str, Any]:
+    """Lightweight status object for the dashboard to surface."""
+    cfg = RemoteLLMConfig.from_env()
+    if cfg is None:
+        return {
+            "configured": False,
+            "reason": (
+                "Set LLM_ENDPOINT_URL and HF_TOKEN as Space secrets to enable "
+                "the live inference panel."
+            ),
+        }
+    return {
+        "configured": True,
+        "mode": cfg.mode,
+        "model_id": cfg.model_id,
+        "max_new_tokens": cfg.max_new_tokens,
+        # Never surface the token; just confirm it is present.
+        "token_present": bool(cfg.token),
+        # Only expose the host (not the full URL, in case a query-string key
+        # ever leaks into env by accident).
+        "host": _safe_host(cfg.endpoint_url),
+    }
+# ---------------------------------------------------------------------------
+# Internals
+# ---------------------------------------------------------------------------
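+def _safe_host(url: str) -> str:
+    # Local import keeps this helper self-contained; still stdlib only.
+    from urllib.parse import urlsplit
+    try:
+        # ``hostname`` drops userinfo, port, path, and any query string.
+        return urlsplit(url).hostname or "(unknown)"
+    except ValueError:
+        return "(unknown)"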
+def _http_post(url: str, headers: Dict[str, str], body: bytes, timeout_s: int) -> str:
+    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
+    try:
+        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
+            return resp.read().decode("utf-8", errors="replace")
+    except urllib.error.HTTPError as exc:
+        raise RuntimeError(
+            f"LLM endpoint returned HTTP {exc.code}: {exc.read().decode('utf-8', errors='replace')[:400]}"
+        ) from exc
+    except (urllib.error.URLError, socket.timeout, TimeoutError) as exc:
+        raise RuntimeError(f"LLM endpoint unreachable: {exc}") from exc
+def _call_chat(cfg: RemoteLLMConfig, prompt: str) -> str:
+    url = f"{cfg.endpoint_url}/v1/chat/completions"
+    payload = {
+        "model": cfg.model_id,
+        "messages": [{"role": "user", "content": prompt}],
+        "temperature": 0.0,
+        "max_tokens": cfg.max_new_tokens,
+        "stream": False,
+    }
+    headers = {
+        "Content-Type": "application/json",
+        "Authorization": f"Bearer {cfg.token}",
+    }
+    raw = _http_post(url, headers, json.dumps(payload).encode("utf-8"), cfg.timeout_s)
+    try:
+        data = json.loads(raw)
+    except json.JSONDecodeError as exc:
+        raise RuntimeError(f"LLM endpoint returned non-JSON: {raw[:400]}") from exc
+    try:
+        return data["choices"][0]["message"]["content"]
+    except (KeyError, IndexError, TypeError) as exc:
+        raise RuntimeError(f"Unexpected chat response shape: {raw[:400]}") from exc
+def _call_generate(cfg: RemoteLLMConfig, prompt: str) -> str:
+    url = f"{cfg.endpoint_url}/generate"
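+    payload = {
+        "inputs": prompt,
+        "parameters": {
+            "max_new_tokens": cfg.max_new_tokens,
+            # Greedy decoding; ``temperature`` is omitted because TGI rejects
+            # non-positive values and it is unused when ``do_sample`` is False.
+            "do_sample": False,
+            "return_full_text": False,
+        },
+    }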
+    headers = {
+        "Content-Type": "application/json",
+        "Authorization": f"Bearer {cfg.token}",
+    }
+    raw = _http_post(url, headers, json.dumps(payload).encode("utf-8"), cfg.timeout_s)
+    try:
+        data = json.loads(raw)
+    except json.JSONDecodeError as exc:
+        raise RuntimeError(f"LLM endpoint returned non-JSON: {raw[:400]}") from exc
+    # TGI returns either {"generated_text": "..."} or a list of such objects.
+    if isinstance(data, list) and data:
+        data = data[0]
+    if isinstance(data, dict) and "generated_text" in data:
+        return str(data["generated_text"])
+    raise RuntimeError(f"Unexpected /generate response shape: {raw[:400]}")
+def generate(prompt: str) -> str:
+    """Send ``prompt`` to the configured remote endpoint and return raw text.
+    Raises RuntimeError with a human-readable message on any failure so the
+    caller (the FastAPI demo endpoint) can surface it in the dashboard.
+    """
+    cfg = RemoteLLMConfig.from_env()
+    if cfg is None:
+        raise RuntimeError(
+            "Remote LLM not configured. Set LLM_ENDPOINT_URL and HF_TOKEN."
+        )
+    _LOG.info("Calling remote LLM %s mode=%s", _safe_host(cfg.endpoint_url), cfg.mode)
+    if cfg.mode == "generate":
+        return _call_generate(cfg, prompt)
+    return _call_chat(cfg, prompt)
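The two response shapes handled above can be exercised without a live endpoint. The following is a minimal sketch, not part of the commit: `extract_text` is a hypothetical helper that mirrors the parsing in `_call_chat` and `_call_generate`, and the sample payloads are made up for illustration.

```python
import json
from typing import Any


def extract_text(raw: str, mode: str = "chat") -> str:
    """Hypothetical helper mirroring the module's response parsing."""
    try:
        data: Any = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"LLM endpoint returned non-JSON: {raw[:400]}") from exc

    if mode == "generate":
        # /generate may answer with a single object or a list of objects.
        if isinstance(data, list) and data:
            data = data[0]
        if isinstance(data, dict) and "generated_text" in data:
            return str(data["generated_text"])
        raise RuntimeError(f"Unexpected /generate response shape: {raw[:400]}")

    # Chat mode: OpenAI-style body with a ``choices`` list.
    try:
        return data["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise RuntimeError(f"Unexpected chat response shape: {raw[:400]}") from exc


# Made-up sample payloads, one per mode.
chat_raw = json.dumps(
    {"choices": [{"message": {"role": "assistant", "content": "restart api-gateway"}}]}
)
gen_raw = json.dumps([{"generated_text": "roll back deploy"}])

print(extract_text(chat_raw, "chat"))
print(extract_text(gen_raw, "generate"))
```

Any malformed body surfaces as a `RuntimeError`, so a caller such as the dashboard endpoint only needs to catch one exception type.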