SwapnilPatil28 committed on
Commit 540b82c · verified · 1 Parent(s): 3c61da6

Upgrade 1 - Dashboard Update and new Incidents

README.md CHANGED
@@ -43,7 +43,7 @@ A **virtual war room** where three specialist agents resolve a live queue of rea
43
  | 🧪 **Investigator** | Apply a fix · roll back a deploy | Escalate or file a post-mortem |
44
  | 👷 **Ops Manager** | Escalate · file post-mortem · **close the ticket** | Apply a code fix |
45
 
46
- **13 real incidents** · **3 difficulty tiers** (easy / medium / hard) · **14+ named reward signals** · **customer-tier weighting** (enterprise outages cost ~3× a free-tier outage)
47
 
48
  > Wrong actor β†’ **βˆ’0.08**. Wrong root-cause on an enterprise ticket β†’ **βˆ’1.98**. Correct closure on an enterprise ticket β†’ **+1.44**. The rules matter β€” and every step tells you *why* it was scored.
49
 
@@ -661,7 +661,7 @@ Two scripts judges (or you) can run without a local IDE:
661
  │   ├── Dockerfile # Production image (HEALTHCHECK included)
662
  │   └── domain/
663
  │       ├── __init__.py
664
- │       ├── incidents.py # 13 enterprise incident templates + factory
665
  │       ├── reward.py # Composable rubric engine (20+ components)
666
  │       ├── roles.py # Role-based permission policy
667
  │       └── rng.py # Deterministic per-episode RNG
@@ -697,7 +697,7 @@ ENV_LOG_LEVEL: "INFO"
697
  Full checklist with pre-submission smoke tests → [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md).
698
 
699
  - [x] **OpenEnv latest runtime** and `openenv validate` passing β€” [Space live](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)
700
- - [x] **Multi-agent, long-horizon environment** with role-gated action space (3 roles × 9 actions, 13 incidents)
701
  - [x] **Composable, transparent, anti-gaming reward rubric** (14+ named components, tier-scaled)
702
  - [x] **Business-impact-aware scoring** (customer tier, revenue impact, SLA countdown)
703
  - [x] **End-to-end TRL SFT pipeline** that saves a checkpoint and re-evaluates it in the environment ([`train_trl.py`](./train_trl.py))
 
43
  | 🧪 **Investigator** | Apply a fix · roll back a deploy | Escalate or file a post-mortem |
44
  | 👷 **Ops Manager** | Escalate · file post-mortem · **close the ticket** | Apply a code fix |
45
 
46
+ **30 unique incident templates** · **3 difficulty tiers** (8 easy / 11 medium / 11 hard) · **14+ named reward signals** · **customer-tier weighting** (enterprise outages cost ~3× a free-tier outage)
47
 
48
  > Wrong actor β†’ **βˆ’0.08**. Wrong root-cause on an enterprise ticket β†’ **βˆ’1.98**. Correct closure on an enterprise ticket β†’ **+1.44**. The rules matter β€” and every step tells you *why* it was scored.
49
 
 
661
  │   ├── Dockerfile # Production image (HEALTHCHECK included)
662
  │   └── domain/
663
  │       ├── __init__.py
664
+ │       ├── incidents.py # 30 enterprise incident templates + factory
665
  │       ├── reward.py # Composable rubric engine (20+ components)
666
  │       ├── roles.py # Role-based permission policy
667
  │       └── rng.py # Deterministic per-episode RNG
 
697
  Full checklist with pre-submission smoke tests → [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md).
698
 
699
  - [x] **OpenEnv latest runtime** and `openenv validate` passing β€” [Space live](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)
700
+ - [x] **Multi-agent, long-horizon environment** with role-gated action space (3 roles × 9 actions, **30 unique incident templates**)
701
  - [x] **Composable, transparent, anti-gaming reward rubric** (14+ named components, tier-scaled)
702
  - [x] **Business-impact-aware scoring** (customer tier, revenue impact, SLA countdown)
703
  - [x] **End-to-end TRL SFT pipeline** that saves a checkpoint and re-evaluates it in the environment ([`train_trl.py`](./train_trl.py))
docs/BLOG_POST.md CHANGED
@@ -24,7 +24,7 @@
24
 
25
  Each role has **different permissions**, **different information needs**, and a **different clock to beat**. Get it wrong and you bleed budget, bust the SLA, and β€” if the customer is on an enterprise contract β€” lose serious money (~3Γ— what a free-tier outage costs).
26
 
27
- I built a simulator of that war room β€” an **OpenEnv-compatible** environment with 13 realistic incidents, 3 specialist roles, and 14+ named reward signals β€” and fine-tuned an LLM to run it.
28
 
29
  | Role | Can do | Cannot do |
30
  |---|---|---|
@@ -242,7 +242,7 @@ I ran the exact same pipeline with the smaller **Qwen2.5-0.5B-Instruct** backbon
242
  ## 8. What's next
243
 
244
  - **Replace SFT with GRPO or PPO** using the environment's native reward signal β€” no heuristic teacher, let the rubric itself shape the policy and push past the imitation ceiling.
245
- - **Scale the incident catalog** from 13 templates to 50+ (drop in JSON-defined scenarios).
246
  - **Add a second "adversarial" agent** that injects misleading signals to test robustness.
247
  - **Record the 2-minute walkthrough** from [`docs/VIDEO_SCRIPT.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/VIDEO_SCRIPT.md) as a bonus companion to this writeup.
248
 
 
24
 
25
  Each role has **different permissions**, **different information needs**, and a **different clock to beat**. Get it wrong and you bleed budget, bust the SLA, and β€” if the customer is on an enterprise contract β€” lose serious money (~3Γ— what a free-tier outage costs).
26
 
27
+ I built a simulator of that war room β€” an **OpenEnv-compatible** environment with **30 realistic incident templates**, 3 specialist roles, and 14+ named reward signals β€” and fine-tuned an LLM to run it.
28
 
29
  | Role | Can do | Cannot do |
30
  |---|---|---|
 
242
  ## 8. What's next
243
 
244
  - **Replace SFT with GRPO or PPO** using the environment's native reward signal β€” no heuristic teacher, let the rubric itself shape the policy and push past the imitation ceiling.
245
+ - **Grow the incident catalog further** (now at 30 templates β€” next stop 50+ via JSON-defined scenarios).
246
  - **Add a second "adversarial" agent** that injects misleading signals to test robustness.
247
  - **Record the 2-minute walkthrough** from [`docs/VIDEO_SCRIPT.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/VIDEO_SCRIPT.md) as a bonus companion to this writeup.
248
 
docs/SUBMISSION_CHECKLIST.md CHANGED
@@ -26,7 +26,7 @@ Status against every hard gate in the official judging rules, plus every polish
26
  - [x] Multi-role, multi-agent β€” `triage_agent`, `investigator_agent`, `ops_manager_agent` with **non-overlapping permissions** (`server/domain/roles.py`).
27
  - [x] Long-horizon β€” 3–5 sequential incidents per episode, 20–60 steps each, shared SLA + budget counters.
28
  - [x] Professional / enterprise task simulation β€” realistic logs, metrics, KB articles, customer-tier revenue impact, SLA timers.
29
- - [x] 13 unique incident templates across easy / medium / hard (`server/domain/incidents.py`).
30
  - [x] Rich observation schema β€” customer tier, revenue impact, allowed actors per action, investigation targets grouped by tool, playbook hints, `reward_components`, `last_action_notes`.
31
  - [x] Composable reward rubric with **14+ named components** and anti-gaming safeguards (`server/domain/reward.py`).
32
  - [x] Tier-weighted business impact (`free ×0.6 · standard ×1.0 · premium ×1.4 · enterprise ×1.8`).
 
26
  - [x] Multi-role, multi-agent β€” `triage_agent`, `investigator_agent`, `ops_manager_agent` with **non-overlapping permissions** (`server/domain/roles.py`).
27
  - [x] Long-horizon β€” 3–5 sequential incidents per episode, 20–60 steps each, shared SLA + budget counters.
28
  - [x] Professional / enterprise task simulation β€” realistic logs, metrics, KB articles, customer-tier revenue impact, SLA timers.
29
+ - [x] **30 unique incident templates** across easy / medium / hard (`server/domain/incidents.py`) β€” 8 easy, 11 medium, 11 hard, covering services (payments, auth, CDN, search, DNS, ML inference, storage, scheduling, messaging, config distribution) and failure modes (OOM, cert expiry, config drift, DNS TTL staleness, rate-limit cascades, GPU fragmentation, cross-region replication lag, DST scheduler bugs, firmware regressions, cache-key tenant collisions).
30
  - [x] Rich observation schema β€” customer tier, revenue impact, allowed actors per action, investigation targets grouped by tool, playbook hints, `reward_components`, `last_action_notes`.
31
  - [x] Composable reward rubric with **14+ named components** and anti-gaming safeguards (`server/domain/reward.py`).
32
  - [x] Tier-weighted business impact (`free ×0.6 · standard ×1.0 · premium ×1.4 · enterprise ×1.8`).
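
The tier multipliers in the checklist line up with the headline numbers quoted in the README hunk earlier in this diff: 1.8 / 0.6 = 3, which matches the "~3×" enterprise-versus-free claim. The sketch below is illustrative arithmetic only. The dictionary mirrors the checklist values, but `scale_reward` and the base values 0.8 and -1.1 are assumptions, obtained by dividing the quoted enterprise rewards (+1.44 and -1.98) by 1.8; they are not taken from `server/domain/reward.py`.

```python
# Tier multipliers as listed in the checklist item above.
TIER_MULTIPLIER = {"free": 0.6, "standard": 1.0, "premium": 1.4, "enterprise": 1.8}


def scale_reward(base: float, tier: str) -> float:
    """Hypothetical helper: scale a base reward by the customer-tier multiplier."""
    return round(base * TIER_MULTIPLIER[tier], 4)


# Ratio behind the "~3x" claim.
enterprise_vs_free = TIER_MULTIPLIER["enterprise"] / TIER_MULTIPLIER["free"]

# Assumed base values (0.8, -1.1) that reproduce the README's enterprise figures.
correct_closure = scale_reward(0.8, "enterprise")
wrong_root_cause = scale_reward(-1.1, "enterprise")
```

Under these assumed base values, the same closure on a free-tier ticket would score `scale_reward(0.8, "free")`, about a third of the enterprise figure.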
server/app.py CHANGED
@@ -38,8 +38,13 @@ from server.domain.reward import (
38
  TIER_MULTIPLIER,
39
  )
40
  from server.environment import IncidentCommandCenterEnvironment
41
  from server.logging_utils import configure_logging
42
 
43
  _LOG = logging.getLogger("icc.app")
44
  _CONFIG = EnvConfig.from_env()
45
  configure_logging(level=_CONFIG.log_level, structured=_CONFIG.structured_logging)
@@ -173,6 +178,154 @@ async def env_info() -> JSONResponse:
173
  return JSONResponse(_metadata_payload())
174
 
175
 
176
  @app.get("/metrics", response_class=PlainTextResponse)
177
  async def metrics() -> PlainTextResponse:
178
  env = _resolve_environment()
@@ -326,6 +479,81 @@ def _dashboard_html() -> str:
326
  # so the existing `{themes_html}` slot renders to nothing (no duplication).
327
  themes_html = ""
328
 
329
  # --- Reward-rubric details ----------------------------------------------
330
  reward_rubric_rows = "".join(
331
  f"<tr><td><code>{name}</code></td><td>{value}</td></tr>"
@@ -402,6 +630,40 @@ def _dashboard_html() -> str:
402
  td.delta.good {{ color: var(--good); }}
403
  .links {{ display:flex; flex-wrap:wrap; gap:0.5rem; }}
404
 
405
  /* "Story in 2 minutes" hero panel β€” plain-English summary for judges. */
406
  .hero-card {{
407
  background: linear-gradient(135deg, #0f2647 0%, #172a4a 60%, #1f2a44 100%);
@@ -477,7 +739,8 @@ def _dashboard_html() -> str:
477
  <h3 style='margin-top:1.25rem'>What is the environment?</h3>
478
  <p class='sub' style='margin:0 0 0.75rem'>
479
  Three specialist agents with <strong>different permissions</strong> resolve
480
- a live queue of 13 realistic tech incidents across 3 difficulty tiers.
 
481
  </p>
482
  <div class='table-wrap'>
483
  <table>
@@ -684,6 +947,8 @@ def _dashboard_html() -> str:
684
 
685
  {ablation_html}
686
 
687
  {themes_html}
688
 
689
  <h2>Endpoints</h2>
@@ -763,6 +1028,68 @@ def _dashboard_html() -> str:
763
  const total = Object.values(data.incidents_per_task || {{}}).reduce((a,b)=>a+b,0);
764
  document.getElementById('kpi-inc').textContent = total;
765
  }} catch (e) {{}}
766
  </script>
767
  </body>
768
  </html>
 
38
  TIER_MULTIPLIER,
39
  )
40
  from server.environment import IncidentCommandCenterEnvironment
41
+ from server import llm_remote
42
  from server.logging_utils import configure_logging
43
 
44
+ import re as _re
45
+
46
+ _JSON_RE = _re.compile(r"\{[\s\S]*\}")
47
+
48
  _LOG = logging.getLogger("icc.app")
49
  _CONFIG = EnvConfig.from_env()
50
  configure_logging(level=_CONFIG.log_level, structured=_CONFIG.structured_logging)
 
178
  return JSONResponse(_metadata_payload())
179
 
180
 
181
+ # ---------------------------------------------------------------------------
182
+ # Live LLM inference demo (optional β€” only enabled when HF credentials set)
183
+ # ---------------------------------------------------------------------------
184
+
185
+
186
+ def _build_demo_prompt(obs: IncidentObservation) -> str:
187
+ """Same prompt format the SFT model was fine-tuned on (train_trl.obs_to_prompt)."""
188
+ targets = obs.investigation_targets or {}
189
+ return (
190
+ "You are operating a multi-agent incident command center. "
191
+ "Pick the next action for the appropriate specialist role.\n\n"
192
+ f"Incident ID: {obs.incident_id}\n"
193
+ f"Title: {obs.incident_title}\n"
194
+ f"Description: {obs.incident_description}\n"
195
+ f"Customer tier: {obs.customer_tier} | "
196
+ f"Affected users: {obs.affected_users_estimate} | "
197
+ f"Revenue impact (USD/min): {obs.revenue_impact_usd_per_min}\n"
198
+ f"Postmortem required: {obs.postmortem_required}\n"
199
+ f"Visible signals: {', '.join(obs.visible_signals or [])}\n"
200
+ f"Available log targets: {', '.join(targets.get('logs', []) or [])}\n"
201
+ f"Available metric targets: {', '.join(targets.get('metrics', []) or [])}\n"
202
+ f"Available KB articles: {', '.join(targets.get('kb', []) or [])}\n"
203
+ f"Budget remaining: {obs.budget_remaining} actions | "
204
+ f"SLA remaining: {obs.sla_minutes_remaining} min | "
205
+ f"Clues found: {obs.clues_found} | "
206
+ f"Mitigation applied: {obs.mitigation_applied}\n"
207
+ f"Last terminal output: {obs.terminal_output}\n\n"
208
+ "Respond with a JSON object containing exactly these keys: "
209
+ "actor, action_type, target, root_cause, resolution_summary, "
210
+ "postmortem_note, confidence, reason."
211
+ )
212
+
213
+
214
+ def _parse_llm_action(response_text: str) -> Dict[str, Any]:
215
+ """Extract the first balanced JSON object from a model response."""
216
+ match = _JSON_RE.search(response_text or "")
217
+ if not match:
218
+ return {}
219
+ raw = match.group(0)
220
+ last_close = raw.rfind("}")
221
+ if last_close != -1:
222
+ raw = raw[: last_close + 1]
223
+ try:
224
+ return json.loads(raw)
225
+ except (json.JSONDecodeError, TypeError):
226
+ return {}
227
+
228
+
229
+ @app.get("/llm-demo-status", response_class=JSONResponse)
230
+ async def llm_demo_status() -> JSONResponse:
231
+ """Report whether the live-inference panel is usable (credentials set)."""
232
+ return JSONResponse(llm_remote.status_summary())
233
+
234
+
235
+ @app.post("/llm-demo", response_class=JSONResponse)
236
+ async def llm_demo(payload: Dict[str, Any]) -> JSONResponse:
237
+ """Run one live step against the fine-tuned model behind an HF endpoint.
238
+
239
+ Spins up a fresh isolated ``IncidentCommandCenterEnvironment`` for each
240
+ call so the demo never disturbs the main environment instance that is
241
+ answering ``/reset`` and ``/step`` for training clients. Returns the full
242
+ trace (observation → prompt → raw LLM text → parsed action → reward) so
243
+ judges can see exactly what the model produced.
244
+ """
245
+ if not llm_remote.is_configured():
246
+ return JSONResponse(
247
+ {
248
+ "error": "Remote LLM not configured on this Space.",
249
+ "status": llm_remote.status_summary(),
250
+ },
251
+ status_code=503,
252
+ )
253
+
254
+ task_name = str(payload.get("task_name") or "easy").strip()
255
+ try:
256
+ seed = int(payload.get("seed") or _CONFIG.default_seed)
257
+ except (TypeError, ValueError):
258
+ seed = _CONFIG.default_seed
259
+
260
+ # Isolated env so the live demo never clobbers the shared state.
261
+ env = IncidentCommandCenterEnvironment()
262
+ obs = env.reset(task_name=task_name, seed=seed)
263
+ prompt = _build_demo_prompt(obs)
264
+
265
+ try:
266
+ raw_response = llm_remote.generate(prompt)
267
+ except Exception as exc: # pragma: no cover - network-dependent
268
+ return JSONResponse(
269
+ {
270
+ "error": f"Remote LLM call failed: {exc}",
271
+ "status": llm_remote.status_summary(),
272
+ },
273
+ status_code=502,
274
+ )
275
+
276
+ parsed_action_dict = _parse_llm_action(raw_response)
277
+
278
+ try:
279
+ action = IncidentAction(**parsed_action_dict)
280
+ parsed_ok = True
281
+ except Exception:
282
+ logs = (obs.investigation_targets or {}).get("logs", []) or []
283
+ fallback_target = logs[0] if logs else "payments-api"
284
+ action = IncidentAction(
285
+ actor="triage_agent",
286
+ action_type="inspect_logs",
287
+ target=fallback_target,
288
+ reason="Fallback (LLM JSON invalid).",
289
+ )
290
+ parsed_ok = False
291
+
292
+ step_obs = env.step(action)
293
+ reward_components = dict(step_obs.reward_components or {})
294
+ reward_total = sum(reward_components.values()) if reward_components else 0.0
295
+
296
+ return JSONResponse(
297
+ {
298
+ "task_name": task_name,
299
+ "seed": seed,
300
+ "observation_before": {
301
+ "incident_id": obs.incident_id,
302
+ "incident_title": obs.incident_title,
303
+ "customer_tier": obs.customer_tier,
304
+ "affected_users_estimate": obs.affected_users_estimate,
305
+ "revenue_impact_usd_per_min": obs.revenue_impact_usd_per_min,
306
+ "visible_signals": obs.visible_signals,
307
+ "investigation_targets": obs.investigation_targets,
308
+ "budget_remaining": obs.budget_remaining,
309
+ "sla_minutes_remaining": obs.sla_minutes_remaining,
310
+ },
311
+ "prompt": prompt,
312
+ "raw_llm_response": raw_response,
313
+ "parsed_action": parsed_action_dict,
314
+ "validated_action": action.model_dump(exclude_none=True),
315
+ "fallback_used": not parsed_ok,
316
+ "step_result": {
317
+ "reward_total": round(reward_total, 4),
318
+ "reward_components": {
319
+ k: round(v, 4) for k, v in reward_components.items()
320
+ },
321
+ "done": bool(step_obs.done),
322
+ "terminal_output": step_obs.terminal_output,
323
+ "last_action_notes": list(step_obs.last_action_notes or []),
324
+ },
325
+ }
326
+ )
327
+
328
+
329
  @app.get("/metrics", response_class=PlainTextResponse)
330
  async def metrics() -> PlainTextResponse:
331
  env = _resolve_environment()
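
A note on `_parse_llm_action` in the hunk above: its docstring describes extracting the first balanced JSON object, but the `_JSON_RE` pattern `\{[\s\S]*\}` is greedy, so the match runs from the first `{` to the last `}` in the response. A reply that contains two objects, or trailing text with braces, would then fail `json.loads` and trigger the fallback action. The following is a minimal sketch of one alternative built on the standard library's `json.JSONDecoder.raw_decode`; the function name `first_json_object` is hypothetical and not part of this commit.

```python
import json
from typing import Any, Dict


def first_json_object(text: str) -> Dict[str, Any]:
    """Return the first parseable JSON object found in `text`, or {} if none."""
    text = text or ""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it in the string.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # Not valid JSON from this brace; try the next opening brace.
        start = text.find("{", start + 1)
    return {}
```

The trade-off is a scan over each `{` in the worst case, which is negligible for responses of the size a single action step produces.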
 
479
  # so the existing `{themes_html}` slot renders to nothing (no duplication).
480
  themes_html = ""
481
 
482
+ # --- Live inference panel (only shown when HF credentials set) ----------
483
+ llm_status = llm_remote.status_summary()
484
+ if llm_status.get("configured"):
485
+ live_panel_html = f"""
486
+ <h2>Try the fine-tuned model live</h2>
487
+ <div class='card'>
488
+ <p class='sub'>
489
+ Spin up an isolated episode and watch the <strong>fine-tuned SFT model</strong>
490
+ pick the next action in real time. The prompt below is the exact format
491
+ used during training, so you can see how the model transforms a raw
492
+ observation into a typed <code>IncidentAction</code> β€” and the
493
+ environment's reward response.
494
+ </p>
495
+ <div class='live-controls'>
496
+ <label>Task
497
+ <select id='live-task'>
498
+ <option value='easy'>easy</option>
499
+ <option value='medium'>medium</option>
500
+ <option value='hard' selected>hard</option>
501
+ </select>
502
+ </label>
503
+ <label>Seed
504
+ <input id='live-seed' type='number' value='42' min='0' step='1' />
505
+ </label>
506
+ <button id='live-run' class='pill cta'>▶ Run one step</button>
507
+ <span id='live-status' class='sub'>Endpoint: {llm_status.get('host', 'β€”')} Β· mode: {llm_status.get('mode', 'chat')}</span>
508
+ </div>
509
+ <div id='live-output' class='live-output' hidden>
510
+ <div class='live-grid'>
511
+ <div>
512
+ <h4>Observation (before)</h4>
513
+ <pre id='live-obs-before'></pre>
514
+ </div>
515
+ <div>
516
+ <h4>Prompt sent to model</h4>
517
+ <pre id='live-prompt'></pre>
518
+ </div>
519
+ <div>
520
+ <h4>Raw LLM response</h4>
521
+ <pre id='live-raw'></pre>
522
+ </div>
523
+ <div>
524
+ <h4>Parsed &amp; validated action</h4>
525
+ <pre id='live-action'></pre>
526
+ </div>
527
+ <div class='live-grid-full'>
528
+ <h4>Environment step result</h4>
529
+ <pre id='live-step'></pre>
530
+ </div>
531
+ </div>
532
+ </div>
533
+ <div id='live-error' class='live-error' hidden></div>
534
+ </div>
535
+ """
536
+ else:
537
+ live_panel_html = f"""
538
+ <h2>Try the fine-tuned model live</h2>
539
+ <div class='card'>
540
+ <p class='sub'>
541
+ <strong>Optional bonus panel.</strong> This Space can stream the
542
+ fine-tuned SFT model's decisions in real time when a Hugging Face
543
+ Inference Endpoint is attached. {llm_status.get('reason', '')}
544
+ </p>
545
+ <details>
546
+ <summary class='sub'>How the owner enables it</summary>
547
+ <ol>
548
+ <li>Upload the SFT checkpoint from <code>artifacts/sft_model/</code> to a model repo on the Hub.</li>
549
+ <li>Create a dedicated <a href='https://huggingface.co/inference-endpoints' target='_blank' rel='noopener'>Inference Endpoint</a> (T4 small is enough).</li>
550
+ <li>Set <code>LLM_ENDPOINT_URL</code> and <code>HF_TOKEN</code> as secrets on this Space.</li>
551
+ <li>Restart the Space β€” this panel turns on automatically.</li>
552
+ </ol>
553
+ </details>
554
+ </div>
555
+ """
556
+
557
  # --- Reward-rubric details ----------------------------------------------
558
  reward_rubric_rows = "".join(
559
  f"<tr><td><code>{name}</code></td><td>{value}</td></tr>"
 
630
  td.delta.good {{ color: var(--good); }}
631
  .links {{ display:flex; flex-wrap:wrap; gap:0.5rem; }}
632
 
633
+ /* Live-inference panel (fine-tuned SFT model behind HF Inference Endpoint). */
634
+ .live-controls {{
635
+ display:flex; flex-wrap:wrap; gap:1rem; align-items:center;
636
+ margin:0.75rem 0 1rem;
637
+ }}
638
+ .live-controls label {{
639
+ display:flex; flex-direction:column; gap:0.2rem;
640
+ font-size:0.8rem; color:var(--muted);
641
+ }}
642
+ .live-controls select, .live-controls input {{
643
+ background:#0b1225; border:1px solid #1f2a44; color:var(--text);
644
+ border-radius:8px; padding:0.35rem 0.55rem; font-size:0.9rem; min-width:110px;
645
+ }}
646
+ .live-controls button.pill.cta {{ cursor:pointer; border:0; }}
647
+ .live-controls button.pill.cta:disabled {{ opacity:0.6; cursor:wait; }}
648
+ .live-grid {{
649
+ display:grid; grid-template-columns: repeat(auto-fit, minmax(360px, 1fr));
650
+ gap:0.9rem; margin-top:0.5rem;
651
+ }}
652
+ .live-grid h4 {{
653
+ margin:0 0 0.3rem; font-size:0.85rem; color:#cbd5e1;
654
+ text-transform:uppercase; letter-spacing:0.04em;
655
+ }}
656
+ .live-grid .live-grid-full {{ grid-column: 1 / -1; }}
657
+ .live-grid pre {{
658
+ background:#0b1225; border:1px solid #1f2a44; border-radius:10px;
659
+ padding:0.75rem; margin:0; font-size:0.82rem; line-height:1.45;
660
+ max-height:320px; overflow:auto; white-space:pre-wrap; word-wrap:break-word;
661
+ }}
662
+ .live-error {{
663
+ background:#2a1418; border:1px solid #ef444455; color:#fca5a5;
664
+ border-radius:10px; padding:0.75rem; margin-top:0.75rem; font-size:0.9rem;
665
+ }}
666
+
667
  /* "Story in 2 minutes" hero panel β€” plain-English summary for judges. */
668
  .hero-card {{
669
  background: linear-gradient(135deg, #0f2647 0%, #172a4a 60%, #1f2a44 100%);
 
739
  <h3 style='margin-top:1.25rem'>What is the environment?</h3>
740
  <p class='sub' style='margin:0 0 0.75rem'>
741
  Three specialist agents with <strong>different permissions</strong> resolve
742
+ a live queue drawn from <strong>30 realistic tech incident templates</strong>
743
+ across 3 difficulty tiers.
744
  </p>
745
  <div class='table-wrap'>
746
  <table>
 
947
 
948
  {ablation_html}
949
 
950
+ {live_panel_html}
951
+
952
  {themes_html}
953
 
954
  <h2>Endpoints</h2>
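
For readers who want to exercise the new `/llm-demo` endpoint outside the dashboard, here is a minimal client sketch using only the standard library. The JSON body keys (`task_name`, `seed`) come from the handler in this commit; the helper names, the base URL in the usage note, and the 60-second timeout are assumptions for illustration.

```python
import json
import urllib.request
from typing import Any, Dict


def build_llm_demo_request(
    base_url: str, task_name: str = "easy", seed: int = 42
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /llm-demo endpoint."""
    body = json.dumps({"task_name": task_name, "seed": seed}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/llm-demo",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_llm_demo(
    base_url: str, task_name: str = "easy", seed: int = 42, timeout: float = 60.0
) -> Dict[str, Any]:
    """Send the request and return the decoded JSON trace."""
    request = build_llm_demo_request(base_url, task_name, seed)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `run_llm_demo("http://localhost:8000", "hard", 42)` assumes a locally running server on that port. Per the handler, a Space without credentials answers 503 and a failed remote call answers 502; with `urllib`, both surface as `urllib.error.HTTPError`.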
 
1028
  const total = Object.values(data.incidents_per_task || {{}}).reduce((a,b)=>a+b,0);
1029
  document.getElementById('kpi-inc').textContent = total;
1030
  }} catch (e) {{}}
1031
+
1032
+ // Live fine-tuned-model demo. Only runs if the panel is rendered.
1033
+ (function() {{
1034
+ const runBtn = document.getElementById('live-run');
1035
+ if (!runBtn) return;
1036
+
1037
+ const taskSel = document.getElementById('live-task');
1038
+ const seedInp = document.getElementById('live-seed');
1039
+ const out = document.getElementById('live-output');
1040
+ const err = document.getElementById('live-error');
1041
+ const obsPre = document.getElementById('live-obs-before');
1042
+ const promptPre = document.getElementById('live-prompt');
1043
+ const rawPre = document.getElementById('live-raw');
1044
+ const actPre = document.getElementById('live-action');
1045
+ const stepPre = document.getElementById('live-step');
1046
+
1047
+ function showError(msg) {{
1048
+ err.textContent = msg;
1049
+ err.hidden = false;
1050
+ out.hidden = true;
1051
+ }}
1052
+
1053
+ function renderOutput(data) {{
1054
+ err.hidden = true;
1055
+ obsPre.textContent = JSON.stringify(data.observation_before || {{}}, null, 2);
1056
+ promptPre.textContent = data.prompt || '';
1057
+ rawPre.textContent = data.raw_llm_response || '(empty response)';
1058
+ const fallbackTag = data.fallback_used
1059
+ ? '// NOTE: LLM JSON was invalid β€” safe fallback action was used instead.\\n'
1060
+ : '';
1061
+ actPre.textContent = fallbackTag + JSON.stringify(data.validated_action || {{}}, null, 2);
1062
+ stepPre.textContent = JSON.stringify(data.step_result || {{}}, null, 2);
1063
+ out.hidden = false;
1064
+ }}
1065
+
1066
+ runBtn.addEventListener('click', async () => {{
1067
+ runBtn.disabled = true;
1068
+ const label = runBtn.textContent;
1069
+ runBtn.textContent = '⏳ Calling model…';
1070
+ try {{
1071
+ const resp = await fetch('/llm-demo', {{
1072
+ method: 'POST',
1073
+ headers: {{'Content-Type': 'application/json'}},
1074
+ body: JSON.stringify({{
1075
+ task_name: taskSel.value,
1076
+ seed: Number(seedInp.value) || 0
1077
+ }})
1078
+ }});
1079
+ const data = await resp.json();
1080
+ if (!resp.ok) {{
1081
+ showError((data && data.error) ? data.error : ('HTTP ' + resp.status));
1082
+ }} else {{
1083
+ renderOutput(data);
1084
+ }}
1085
+ }} catch (e) {{
1086
+ showError('Network error: ' + e.message);
1087
+ }} finally {{
1088
+ runBtn.disabled = false;
1089
+ runBtn.textContent = label;
1090
+ }}
1091
+ }});
1092
+ }})();
1093
  </script>
1094
  </body>
1095
  </html>
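
One interaction between the dashboard script and the handler in this file is worth spelling out. The script sends `Number(seedInp.value) || 0`, and the handler reads it with `int(payload.get("seed") or _CONFIG.default_seed)`. Because `0` is falsy in Python, a requested seed of 0 is silently replaced by the default seed. A minimal sketch of a coercion that preserves 0 follows; `coerce_seed` and its `default_seed` parameter are hypothetical names, not part of this commit.

```python
from typing import Any


def coerce_seed(raw: Any, default_seed: int) -> int:
    """Return `raw` as an int; use `default_seed` only if it is missing or invalid."""
    # Treat None (key absent) and booleans as "no seed supplied".
    if raw is None or isinstance(raw, bool):
        return default_seed
    try:
        return int(raw)
    except (TypeError, ValueError):
        return default_seed
```

With this shape, `coerce_seed(payload.get("seed"), _CONFIG.default_seed)` would keep an explicit 0 while still falling back on empty strings or non-numeric input.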
server/domain/incidents.py CHANGED
@@ -850,17 +850,885 @@ def _deadlock_database() -> IncidentTemplate:
850
  )
851
 
852
 
853
  def build_incident_library() -> IncidentLibrary:
854
- """Return the built-in enterprise incident library."""
855
  return IncidentLibrary(
856
  templates_by_task={
857
- "easy": [_redis_pool(), _jwt_clock_skew(), _email_spam_false_positive()],
858
  "medium": [
859
  _cache_invalidation_lag(),
860
  _tz_normalization(),
861
  _invoice_idempotency(),
862
  _tls_expiry(),
863
  _feature_flag_rollout(),
864
  ],
865
  "hard": [
866
  _promo_rate_cascade(),
@@ -868,6 +1736,12 @@ def build_incident_library() -> IncidentLibrary:
868
  _alert_storm(),
869
  _inventory_race(),
870
  _deadlock_database(),
871
  ],
872
  }
873
  )
 
850
  )
851
 
852
 
853
+ # ---------------------------------------------------------------------------
854
+ # Extended catalog (round-2 polish)
855
+ #
856
+ # 17 additional templates balance the tier mix (free / standard / premium /
857
+ # enterprise), add new service dimensions (DNS, CDN, ML inference, storage,
858
+ # message queue, config distribution) and new failure modes (GPU memory leaks,
859
+ # replication saturation, cache key collisions, firmware regressions, DST
860
+ # bugs). Each template follows the same pattern as INC-E1..H5 so the reward
861
+ # rubric, environment plumbing and training scripts require no changes.
862
+ # ---------------------------------------------------------------------------
863
+
864
+
865
+ def _dns_ttl_stale() -> IncidentTemplate:
866
+ return IncidentTemplate(
867
+ id="INC-E4",
868
+ title="Stale DNS routes free-tier API traffic to drained region",
869
+ description=(
870
+ "Free-tier API callers keep hitting a drained region even after "
871
+ "a planned failover because DNS TTLs have not expired."
872
+ ),
873
+ category="networking",
874
+ difficulty="easy",
875
+ root_cause="dns_ttl_stale_after_failover",
876
+ root_cause_synonyms=(
877
+ "dns ttl stale after failover",
878
+ "stale dns record",
879
+ "long ttl blocking failover",
880
+ ),
881
+ clue_keywords=("dns", "ttl", "failover", "drain"),
882
+ signals=(
883
+ "Traffic ratio to drained region stays above 30% 30 minutes post-failover",
884
+ "Only free-tier resolvers (no Anycast) are affected",
885
+ ),
886
+ logs={
887
+ "dns-edge": "A record TTL=3600s still cached at regional resolvers",
888
+ "traffic-router": "Residual traffic observed on drained region us-west-2b",
889
+ },
890
+ red_herring_logs={
891
+ "payments-api": "steady 2xx",
892
+ },
893
+ metrics={
894
+ "dash-dns": "ttl_expired_ratio 0.71 (expected >0.95)",
895
+ "dash-router": "drained_region_share 34%",
896
+ },
897
+ red_herring_metrics={
898
+ "dash-cdn": "hit_ratio 95%",
899
+ },
900
+ kb={
901
+ "kb-dns-ttl": "Pre-lower TTL to 60s at least 2 TTLs before planned failovers.",
902
+ },
903
+ good_handoff="triage_agent",
904
+ accepted_fix_keywords=(
905
+ ("shorten", "dns", "ttl"),
906
+ ("force", "resolver", "refresh"),
907
+ ("rollback", "region", "drain"),
908
+ ),
909
+ required_investigations=1,
910
+ customer_tier="free",
911
+ affected_users_estimate=2_500,
912
+ revenue_impact_usd_per_min=15,
913
+ requires_mitigation=True,
914
+ )
915
+
916
+
917
+ def _cdn_purge_scope() -> IncidentTemplate:
918
+ return IncidentTemplate(
919
+ id="INC-E5",
920
+ title="CDN purge missed a hot asset after release",
921
+ description=(
922
+ "A marketing banner refresh missed a subset of CDN edges, so a "
923
+ "fraction of standard-tier users see the old creative."
924
+ ),
925
+ category="cdn",
926
+ difficulty="easy",
927
+ root_cause="cdn_purge_scope_mismatch",
928
+ root_cause_synonyms=(
929
+ "cdn purge scope mismatch",
930
+ "edge purge partial",
931
+ "shield purge missed",
932
+ ),
933
+ clue_keywords=("cdn", "purge", "edge", "shield"),
934
+ signals=(
935
+ "Small but persistent share of stale banner impressions",
936
+ "Affected edges cluster on a single PoP provider",
937
+ ),
938
+ logs={
939
+ "cdn-control-plane": "Purge job completed with 14 edges skipped (policy=legacy)",
940
+ "edge-pop-bom-1": "Serving banner_v12 while origin is on banner_v13",
941
+ },
942
+ metrics={
943
+ "dash-cdn": "stale_object_rate 1.4%, edge_sync_lag_s 312",
944
+ },
945
+ red_herring_metrics={
946
+ "dash-auth": "401_rate 0.2%",
947
+ },
948
+ kb={
949
+ "kb-cdn-purge": "Always use wildcard purge with full edge fanout for visual assets.",
950
+ },
951
+ good_handoff="investigator_agent",
952
+ accepted_fix_keywords=(
953
+ ("reissue", "cdn", "purge"),
954
+ ("fanout", "edge", "invalidation"),
955
+ ("rotate", "asset", "hash"),
956
+ ),
957
+ required_investigations=1,
958
+ customer_tier="standard",
959
+ affected_users_estimate=11_000,
960
+ revenue_impact_usd_per_min=60,
961
+ requires_mitigation=True,
962
+ )
963
+
964
+
965
+ def _autocomplete_stale() -> IncidentTemplate:
966
+ return IncidentTemplate(
967
+ id="INC-E6",
968
+ title="Search autocomplete missing this week's products",
969
+ description=(
970
+ "Free-tier shoppers see a stale autocomplete list that does not "
971
+ "surface new SKUs released this Monday."
972
+ ),
973
+ category="search",
974
+ difficulty="easy",
975
+ root_cause="autocomplete_index_rebuild_skipped",
976
+ root_cause_synonyms=(
977
+ "autocomplete index rebuild skipped",
978
+ "suggestion index stale",
979
+ "nightly reindex missed",
980
+ ),
981
+ clue_keywords=("autocomplete", "index", "reindex", "suggestion"),
982
+ signals=(
983
+ "New SKUs launched Monday never appear in suggest responses",
984
+ "Full text search returns them correctly",
985
+ ),
986
+ logs={
987
+ "suggest-indexer": "Scheduled rebuild skipped (upstream lock held)",
988
+ "suggest-api": "Serving snapshot v88 (expected v91)",
989
+ },
990
+ red_herring_logs={
991
+ "payments-api": "steady 2xx",
992
+ },
993
+ metrics={
994
+ "dash-suggest": "index_version 88, target_version 91",
995
+ "dash-search": "full_text_recall 99%, autocomplete_recall 71%",
996
+ },
997
+ kb={
998
+ "kb-autocomplete": "Reindex lock must release on job exit and alert on missed window.",
999
+ },
1000
+ good_handoff="ops_manager_agent",
1001
+ accepted_fix_keywords=(
1002
+ ("force", "index", "rebuild"),
1003
+ ("release", "reindex", "lock"),
1004
+ ("promote", "suggestion", "snapshot"),
1005
+ ),
1006
+ required_investigations=1,
1007
+ customer_tier="free",
1008
+ affected_users_estimate=18_000,
1009
+ revenue_impact_usd_per_min=30,
1010
+ requires_mitigation=True,
1011
+ )
1012
+
1013
+
1014
+ def _webhook_retry_budget() -> IncidentTemplate:
1015
+ return IncidentTemplate(
1016
+ id="INC-E7",
1017
+ title="Partner webhooks silently dropping",
1018
+ description=(
1019
+ "A handful of partner integrations stopped receiving webhook "
1020
+ "deliveries after a downstream 429 spike."
1021
+ ),
1022
+ category="integrations",
1023
+ difficulty="easy",
1024
+ root_cause="webhook_retry_budget_exhausted",
1025
+ root_cause_synonyms=(
1026
+ "webhook retry budget exhausted",
1027
+ "partner webhook giving up",
1028
+ "429 retry exhaustion",
1029
+ ),
1030
+ clue_keywords=("webhook", "retry", "429", "budget"),
1031
+ signals=(
1032
+ "Deliveries succeed for some partners and silently fail for others",
1033
+ "Affected partners all share a single rate-limit bucket",
1034
+ ),
1035
+ logs={
1036
+ "webhook-dispatcher": "Retry budget exhausted for partner_bucket=bucket-7",
1037
+ "partner-gateway": "HTTP 429 for 22 consecutive attempts on bucket-7",
1038
+ },
1039
+ red_herring_logs={
1040
+ "catalog-api": "steady 2xx",
1041
+ },
1042
+ metrics={
1043
+ "dash-webhooks": "delivery_success_bucket7 34%, retry_budget_remaining 0",
1044
+ },
1045
+ kb={
1046
+ "kb-webhook-retry": "Split rate-limit buckets per partner and reset retry budgets on recovery.",
1047
+ },
1048
+ good_handoff="ops_manager_agent",
1049
+ accepted_fix_keywords=(
1050
+ ("split", "retry", "bucket"),
1051
+ ("reset", "retry", "budget"),
1052
+ ("pause", "partner", "bucket"),
1053
+ ),
1054
+ required_investigations=2,
1055
+ customer_tier="standard",
1056
+ affected_users_estimate=1_400,
1057
+ revenue_impact_usd_per_min=80,
1058
+ requires_mitigation=True,
1059
+ )
1060
+
1061
+
1062
+ def _thumbnail_worker_oom() -> IncidentTemplate:
1063
+ return IncidentTemplate(
1064
+ id="INC-E8",
1065
+ title="User profile thumbnails render blank on mobile",
1066
+ description=(
1067
+ "Free-tier mobile users see empty circles where their profile "
1068
+ "photo should appear, intermittently."
1069
+ ),
1070
+ category="media",
1071
+ difficulty="easy",
1072
+ root_cause="thumbnail_worker_oom_killed",
1073
+ root_cause_synonyms=(
1074
+ "thumbnail worker oom killed",
1075
+ "image worker out of memory",
1076
+ "thumbnailer oom loop",
1077
+ ),
1078
+ clue_keywords=("thumbnail", "oom", "memory", "worker"),
1079
+ signals=(
1080
+ "Missing thumbnails correlate with HEIC uploads from newer devices",
1081
+ "CPU is normal but worker restart count is spiking",
1082
+ ),
1083
+ logs={
1084
+ "thumbnail-worker": "SIGKILL received (oom_score_adj=500)",
1085
+ "image-pipeline": "HEIC decoder peak rss 1.9GB on large uploads",
1086
+ },
1087
+ metrics={
1088
+ "dash-thumbnails": "render_success 82%, worker_restarts 240/hr",
1089
+ "dash-k8s": "pod_oom_kill_count 42",
1090
+ },
1091
+ kb={
1092
+ "kb-thumbnail": "Cap HEIC decode memory or reject above 30MP at the edge.",
1093
+ },
1094
+ good_handoff="triage_agent",
1095
+ accepted_fix_keywords=(
1096
+ ("raise", "memory", "limit"),
1097
+ ("reject", "oversized", "heic"),
1098
+ ("downscale", "before", "decode"),
1099
+ ),
1100
+ required_investigations=2,
1101
+ customer_tier="free",
1102
+ affected_users_estimate=55_000,
1103
+ revenue_impact_usd_per_min=20,
1104
+ requires_mitigation=True,
1105
+ )
1106
+
1107
+
1108
+ def _recommender_heap_leak() -> IncidentTemplate:
1109
+ return IncidentTemplate(
1110
+ id="INC-M6",
1111
+ title="Recommender latency drifts up after model swap",
1112
+ description=(
1113
+ "Homepage recommendation latency is drifting up over six hours "
1114
+ "since this morning's model swap. p99 is now 2.1s."
1115
+ ),
1116
+ category="recommendations",
1117
+ difficulty="medium",
1118
+ root_cause="recommender_heap_leak_after_model_swap",
1119
+ root_cause_synonyms=(
1120
+ "recommender heap leak after model swap",
1121
+ "embedding cache not released",
1122
+ "old model tensors pinned",
1123
+ ),
1124
+ clue_keywords=("heap", "leak", "embedding", "model", "swap"),
1125
+ signals=(
1126
+ "Heap utilisation climbs 2% / hour since deploy",
1127
+ "Full GC frequency doubled but does not recover memory",
1128
+ ),
1129
+ logs={
1130
+ "recommender-service": "Loaded model v42; previous tensors not released",
1131
+ "jvm-gc": "Old gen occupancy 88% after full GC",
1132
+ },
1133
+ red_herring_logs={
1134
+ "catalog-api": "steady 2xx",
1135
+ },
1136
+ metrics={
1137
+ "dash-recommender": "p99_latency_ms 2100, heap_used_pct 88",
1138
+ "dash-jvm": "full_gc_per_min 4, reclaimed_bytes_low",
1139
+ },
1140
+ red_herring_metrics={
1141
+ "dash-search": "ctr steady",
1142
+ },
1143
+ kb={
1144
+ "kb-model-swap": "Release previous model tensors explicitly before binding the new one.",
1145
+ },
1146
+ good_handoff="investigator_agent",
1147
+ accepted_fix_keywords=(
1148
+ ("release", "previous", "model"),
1149
+ ("unload", "embedding", "cache"),
1150
+ ("rollback", "model", "swap"),
1151
+ ),
1152
+ required_investigations=2,
1153
+ customer_tier="premium",
1154
+ affected_users_estimate=95_000,
1155
+ revenue_impact_usd_per_min=410,
1156
+ requires_mitigation=True,
1157
+ )
1158
+
1159
+
1160
+ def _consumer_group_rebalance() -> IncidentTemplate:
1161
+ return IncidentTemplate(
1162
+ id="INC-M7",
1163
+ title="Order events stuck behind consumer rebalance storm",
1164
+ description=(
1165
+ "Order processing lag spiked after a rolling restart and has not "
1166
+ "recovered; fresh orders are 90s behind real time."
1167
+ ),
1168
+ category="messaging",
1169
+ difficulty="medium",
1170
+ root_cause="consumer_group_rebalance_storm",
1171
+ root_cause_synonyms=(
1172
+ "consumer group rebalance storm",
1173
+ "kafka consumer thrashing",
1174
+ "repeated partition reassignment",
1175
+ ),
1176
+ clue_keywords=("kafka", "consumer", "rebalance", "partition"),
1177
+ signals=(
1178
+ "Consumer group rebalanced 11 times in 5 minutes",
1179
+ "Lag stuck even though CPU is at 30%",
1180
+ ),
1181
+ logs={
1182
+ "order-consumer": "Rebalance triggered: member id rotated, session timeout=10s",
1183
+ "kafka-coordinator": "Generation 412 -> 423 in 5m, partitions churning",
1184
+ },
1185
+ red_herring_logs={
1186
+ "auth-service": "normal 2xx",
1187
+ },
1188
+ metrics={
1189
+ "dash-orders": "consumer_lag 90s, rebalance_count_5m 11",
1190
+ "dash-kafka": "generation_rotations 2.2/min",
1191
+ },
1192
+ kb={
1193
+ "kb-consumer-tuning": "Raise session.timeout.ms and keep heartbeat.interval.ms at or below one third of it to avoid false expulsion.",
1194
+ },
1195
+ good_handoff="ops_manager_agent",
1196
+ accepted_fix_keywords=(
1197
+ ("raise", "session", "timeout"),
1198
+ ("pin", "static", "membership"),
1199
+ ("stabilise", "consumer", "group"),
1200
+ ),
1201
+ required_investigations=2,
1202
+ customer_tier="premium",
1203
+ affected_users_estimate=48_000,
1204
+ revenue_impact_usd_per_min=520,
1205
+ requires_mitigation=True,
1206
+ )
1207
+
1208
+
1209
+ def _config_push_skipped_canary() -> IncidentTemplate:
1210
+ return IncidentTemplate(
1211
+ id="INC-M8",
1212
+ title="Enterprise tenants hit TLS verify failures after config push",
1213
+ description=(
1214
+ "A global config change flipped a TLS verification flag in "
1215
+ "production without going through canary."
1216
+ ),
1217
+ category="platform",
1218
+ difficulty="medium",
1219
+ root_cause="config_push_skipped_canary",
1220
+ root_cause_synonyms=(
1221
+ "config push skipped canary",
1222
+ "global config bypassed stage",
1223
+ "bulk config rollout regression",
1224
+ ),
1225
+ clue_keywords=("config", "canary", "push", "rollout"),
1226
+ signals=(
1227
+ "Enterprise tenants see TLS verify errors 3 minutes after deploy",
1228
+ "Canary stage shows zero traffic for this change",
1229
+ ),
1230
+ logs={
1231
+ "config-service": "Changeset CR-8812 applied globally (stages=[])",
1232
+ "api-gateway": "TLS verify flag=strict caused downstream handshake failures",
1233
+ },
1234
+ red_herring_logs={
1235
+ "email-service": "no anomalies",
1236
+ },
1237
+ metrics={
1238
+ "dash-config": "canary_coverage 0%, rollout_surface 100%",
1239
+ "dash-gateway": "tls_verify_failures 8.3%",
1240
+ },
1241
+ kb={
1242
+ "kb-config-rollout": "Require canary + 15 minutes bake before promoting config changes.",
1243
+ },
1244
+ good_handoff="ops_manager_agent",
1245
+ accepted_fix_keywords=(
1246
+ ("rollback", "config", "change"),
1247
+ ("re-enable", "canary", "stage"),
1248
+ ("revert", "tls", "flag"),
1249
+ ),
1250
+ required_investigations=2,
1251
+ customer_tier="enterprise",
1252
+ affected_users_estimate=2_100,
1253
+ revenue_impact_usd_per_min=640,
1254
+ requires_mitigation=True,
1255
+ postmortem_required=True,
1256
+ )
1257
+
1258
+
1259
+ def _health_check_flapping() -> IncidentTemplate:
1260
+ return IncidentTemplate(
1261
+ id="INC-M9",
1262
+ title="Autoscaler thrashing under brief latency blips",
1263
+ description=(
1264
+ "Autoscaler is adding and removing pods every 2 minutes in "
1265
+ "response to very short latency blips."
1266
+ ),
1267
+ category="platform",
1268
+ difficulty="medium",
1269
+ root_cause="health_check_timeout_too_aggressive",
1270
+ root_cause_synonyms=(
1271
+ "health check timeout too aggressive",
1272
+ "liveness probe too tight",
1273
+ "autoscaler oscillating",
1274
+ ),
1275
+ clue_keywords=("health", "check", "liveness", "autoscaler"),
1276
+ signals=(
1277
+ "Pod churn 6x baseline with no underlying load change",
1278
+ "Brief p99 blips align with scale events, not incidents",
1279
+ ),
1280
+ logs={
1281
+ "kubelet": "Liveness probe failed: HTTP 500 after 800ms",
1282
+ "autoscaler": "Scale up triggered; 3 pods added, 2 removed within 2m",
1283
+ },
1284
+ red_herring_logs={
1285
+ "payments-api": "steady 2xx",
1286
+ },
1287
+ metrics={
1288
+ "dash-k8s": "pod_churn_per_min 9, cpu_avg 42%",
1289
+ "dash-slo": "p99_latency_ms spikes tied to scale events",
1290
+ },
1291
+ kb={
1292
+ "kb-health-probe": "Raise liveness timeout and stagger readiness to avoid flap-driven scale events.",
1293
+ },
1294
+ good_handoff="triage_agent",
1295
+ accepted_fix_keywords=(
1296
+ ("raise", "probe", "timeout"),
1297
+ ("dampen", "autoscaler", "cooldown"),
1298
+ ("relax", "liveness", "threshold"),
1299
+ ),
1300
+ required_investigations=2,
1301
+ customer_tier="standard",
1302
+ affected_users_estimate=31_000,
1303
+ revenue_impact_usd_per_min=210,
1304
+ requires_mitigation=True,
1305
+ )
1306
+
1307
+
1308
+ def _payment_webhook_dedupe() -> IncidentTemplate:
1309
+ return IncidentTemplate(
1310
+ id="INC-M10",
1311
+ title="Payment confirmations delivered twice to enterprise partners",
1312
+ description=(
1313
+ "Two enterprise payment partners received the same confirmation "
1314
+ "webhook twice for a subset of transactions."
1315
+ ),
1316
+ category="payments",
1317
+ difficulty="medium",
1318
+ root_cause="webhook_dedupe_window_too_narrow",
1319
+ root_cause_synonyms=(
1320
+ "webhook dedupe window too narrow",
1321
+ "payment webhook duplicate delivery",
1322
+ "idempotency window clock drift",
1323
+ ),
1324
+ clue_keywords=("webhook", "dedupe", "idempotency", "window"),
1325
+ signals=(
1326
+ "Duplicates concentrated on retries across failover boundary",
1327
+ "Dedupe cache TTL is shorter than retry backoff",
1328
+ ),
1329
+ logs={
1330
+ "payments-webhook": "Duplicate delivery for txn T-332a after dedupe cache eviction",
1331
+ "scheduler": "Retry backoff 90s; dedupe ttl=60s",
1332
+ },
1333
+ red_herring_logs={
1334
+ "email-service": "steady",
1335
+ },
1336
+ metrics={
1337
+ "dash-payments": "duplicate_webhook_rate 0.9%, dedupe_hit_rate 88%",
1338
+ },
1339
+ kb={
1340
+ "kb-webhook-dedupe": "Dedupe TTL must exceed the maximum retry backoff window.",
1341
+ },
1342
+ good_handoff="investigator_agent",
1343
+ accepted_fix_keywords=(
1344
+ ("extend", "dedupe", "ttl"),
1345
+ ("shrink", "retry", "backoff"),
1346
+ ("persist", "dedupe", "store"),
1347
+ ),
1348
+ required_investigations=2,
1349
+ customer_tier="enterprise",
1350
+ affected_users_estimate=620,
1351
+ revenue_impact_usd_per_min=480,
1352
+ requires_mitigation=True,
1353
+ postmortem_required=True,
1354
+ )
1355
+
1356
+
1357
+ def _origin_shield_bypass() -> IncidentTemplate:
1358
+ return IncidentTemplate(
1359
+ id="INC-M11",
1360
+ title="Origin overloaded after CDN policy change",
1361
+ description=(
1362
+ "Origin servers are seeing 5x normal traffic because a CDN "
1363
+ "policy change disabled origin shield for a large segment."
1364
+ ),
1365
+ category="cdn",
1366
+ difficulty="medium",
1367
+ root_cause="origin_shield_bypass_after_policy_change",
1368
+ root_cause_synonyms=(
1369
+ "origin shield bypass after policy change",
1370
+ "shield disabled for segment",
1371
+ "cache hierarchy collapsed",
1372
+ ),
1373
+ clue_keywords=("origin", "shield", "cdn", "policy"),
1374
+ signals=(
1375
+ "Origin 5xx rate climbs as CDN hit ratio collapses",
1376
+ "New CDN policy rolled out exactly at fault onset",
1377
+ ),
1378
+ logs={
1379
+ "cdn-policy": "Policy v5 removed shield targeting for premium segment",
1380
+ "origin-lb": "Connection queue depth spiking 5x baseline",
1381
+ },
1382
+ red_herring_logs={
1383
+ "dns-resolver": "no anomalies",
1384
+ },
1385
+ metrics={
1386
+ "dash-cdn": "hit_ratio 67% (baseline 94%)",
1387
+ "dash-origin": "rps 5.2x baseline, 5xx_rate 7.1%",
1388
+ },
1389
+ kb={
1390
+ "kb-origin-shield": "Changes to shield routing must go through shadow traffic before promotion.",
1391
+ },
1392
+ good_handoff="investigator_agent",
1393
+ accepted_fix_keywords=(
1394
+ ("rollback", "cdn", "policy"),
1395
+ ("re-enable", "origin", "shield"),
1396
+ ("route", "through", "shield"),
1397
+ ),
1398
+ required_investigations=3,
1399
+ customer_tier="premium",
1400
+ affected_users_estimate=240_000,
1401
+ revenue_impact_usd_per_min=1_300,
1402
+ requires_mitigation=True,
1403
+ postmortem_required=True,
1404
+ )
1405
+
1406
+
1407
+ def _gpu_memory_fragmentation() -> IncidentTemplate:
1408
+ return IncidentTemplate(
1409
+ id="INC-H6",
1410
+ title="LLM inference latency drifts up on production A100 pool",
1411
+ description=(
1412
+ "Enterprise API latency for the inference gateway has drifted "
1413
+ "from 420ms to 1.4s over 36 hours, with OOMs on larger prompts."
1414
+ ),
1415
+ category="ml_inference",
1416
+ difficulty="hard",
1417
+ root_cause="gpu_memory_fragmentation_after_prompt_schema_change",
1418
+ root_cause_synonyms=(
1419
+ "gpu memory fragmentation after prompt schema change",
1420
+ "kv cache fragmentation",
1421
+ "inference pool memory fragmentation",
1422
+ ),
1423
+ clue_keywords=("gpu", "memory", "fragmentation", "kv", "cache"),
1424
+ signals=(
1425
+ "Free VRAM fragmented into small blocks even though total free > 18GB",
1426
+ "OOM errors concentrate on prompts >2k tokens",
1427
+ ),
1428
+ logs={
1429
+ "inference-gateway": "CUDA OOM although torch reports 18GB free; fragmentation detected",
1430
+ "model-runner": "Prompt schema v3 increased variable sequence lengths",
1431
+ },
1432
+ red_herring_logs={
1433
+ "auth-service": "steady",
1434
+ },
1435
+ metrics={
1436
+ "dash-inference": "p99_latency_ms 1400, oom_rate 3.2%",
1437
+ "dash-gpu": "vram_fragmentation_score 0.74",
1438
+ },
1439
+ kb={
1440
+ "kb-vram": "Recycle inference workers daily and pad sequences to bucketed lengths.",
1441
+ },
1442
+ good_handoff="investigator_agent",
1443
+ accepted_fix_keywords=(
1444
+ ("recycle", "inference", "workers"),
1445
+ ("bucket", "prompt", "lengths"),
1446
+ ("rollback", "prompt", "schema"),
1447
+ ),
1448
+ required_investigations=3,
1449
+ customer_tier="enterprise",
1450
+ affected_users_estimate=5_200,
1451
+ revenue_impact_usd_per_min=1_850,
1452
+ requires_mitigation=True,
1453
+ postmortem_required=True,
1454
+ )
1455
+
1456
+
1457
+ def _replication_saturation() -> IncidentTemplate:
1458
+ return IncidentTemplate(
1459
+ id="INC-H7",
1460
+ title="Cross-region replication lag blocks disaster-recovery RPO",
1461
+ description=(
1462
+ "Replication lag from the primary region to DR has exceeded "
1463
+ "five minutes for the last hour, violating RPO=60s."
1464
+ ),
1465
+ category="data",
1466
+ difficulty="hard",
1467
+ root_cause="replication_saturation_during_backup_window",
1468
+ root_cause_synonyms=(
1469
+ "replication saturation during backup window",
1470
+ "wal shipping backpressure",
1471
+ "replica network saturation",
1472
+ ),
1473
+ clue_keywords=("replication", "lag", "wal", "rpo", "backup"),
1474
+ signals=(
1475
+ "Lag correlates exactly with nightly backup window",
1476
+ "Network egress saturated on primary -> DR link",
1477
+ ),
1478
+ logs={
1479
+ "db-primary": "WAL shipping backpressure; replica slot lagging 6.2m",
1480
+ "backup-job": "Base backup in progress; 4.1 GB/s read rate",
1481
+ },
1482
+ red_herring_logs={
1483
+ "notification-gateway": "steady delivery",
1484
+ },
1485
+ metrics={
1486
+ "dash-replication": "lag_seconds 372 (rpo=60)",
1487
+ "dash-network": "egress_primary_to_dr 9.8 Gbps (cap=10)",
1488
+ },
1489
+ kb={
1490
+ "kb-replication-backup": "Throttle the backup or move it outside hours of peak replication traffic.",
1491
+ },
1492
+ good_handoff="ops_manager_agent",
1493
+ accepted_fix_keywords=(
1494
+ ("throttle", "backup", "rate"),
1495
+ ("shift", "backup", "window"),
1496
+ ("raise", "replication", "bandwidth"),
1497
+ ),
1498
+ required_investigations=3,
1499
+ customer_tier="enterprise",
1500
+ affected_users_estimate=8_900,
1501
+ revenue_impact_usd_per_min=1_400,
1502
+ requires_mitigation=True,
1503
+ postmortem_required=True,
1504
+ )
1505
+
1506
+
1507
+ def _cache_key_collision() -> IncidentTemplate:
1508
+ return IncidentTemplate(
1509
+ id="INC-H8",
1510
+ title="Cross-tenant data bleed from cache key collision",
1511
+ description=(
1512
+ "A rare cache key collision is briefly returning one enterprise "
1513
+ "tenant's data to another. This is a data-isolation incident."
1514
+ ),
1515
+ category="security",
1516
+ difficulty="hard",
1517
+ root_cause="cache_key_collision_across_tenants",
1518
+ root_cause_synonyms=(
1519
+ "cache key collision across tenants",
1520
+ "shared cache tenant bleed",
1521
+ "tenant id missing from cache key",
1522
+ ),
1523
+ clue_keywords=("cache", "key", "collision", "tenant"),
1524
+ signals=(
1525
+ "Two enterprise tenants report seeing each other's dashboard metadata",
1526
+ "Cache key construction omits tenant-id under a specific code path",
1527
+ ),
1528
+ logs={
1529
+ "api-gateway": "Cache HIT for key=/v2/workspace/42 served to tenant=91",
1530
+ "cache-layer": "Collision detected between tenants 42 and 91 on key prefix /v2/workspace",
1531
+ },
1532
+ red_herring_logs={
1533
+ "email-service": "steady",
1534
+ },
1535
+ metrics={
1536
+ "dash-cache": "collision_count 14 in last 2h",
1537
+ "dash-security": "isolation_violations 2",
1538
+ },
1539
+ kb={
1540
+ "kb-cache-tenant": "Prefix every cache key with tenant_id and enforce via lint check.",
1541
+ },
1542
+ good_handoff="ops_manager_agent",
1543
+ accepted_fix_keywords=(
1544
+ ("prefix", "tenant", "cache"),
1545
+ ("invalidate", "shared", "cache"),
1546
+ ("quarantine", "cache", "segment"),
1547
+ ),
1548
+ required_investigations=3,
1549
+ customer_tier="enterprise",
1550
+ affected_users_estimate=320,
1551
+ revenue_impact_usd_per_min=2_100,
1552
+ requires_mitigation=True,
1553
+ postmortem_required=True,
1554
+ )
1555
+
1556
+
1557
+ def _cron_dst_double_trigger() -> IncidentTemplate:
1558
+ return IncidentTemplate(
1559
+ id="INC-H9",
1560
+ title="Scheduled jobs fire twice at DST rollover",
1561
+ description=(
1562
+ "Key premium billing jobs executed twice at the daylight-saving "
1563
+ "transition, causing duplicate charges for premium customers."
1564
+ ),
1565
+ category="scheduling",
1566
+ difficulty="hard",
1567
+ root_cause="cron_dst_transition_double_trigger",
1568
+ root_cause_synonyms=(
1569
+ "cron dst transition double trigger",
1570
+ "scheduler timezone ambiguity",
1571
+ "dst fallback replay",
1572
+ ),
1573
+ clue_keywords=("cron", "dst", "timezone", "scheduler"),
1574
+ signals=(
1575
+ "Job history shows two runs at 01:00 local time, one on each side of the clock change",
1576
+ "Billing duplicates concentrate on a single geographic region",
1577
+ ),
1578
+ logs={
1579
+ "scheduler": "Fired job billing.nightly at 2026-10-25 01:00 (GMT+1 and GMT+0)",
1580
+ "billing-worker": "Second invocation completed 12 minutes after first",
1581
+ },
1582
+ red_herring_logs={
1583
+ "catalog-api": "steady 2xx",
1584
+ },
1585
+ metrics={
1586
+ "dash-scheduler": "double_fire_count 3 (expected 0)",
1587
+ "dash-billing": "duplicate_charge_rate 2.1%",
1588
+ },
1589
+ kb={
1590
+ "kb-dst-schedule": "Anchor scheduled jobs on UTC and convert to local time at display only.",
1591
+ },
1592
+ good_handoff="investigator_agent",
1593
+ accepted_fix_keywords=(
1594
+ ("anchor", "schedule", "utc"),
1595
+ ("deduplicate", "scheduled", "runs"),
1596
+ ("reconcile", "duplicate", "charges"),
1597
+ ),
1598
+ required_investigations=3,
1599
+ customer_tier="premium",
1600
+ affected_users_estimate=6_400,
1601
+ revenue_impact_usd_per_min=1_100,
1602
+ requires_mitigation=True,
1603
+ postmortem_required=True,
1604
+ )
1605
+
1606
+
1607
+ def _partial_publish_feed() -> IncidentTemplate:
1608
+ return IncidentTemplate(
1609
+ id="INC-H10",
1610
+ title="Real-time feed gaps during partial publish",
1611
+ description=(
1612
+ "Premium trading-floor customers see gaps in the realtime price "
1613
+ "feed after a publisher restart; some updates never arrived."
1614
+ ),
1615
+ category="realtime",
1616
+ difficulty="hard",
1617
+ root_cause="partial_publish_without_transaction_boundary",
1618
+ root_cause_synonyms=(
1619
+ "partial publish without transaction boundary",
1620
+ "publisher crash mid batch",
1621
+ "realtime feed gap",
1622
+ ),
1623
+ clue_keywords=("publish", "transaction", "feed", "partial"),
1624
+ signals=(
1625
+ "Sequence numbers skip in a bounded window around the publisher restart",
1626
+ "Replay API can fill the gap but live subscribers missed it",
1627
+ ),
1628
+ logs={
1629
+ "price-publisher": "Process restarted mid-batch, seq=88230 not flushed",
1630
+ "realtime-bus": "Detected sequence gap 88230-88236 on channel=prices.us",
1631
+ },
1632
+ red_herring_logs={
1633
+ "auth-service": "steady",
1634
+ },
1635
+ metrics={
1636
+ "dash-realtime": "gap_count 6 in 30s, subscriber_reconcile_lag_s 48",
1637
+ },
1638
+ kb={
1639
+ "kb-publish-txn": "Wrap each batch in a transactional publish so crashes never leave gaps.",
1640
+ },
1641
+ good_handoff="investigator_agent",
1642
+ accepted_fix_keywords=(
1643
+ ("enable", "transactional", "publish"),
1644
+ ("replay", "sequence", "gap"),
1645
+ ("force", "subscriber", "reconcile"),
1646
+ ),
1647
+ required_investigations=3,
1648
+ customer_tier="premium",
1649
+ affected_users_estimate=3_900,
1650
+ revenue_impact_usd_per_min=1_750,
1651
+ requires_mitigation=True,
1652
+ postmortem_required=True,
1653
+ )
1654
+
1655
+
1656
+ def _ssd_firmware_regression() -> IncidentTemplate:
1657
+ return IncidentTemplate(
1658
+ id="INC-H11",
1659
+ title="Storage checksum failures on upgraded SSD fleet",
1660
+ description=(
1661
+ "Enterprise object storage is returning checksum-mismatch errors "
1662
+ "on a subset of volumes after a firmware roll-forward."
1663
+ ),
1664
+ category="storage",
1665
+ difficulty="hard",
1666
+ root_cause="ssd_firmware_checksum_regression",
1667
+ root_cause_synonyms=(
1668
+ "ssd firmware checksum regression",
1669
+ "storage firmware corruption",
1670
+ "nvme firmware crc bug",
1671
+ ),
1672
+ clue_keywords=("firmware", "ssd", "checksum", "storage"),
1673
+ signals=(
1674
+ "Checksum failures concentrate on volumes upgraded in the last 72 hours",
1675
+ "Vendor advisory mentions similar symptoms after firmware F2.14",
1676
+ ),
1677
+ logs={
1678
+ "storage-agent": "CRC mismatch on volume vol-221 firmware=F2.14",
1679
+ "fleet-manager": "Upgrade batch included F2.14 for 18 volumes",
1680
+ },
1681
+ red_herring_logs={
1682
+ "email-service": "steady",
1683
+ },
1684
+ metrics={
1685
+ "dash-storage": "checksum_error_rate 0.8%",
1686
+ "dash-fleet": "volumes_on_F2.14 18, volumes_healthy 402",
1687
+ },
1688
+ kb={
1689
+ "kb-ssd-firmware": "Quarantine affected firmware and roll back to the last known-good version.",
1690
+ },
1691
+ good_handoff="ops_manager_agent",
1692
+ accepted_fix_keywords=(
1693
+ ("rollback", "ssd", "firmware"),
1694
+ ("quarantine", "affected", "volumes"),
1695
+ ("reseed", "checksum", "index"),
1696
+ ),
1697
+ required_investigations=3,
1698
+ customer_tier="enterprise",
1699
+ affected_users_estimate=1_800,
1700
+ revenue_impact_usd_per_min=1_950,
1701
+ requires_mitigation=True,
1702
+ postmortem_required=True,
1703
+ )
1704
+
1705
+
1706
  def build_incident_library() -> IncidentLibrary:
1707
+ """Return the built-in enterprise incident library (30 templates)."""
1708
  return IncidentLibrary(
1709
  templates_by_task={
1710
+ "easy": [
1711
+ _redis_pool(),
1712
+ _jwt_clock_skew(),
1713
+ _email_spam_false_positive(),
1714
+ _dns_ttl_stale(),
1715
+ _cdn_purge_scope(),
1716
+ _autocomplete_stale(),
1717
+ _webhook_retry_budget(),
1718
+ _thumbnail_worker_oom(),
1719
+ ],
1720
  "medium": [
1721
  _cache_invalidation_lag(),
1722
  _tz_normalization(),
1723
  _invoice_idempotency(),
1724
  _tls_expiry(),
1725
  _feature_flag_rollout(),
1726
+ _recommender_heap_leak(),
1727
+ _consumer_group_rebalance(),
1728
+ _config_push_skipped_canary(),
1729
+ _health_check_flapping(),
1730
+ _payment_webhook_dedupe(),
1731
+ _origin_shield_bypass(),
1732
  ],
1733
  "hard": [
1734
  _promo_rate_cascade(),
 
1736
  _alert_storm(),
1737
  _inventory_race(),
1738
  _deadlock_database(),
1739
+ _gpu_memory_fragmentation(),
1740
+ _replication_saturation(),
1741
+ _cache_key_collision(),
1742
+ _cron_dst_double_trigger(),
1743
+ _partial_publish_feed(),
1744
+ _ssd_firmware_regression(),
1745
  ],
1746
  }
1747
  )
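Reviewer sketch, not part of this commit: the templates above all carry `accepted_fix_keywords` as a tuple of keyword groups, but the matching logic is not shown in this diff. One plausible reading, assumed here, is that a proposed fix is accepted when every keyword of at least one group appears in the fix text. The helper name `fix_matches` and the case-insensitive substring matching are illustrative assumptions, not the repository's scoring code.

```python
from typing import Sequence, Tuple

KeywordGroup = Tuple[str, ...]


def fix_matches(fix_text: str, groups: Sequence[KeywordGroup]) -> bool:
    """Return True if every keyword of at least one group occurs in fix_text.

    Hypothetical helper: matching is a case-insensitive substring test.
    """
    normalized = fix_text.lower()
    return any(all(word in normalized for word in group) for group in groups)


# Keyword groups copied from the INC-H9 template in this diff.
H9_GROUPS = (
    ("anchor", "schedule", "utc"),
    ("deduplicate", "scheduled", "runs"),
    ("reconcile", "duplicate", "charges"),
)

print(fix_matches("Anchor the billing schedule on UTC", H9_GROUPS))
print(fix_matches("Restart the scheduler", H9_GROUPS))
```

Substring matching is deliberately loose (for example, `schedule` also matches `scheduler`); a stricter variant could tokenize the text first.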
server/llm_remote.py ADDED
@@ -0,0 +1,193 @@
1
+ """Thin client for calling a remote LLM from the FastAPI server.
2
+
3
+ Used by the dashboard's "live inference" panel so a Hugging Face Space can
4
+ delegate the expensive forward pass to a dedicated HF Inference Endpoint
5
+ (GPU-backed) without loading the model inside the Space container.
6
+
7
+ Two backends are supported:
8
+
9
+ - ``chat`` (default): OpenAI-compatible ``/v1/chat/completions`` endpoint.
10
+ Hugging Face TGI-based Inference Endpoints expose this path, as do most
11
+ vLLM deployments. This is the recommended setup.
12
+ - ``generate``: Raw TGI ``/generate`` endpoint. Useful when chat templating
13
+ is already baked into the prompt and you just want raw text completion.
14
+
15
+ Configuration via environment variables (set them as HF Space secrets):
16
+
17
+ - ``LLM_ENDPOINT_URL``: **required** to enable the panel. E.g.
18
+ ``https://abc.us-east-1.aws.endpoints.huggingface.cloud``. Without this
19
+ env var, ``is_configured()`` returns ``False`` and the dashboard shows a
20
+ setup hint instead of the demo.
21
+ - ``HF_TOKEN``: **required**. A Hugging Face token with ``read``
22
+ scope over the model repo powering the endpoint.
23
+ - ``LLM_ENDPOINT_MODE``: optional, one of ``chat`` / ``generate``
24
+ (default: ``chat``).
25
+ - ``LLM_MODEL_ID``: optional display / routing hint the endpoint
26
+ sometimes cares about (default: ``"tgi"``).
27
+ - ``LLM_MAX_NEW_TOKENS``: optional integer (default: ``160``).
28
+ - ``LLM_TIMEOUT_S``: optional integer (default: ``25``).
29
+
30
+ The module uses only the Python stdlib (``urllib.request``) so it adds
31
+ zero extra dependencies to the HF Space Docker image.
32
+ """
33
+
34
+ from __future__ import annotations
35
+
36
+ import json
37
+ import logging
38
+ import os
39
+ import socket
40
+ import urllib.error
41
+ import urllib.request
42
+ from dataclasses import dataclass
43
+ from typing import Any, Dict, Optional
44
+
45
+ _LOG = logging.getLogger("icc.llm_remote")
46
+
47
+
48
+ @dataclass(frozen=True)
49
+ class RemoteLLMConfig:
50
+ endpoint_url: str
51
+ token: str
52
+ mode: str = "chat" # "chat" | "generate"
53
+ model_id: str = "tgi"
54
+ max_new_tokens: int = 160
55
+ timeout_s: int = 25
56
+
57
+ @classmethod
58
+ def from_env(cls) -> Optional["RemoteLLMConfig"]:
59
+ url = os.environ.get("LLM_ENDPOINT_URL", "").strip()
60
+ token = os.environ.get("HF_TOKEN", "").strip()
61
+ if not url or not token:
62
+ return None
63
+ return cls(
64
+ endpoint_url=url.rstrip("/"),
65
+ token=token,
66
+ mode=os.environ.get("LLM_ENDPOINT_MODE", "chat").strip().lower() or "chat",
67
+ model_id=os.environ.get("LLM_MODEL_ID", "tgi").strip() or "tgi",
68
+ max_new_tokens=int(os.environ.get("LLM_MAX_NEW_TOKENS", "160")),
69
+ timeout_s=int(os.environ.get("LLM_TIMEOUT_S", "25")),
70
+ )
71
+
72
+
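Reviewer sketch, not part of this commit: `from_env` passes `LLM_MAX_NEW_TOKENS` and `LLM_TIMEOUT_S` straight to `int(...)`, so a malformed or empty value raises `ValueError`, and that would propagate through `is_configured()` and `status_summary()`. A tolerant reader could look roughly like the following; the helper name `env_int` and its `minimum` clamp are hypothetical.

```python
import os


def env_int(name: str, default: int, minimum: int = 1) -> int:
    """Read an integer env var; fall back to `default` when unset or malformed."""
    raw = os.environ.get(name, "").strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    # Clamp so a zero or negative value cannot disable the timeout or budget.
    return max(value, minimum)


os.environ["LLM_TIMEOUT_S"] = "25s"       # malformed on purpose
os.environ["LLM_MAX_NEW_TOKENS"] = "64"
print(env_int("LLM_TIMEOUT_S", 25))
print(env_int("LLM_MAX_NEW_TOKENS", 160))
```

Falling back silently keeps the dashboard up; logging a warning on the fallback path would make the misconfiguration easier to spot.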
+ def is_configured() -> bool:
+     """Return True iff env vars required for remote inference are set."""
+     return RemoteLLMConfig.from_env() is not None
+
+
+ def status_summary() -> Dict[str, Any]:
+     """Lightweight status object for the dashboard to surface."""
+     cfg = RemoteLLMConfig.from_env()
+     if cfg is None:
+         return {
+             "configured": False,
+             "reason": (
+                 "Set LLM_ENDPOINT_URL and HF_TOKEN as Space secrets to enable "
+                 "the live inference panel."
+             ),
+         }
+     return {
+         "configured": True,
+         "mode": cfg.mode,
+         "model_id": cfg.model_id,
+         "max_new_tokens": cfg.max_new_tokens,
+         # Never surface the token; just confirm it is present.
+         "token_present": bool(cfg.token),
+         # Only expose the host (not the full URL, in case a query-string key
+         # ever leaks into env by accident).
+         "host": _safe_host(cfg.endpoint_url),
+     }
+
+
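+ # ---------------------------------------------------------------------------
+ # Internals
+ # ---------------------------------------------------------------------------
+
+
+ def _safe_host(url: str) -> str:
+     """Return ``host[:port]`` of ``url``, dropping credentials, path, and query."""
+     from urllib.parse import urlsplit
+
+     try:
+         parts = urlsplit(url)
+         host = parts.hostname
+         if not host:
+             return "(unknown)"
+         # Splitting on "/" alone would keep "user:pass@" and a path-less "?key=...".
+         return f"{host}:{parts.port}" if parts.port else host
+     except ValueError:
+         return "(unknown)"
+
+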
+ def _http_post(url: str, headers: Dict[str, str], body: bytes, timeout_s: int) -> str:
+     req = urllib.request.Request(url, data=body, headers=headers, method="POST")
+     try:
+         with urllib.request.urlopen(req, timeout=timeout_s) as resp:
+             return resp.read().decode("utf-8", errors="replace")
+     except urllib.error.HTTPError as exc:
+         raise RuntimeError(
+             f"LLM endpoint returned HTTP {exc.code}: {exc.read().decode('utf-8', errors='replace')[:400]}"
+         ) from exc
+     except (urllib.error.URLError, socket.timeout, TimeoutError) as exc:
+         raise RuntimeError(f"LLM endpoint unreachable: {exc}") from exc
+
+
+ def _call_chat(cfg: RemoteLLMConfig, prompt: str) -> str:
+     url = f"{cfg.endpoint_url}/v1/chat/completions"
+     payload = {
+         "model": cfg.model_id,
+         "messages": [{"role": "user", "content": prompt}],
+         "temperature": 0.0,
+         "max_tokens": cfg.max_new_tokens,
+         "stream": False,
+     }
+     headers = {
+         "Content-Type": "application/json",
+         "Authorization": f"Bearer {cfg.token}",
+     }
+     raw = _http_post(url, headers, json.dumps(payload).encode("utf-8"), cfg.timeout_s)
+     try:
+         data = json.loads(raw)
+     except json.JSONDecodeError as exc:
+         raise RuntimeError(f"LLM endpoint returned non-JSON: {raw[:400]}") from exc
+     try:
+         return data["choices"][0]["message"]["content"]
+     except (KeyError, IndexError, TypeError) as exc:
+         raise RuntimeError(f"Unexpected chat response shape: {raw[:400]}") from exc
+
+
+ def _call_generate(cfg: RemoteLLMConfig, prompt: str) -> str:
+     url = f"{cfg.endpoint_url}/generate"
+     payload = {
+         "inputs": prompt,
+         "parameters": {
+             "max_new_tokens": cfg.max_new_tokens,
+             # Greedy decoding. ``temperature`` is omitted because TGI's request
+             # validation rejects non-positive values.
+             "do_sample": False,
+             "return_full_text": False,
+         },
+     }
+     headers = {
+         "Content-Type": "application/json",
+         "Authorization": f"Bearer {cfg.token}",
+     }
+     raw = _http_post(url, headers, json.dumps(payload).encode("utf-8"), cfg.timeout_s)
+     try:
+         data = json.loads(raw)
+     except json.JSONDecodeError as exc:
+         raise RuntimeError(f"LLM endpoint returned non-JSON: {raw[:400]}") from exc
+     # TGI returns either {"generated_text": "..."} or a list of such objects.
+     if isinstance(data, list) and data:
+         data = data[0]
+     if isinstance(data, dict) and "generated_text" in data:
+         return str(data["generated_text"])
+     raise RuntimeError(f"Unexpected /generate response shape: {raw[:400]}")
+
+
+ def generate(prompt: str) -> str:
+     """Send ``prompt`` to the configured remote endpoint and return raw text.
+
+     Raises RuntimeError with a human-readable message on any failure so the
+     caller (the FastAPI demo endpoint) can surface it in the dashboard.
+     """
+     cfg = RemoteLLMConfig.from_env()
+     if cfg is None:
+         raise RuntimeError(
+             "Remote LLM not configured. Set LLM_ENDPOINT_URL and HF_TOKEN."
+         )
+     _LOG.info("Calling remote LLM %s mode=%s", _safe_host(cfg.endpoint_url), cfg.mode)
+     if cfg.mode == "generate":
+         return _call_generate(cfg, prompt)
+     return _call_chat(cfg, prompt)
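
The chat path above can be smoke-tested without a real inference endpoint. The sketch below is not part of the commit: `StubChatHandler`, `post_chat`, and `run_stub_roundtrip` are hypothetical names, and the stub only imitates the response shape that `_call_chat` reads (`choices[0].message.content`). It assumes loopback networking is available and bypasses any system proxy so the request reaches the local stub.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubChatHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for an OpenAI-compatible chat endpoint."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", "0"))
        request = json.loads(self.rfile.read(length))
        # Echo the prompt back in the response shape the client expects.
        content = "echo: " + request["messages"][0]["content"]
        reply = {"choices": [{"message": {"role": "assistant", "content": content}}]}
        body = json.dumps(reply).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def post_chat(base_url: str, token: str, prompt: str, timeout_s: int = 5) -> str:
    """Reduced copy of the chat request from the diff (illustrative only)."""
    payload = {
        "model": "tgi",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
        "max_tokens": 16,
        "stream": False,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    # An empty ProxyHandler ignores http_proxy-style env vars for the loopback call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req, timeout=timeout_s) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return data["choices"][0]["message"]["content"]


def run_stub_roundtrip(prompt: str) -> str:
    """Start the stub on a free port, send one prompt, and shut it down."""
    server = HTTPServer(("127.0.0.1", 0), StubChatHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        host, port = server.server_address
        return post_chat(f"http://{host}:{port}", "dummy-token", prompt)
    finally:
        server.shutdown()
        server.server_close()
```

To exercise the module itself the same way, one could point `LLM_ENDPOINT_URL` at the stub's address, set `HF_TOKEN` to any non-empty string, and call `generate("ping")`.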