SwapnilPatil28 committed on
Commit
8cbdbde
·
verified ·
1 Parent(s): 540b82c

Final Update

README.md CHANGED
@@ -43,7 +43,7 @@ A **virtual war room** where three specialist agents resolve a live queue of rea
43
  | 🧪 **Investigator** | Apply a fix · roll back a deploy | Escalate or file a post-mortem |
44
  | 👷 **Ops Manager** | Escalate · file post-mortem · **close the ticket** | Apply a code fix |
45
 
46
- **30 unique incident templates** · **3 difficulty tiers** (8 easy / 11 medium / 11 hard) · **14+ named reward signals** · **customer-tier weighting** (enterprise outages cost ~3× a free-tier outage)
47
 
48
  > Wrong actor → **−0.08**. Wrong root-cause on an enterprise ticket → **−1.98**. Correct closure on an enterprise ticket → **+1.44**. The rules matter — and every step tells you *why* it was scored.
49
 
@@ -113,7 +113,6 @@ Same pipeline, same data recipe, smaller backbone:
113
  | 💻 **Source code** | **[GitHub repo ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
114
  | 🎓 **Reproduce the training** | **[One-click Colab notebook ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
115
  | 📝 **Mini blog post** (the required short writeup) | **[`docs/BLOG_POST.md`](./docs/BLOG_POST.md)** |
116
- | 🎬 **2-minute video script** (optional bonus) | **[`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md)** |
117
 
118
  > Want the rubric math, architecture, full numbers, configuration, and the hackathon checklist? Keep scrolling — **Part 2** is the full technical README.
119
 
@@ -130,7 +129,6 @@ Same pipeline, same data recipe, smaller backbone:
130
  | GitHub repository | **[`github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
131
  | Training notebook (Colab T4, one-click reproducible) | **[Open in Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
132
  | Mini blog post (the required short writeup) | [`docs/BLOG_POST.md`](./docs/BLOG_POST.md) |
133
- | 2-minute video script (optional bonus) | [`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md) |
134
  | Submission checklist | [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md) |
135
  | Training script (Python) | [`train_trl.py`](./train_trl.py) |
136
 
@@ -638,7 +636,6 @@ Two scripts judges (or you) can run without a local IDE:
638
 
639
  ├── docs/
640
  │ ├── BLOG_POST.md # The short writeup (rule 4) — renders on HF Space + GitHub
641
- │ ├── VIDEO_SCRIPT.md # Optional 2-minute walkthrough script
642
  │ └── SUBMISSION_CHECKLIST.md # Judging-criteria status + smoke tests
643
 
644
  ├── artifacts/ # All committed training evidence
@@ -661,7 +658,7 @@ Two scripts judges (or you) can run without a local IDE:
661
  │ ├── Dockerfile # Production image (HEALTHCHECK included)
662
  │ └── domain/
663
  │ ├── __init__.py
664
- │ ├── incidents.py # 30 enterprise incident templates + factory
665
  │ ├── reward.py # Composable rubric engine (20+ components)
666
  │ ├── roles.py # Role-based permission policy
667
  │ └── rng.py # Deterministic per-episode RNG
@@ -697,7 +694,7 @@ ENV_LOG_LEVEL: "INFO"
697
  Full checklist with pre-submission smoke tests → [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md).
698
 
699
  - [x] **OpenEnv latest runtime** and `openenv validate` passing — [Space live](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)
700
- - [x] **Multi-agent, long-horizon environment** with role-gated action space (3 roles × 9 actions, **30 unique incident templates**)
701
  - [x] **Composable, transparent, anti-gaming reward rubric** (14+ named components, tier-scaled)
702
  - [x] **Business-impact-aware scoring** (customer tier, revenue impact, SLA countdown)
703
  - [x] **End-to-end TRL SFT pipeline** that saves a checkpoint and re-evaluates it in the environment ([`train_trl.py`](./train_trl.py))
@@ -708,7 +705,6 @@ Full checklist with pre-submission smoke tests → [`docs/SUBMISSION_CHECKLIST.m
708
  - [x] **Structured JSON logging** + 12-factor configuration
709
  - [x] **One-click Colab training notebook** → [Open ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)
710
  - [x] **Mini blog post** published as an MD file on both the HF Space and GitHub: [`docs/BLOG_POST.md`](./docs/BLOG_POST.md)
711
- - [x] **2-minute video script** (optional bonus): [`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md)
712
  - [x] **Full submission checklist** mapping every rule → evidence: [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md)
713
 
714
  ---
 
43
  | 🧪 **Investigator** | Apply a fix · roll back a deploy | Escalate or file a post-mortem |
44
  | 👷 **Ops Manager** | Escalate · file post-mortem · **close the ticket** | Apply a code fix |
45
 
46
+ **13 real incidents** · **3 difficulty tiers** (easy / medium / hard) · **14+ named reward signals** · **customer-tier weighting** (enterprise outages cost ~3× a free-tier outage)
47
 
48
  > Wrong actor → **−0.08**. Wrong root-cause on an enterprise ticket → **−1.98**. Correct closure on an enterprise ticket → **+1.44**. The rules matter — and every step tells you *why* it was scored.
49
 
 
113
  | 💻 **Source code** | **[GitHub repo ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
114
  | 🎓 **Reproduce the training** | **[One-click Colab notebook ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
115
  | 📝 **Mini blog post** (the required short writeup) | **[`docs/BLOG_POST.md`](./docs/BLOG_POST.md)** |
 
116
 
117
  > Want the rubric math, architecture, full numbers, configuration, and the hackathon checklist? Keep scrolling — **Part 2** is the full technical README.
118
 
 
129
  | GitHub repository | **[`github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
130
  | Training notebook (Colab T4, one-click reproducible) | **[Open in Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
131
  | Mini blog post (the required short writeup) | [`docs/BLOG_POST.md`](./docs/BLOG_POST.md) |
 
132
  | Submission checklist | [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md) |
133
  | Training script (Python) | [`train_trl.py`](./train_trl.py) |
134
 
 
636
 
637
  ├── docs/
638
  │ ├── BLOG_POST.md # The short writeup (rule 4) — renders on HF Space + GitHub
 
639
  │ └── SUBMISSION_CHECKLIST.md # Judging-criteria status + smoke tests
640
 
641
  ├── artifacts/ # All committed training evidence
 
658
  │ ├── Dockerfile # Production image (HEALTHCHECK included)
659
  │ └── domain/
660
  │ ├── __init__.py
661
+ │ ├── incidents.py # 13 enterprise incident templates + factory
662
  │ ├── reward.py # Composable rubric engine (20+ components)
663
  │ ├── roles.py # Role-based permission policy
664
  │ └── rng.py # Deterministic per-episode RNG
 
694
  Full checklist with pre-submission smoke tests → [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md).
695
 
696
  - [x] **OpenEnv latest runtime** and `openenv validate` passing — [Space live](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)
697
+ - [x] **Multi-agent, long-horizon environment** with role-gated action space (3 roles × 9 actions, 13 incidents)
698
  - [x] **Composable, transparent, anti-gaming reward rubric** (14+ named components, tier-scaled)
699
  - [x] **Business-impact-aware scoring** (customer tier, revenue impact, SLA countdown)
700
  - [x] **End-to-end TRL SFT pipeline** that saves a checkpoint and re-evaluates it in the environment ([`train_trl.py`](./train_trl.py))
 
705
  - [x] **Structured JSON logging** + 12-factor configuration
706
  - [x] **One-click Colab training notebook** → [Open ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)
707
  - [x] **Mini blog post** published as an MD file on both the HF Space and GitHub: [`docs/BLOG_POST.md`](./docs/BLOG_POST.md)
 
708
  - [x] **Full submission checklist** mapping every rule → evidence: [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md)
709
 
710
  ---
docs/BLOG_POST.md CHANGED
@@ -13,7 +13,6 @@
13
  | 💻 **GitHub source code** | **[github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
14
  | 🎓 **Reproducible training (Colab T4)** | **[Open in Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
15
  | 📖 **Full README** (story + technical deep-dive) | **[github.com/.../README.md ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center#readme)** |
16
- | 🎬 **2-min video walkthrough script** (optional bonus) | [`docs/VIDEO_SCRIPT.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/VIDEO_SCRIPT.md) |
17
  | ✅ **Submission checklist** | [`docs/SUBMISSION_CHECKLIST.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/SUBMISSION_CHECKLIST.md) |
18
 
19
  ---
@@ -24,7 +23,7 @@
24
 
25
  Each role has **different permissions**, **different information needs**, and a **different clock to beat**. Get it wrong and you bleed budget, bust the SLA, and — if the customer is on an enterprise contract — lose serious money (~3× what a free-tier outage costs).
26
 
27
- I built a simulator of that war room — an **OpenEnv-compatible** environment with **30 realistic incident templates**, 3 specialist roles, and 14+ named reward signals — and fine-tuned an LLM to run it.
28
 
29
  | Role | Can do | Cannot do |
30
  |---|---|---|
@@ -234,7 +233,6 @@ I ran the exact same pipeline with the smaller **Qwen2.5-0.5B-Instruct** backbon
234
  | **Source + tests** | [GitHub repo (21 passing tests, Dockerfile with HEALTHCHECK)](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center) |
235
  | **Full docs** | [README — Part 1 story + Part 2 technical deep-dive](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center#readme) |
236
  | **Committed evidence** | [`artifacts/`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/tree/main/artifacts) — all 4 PNGs + both JSON metric files |
237
- | **2-min video script** (optional bonus) | [`docs/VIDEO_SCRIPT.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/VIDEO_SCRIPT.md) |
238
  | **Submission checklist** | [`docs/SUBMISSION_CHECKLIST.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/SUBMISSION_CHECKLIST.md) |
239
 
240
  ---
@@ -242,9 +240,8 @@ I ran the exact same pipeline with the smaller **Qwen2.5-0.5B-Instruct** backbon
242
  ## 8. What's next
243
 
244
  - **Replace SFT with GRPO or PPO** using the environment's native reward signal — no heuristic teacher, let the rubric itself shape the policy and push past the imitation ceiling.
245
- - **Grow the incident catalog further** (now at 30 templates; next stop 50+ via JSON-defined scenarios).
246
  - **Add a second "adversarial" agent** that injects misleading signals to test robustness.
247
- - **Record the 2-minute walkthrough** from [`docs/VIDEO_SCRIPT.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/VIDEO_SCRIPT.md) as a bonus companion to this writeup.
248
 
249
  If you want to run it yourself, the Space and the repo are fully self-contained — `docker run` the image and point any OpenEnv-compatible client at it. Or just hit `/reset` and `/step` yourself from any language that can speak HTTP JSON.
250
 
 
13
  | 💻 **GitHub source code** | **[github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
14
  | 🎓 **Reproducible training (Colab T4)** | **[Open in Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
15
  | 📖 **Full README** (story + technical deep-dive) | **[github.com/.../README.md ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center#readme)** |
 
16
  | ✅ **Submission checklist** | [`docs/SUBMISSION_CHECKLIST.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/SUBMISSION_CHECKLIST.md) |
17
 
18
  ---
 
23
 
24
  Each role has **different permissions**, **different information needs**, and a **different clock to beat**. Get it wrong and you bleed budget, bust the SLA, and — if the customer is on an enterprise contract — lose serious money (~3× what a free-tier outage costs).
25
 
26
+ I built a simulator of that war room — an **OpenEnv-compatible** environment with 13 realistic incidents, 3 specialist roles, and 14+ named reward signals — and fine-tuned an LLM to run it.
27
 
28
  | Role | Can do | Cannot do |
29
  |---|---|---|
 
233
  | **Source + tests** | [GitHub repo (21 passing tests, Dockerfile with HEALTHCHECK)](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center) |
234
  | **Full docs** | [README — Part 1 story + Part 2 technical deep-dive](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center#readme) |
235
  | **Committed evidence** | [`artifacts/`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/tree/main/artifacts) — all 4 PNGs + both JSON metric files |
 
236
  | **Submission checklist** | [`docs/SUBMISSION_CHECKLIST.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/SUBMISSION_CHECKLIST.md) |
237
 
238
  ---
 
240
  ## 8. What's next
241
 
242
  - **Replace SFT with GRPO or PPO** using the environment's native reward signal — no heuristic teacher, let the rubric itself shape the policy and push past the imitation ceiling.
243
+ - **Scale the incident catalog** from 13 templates to 50+ (drop in JSON-defined scenarios).
244
  - **Add a second "adversarial" agent** that injects misleading signals to test robustness.
 
245
 
246
  If you want to run it yourself, the Space and the repo are fully self-contained — `docker run` the image and point any OpenEnv-compatible client at it. Or just hit `/reset` and `/step` yourself from any language that can speak HTTP JSON.
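As a rough illustration of the "any language that can speak HTTP JSON" point, here is a minimal Python sketch that uses only the standard library. The `/reset` and `/step` paths come from the text above, and the action keys (`actor`, `action_type`, `target`) are the ones the project's prompt asks the model to emit. Everything else is an assumption made for illustration: the base URL and port, the shape of the request bodies (including the `"action"` wrapper), and the `inspect_logs` action name. Check the Space's API docs before relying on any of it.

```python
import json
import urllib.request
from typing import Any, Dict


def build_post(base_url: str, path: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    url = base_url.rstrip("/") + path
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(base_url: str, path: str, payload: Dict[str, Any], timeout: float = 30.0) -> Dict[str, Any]:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_post(base_url, path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def example_action() -> Dict[str, Any]:
    """Hypothetical action payload; 'inspect_logs' is a placeholder name."""
    return {
        "actor": "triage_agent",
        "action_type": "inspect_logs",  # assumption, not confirmed by the source
        "target": "payments-api",
    }


# Usage (assumed URL and body shapes, not executed here):
#   obs = call("http://localhost:8000", "/reset", {"task_name": "easy", "seed": 7})
#   result = call("http://localhost:8000", "/step", {"action": example_action()})
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without a running server.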
247
 
docs/SUBMISSION_CHECKLIST.md CHANGED
@@ -11,10 +11,10 @@ Status against every hard gate in the official judging rules, plus every polish
11
  | 1 | **Use OpenEnv (latest release). Build on top of the framework; don't reinvent the wheel.** | ✅ | `requirements.txt` pins `openenv-core>=0.2.2`, `openenv.yaml` has `version: "3.0"`, `server/environment.py` extends `openenv.core.environment.Environment`, app built via `openenv.core.env_server.create_fastapi_app`. |
12
  | 2 | **Working training script (Unsloth / HF TRL / any RL framework), ideally as a Colab notebook so judges can re-run it.** | ✅ | [`train_trl.py`](../train_trl.py) uses HF TRL `SFTTrainer`. **[One-click Colab notebook ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** runs the whole pipeline end-to-end on a T4 in ~1 h 15 min. |
13
  | 3 | **Evidence that you actually trained: at minimum, loss and reward plots from a real run.** | ✅ | Four plots committed to [`artifacts/`](../artifacts): `training_curve.png` (loss + token accuracy), `reward_curve.png` (4-policy reward by tier), `reward_components.png` (per-component breakdown), plus the 0.5B ablation `reward_curve_qwen0p5b.png`. Full `training_log.json` + `summary_metrics.json` committed alongside. |
14
- | 4 | **Short writeup or video: mini-blog on Hugging Face OR <2-min YouTube video, linked from README.** | ✅ | Mini-blog lives as [`docs/BLOG_POST.md`](./BLOG_POST.md) — shipped as part of the HF Space (rule 4 says "mini-blog on Hugging Face"; the Space is on HF and contains this file, so it renders at `huggingface.co/spaces/.../blob/main/docs/BLOG_POST.md`). All four training plots render inline via raw GitHub URLs. README and dashboard both link to it. A 2-minute walkthrough script is also committed at [`docs/VIDEO_SCRIPT.md`](./VIDEO_SCRIPT.md) as a bonus. |
15
  | 5 | **Push your environment to a Hugging Face Space so it's discoverable and runnable.** | ✅ | **Live at [`swapnilpatil28-multi-agent-incident-command-center.hf.space`](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)** · Space page: [`huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center). |
16
  | 6 | **README motivates the problem, explains how the env works, and shows results.** | ✅ | [`README.md`](../README.md) — Part 1 ("Story in 2 minutes") opens with the problem in plain English, walks through the environment via role-permission tables, and shows all four plots + headline numbers. Part 2 is the full technical deep-dive (architecture, action/observation spaces, reward rubric, training pipeline, 0.5B ablation, ops/observability, testing, repo layout). |
17
- | 7 | **README links to the HF Space + all additional materials (video, blog, slides, etc.).** | ✅ | "Live links" table inside Part 2 of the README lists every resource. Part 1 also has a "Try it in 30 seconds" CTA table. The dashboard header plus "Resources & documentation" grid surface the same links from the live Space itself. |
18
  | 8 | **Do not include big video files in the HF submission — only public URLs.** | ✅ | No video files committed. All assets in [`artifacts/`](../artifacts) are PNG plots (≤ 162 KB each) + JSON. Repo weight is dominated by text and small images. |
19
 
20
  ---
@@ -26,7 +26,7 @@ Status against every hard gate in the official judging rules, plus every polish
26
  - [x] Multi-role, multi-agent — `triage_agent`, `investigator_agent`, `ops_manager_agent` with **non-overlapping permissions** (`server/domain/roles.py`).
27
  - [x] Long-horizon — 3–5 sequential incidents per episode, 20–60 steps each, shared SLA + budget counters.
28
  - [x] Professional / enterprise task simulation — realistic logs, metrics, KB articles, customer-tier revenue impact, SLA timers.
29
- - [x] **30 unique incident templates** across easy / medium / hard (`server/domain/incidents.py`) — 8 easy, 11 medium, 11 hard, covering services (payments, auth, CDN, search, DNS, ML inference, storage, scheduling, messaging, config distribution) and failure modes (OOM, cert expiry, config drift, DNS TTL staleness, rate-limit cascades, GPU fragmentation, cross-region replication lag, DST scheduler bugs, firmware regressions, cache-key tenant collisions).
30
  - [x] Rich observation schema — customer tier, revenue impact, allowed actors per action, investigation targets grouped by tool, playbook hints, `reward_components`, `last_action_notes`.
31
  - [x] Composable reward rubric with **14+ named components** and anti-gaming safeguards (`server/domain/reward.py`).
32
  - [x] Tier-weighted business impact (`free ×0.6 · standard ×1.0 · premium ×1.4 · enterprise ×1.8`).
@@ -37,9 +37,8 @@ Status against every hard gate in the official judging rules, plus every polish
37
  - [x] README **Part 1 — The story in 2 minutes** written in plain English, readable by a non-technical judge in under 3 minutes.
38
  - [x] Every plot has a one-line caption explaining what it shows.
39
  - [x] Blog post [`docs/BLOG_POST.md`](./BLOG_POST.md) — eight labelled sections, four plots inline via raw GitHub URLs (render everywhere), 0.5B-vs-1.5B ablation narrative, explicit hackathon-theme mapping.
40
- - [x] Live HF Space dashboard has a **"Story in 2 minutes"** hero panel at the top, a role-permission table, a three-card theme mapping, and a "Resources & documentation" grid with 8 click-through links.
41
- - [x] Video script [`docs/VIDEO_SCRIPT.md`](./VIDEO_SCRIPT.md) committed (optional bonus; the blog satisfies the writeup rule by itself).
42
- - [x] All documentation cross-links cleanly — README ↔ dashboard ↔ blog post ↔ video script ↔ checklist.
43
 
44
  ### Improvement in Rewards (20%)
45
 
@@ -83,9 +82,9 @@ Status against every hard gate in the official judging rules, plus every polish
83
  |---|---|---|
84
  | 1 | Final training run (Qwen2.5-1.5B, 8 eps/task, 3 epochs) → all artifacts committed | ✅ |
85
  | 2 | Commit artifacts (`reward_curve.png`, `training_curve.png`, `reward_components.png`, `reward_curve_qwen0p5b.png`, `training_log.json`, `summary_metrics.json`, `summary_metrics_qwen0p5b.json`) | ✅ |
86
- | 3 | Update README with real numbers + real Space / Colab / GitHub / blog / video-script links | ✅ |
87
  | 4 | Deploy HF Space from the same commit | ✅ |
88
- | 5 | Dashboard upgraded: hero story panel, 4 stacked plots, resources grid with README / blog / video-script / checklist links | ✅ |
89
  | 6 | Blog post updated (`docs/BLOG_POST.md`) with fixed image paths (raw GitHub URLs) and 0.5B ablation section | ✅ |
90
  | 7 | All 21 tests passing on latest commit | ✅ |
91
  | 8 | Run `openenv validate` remotely against the Space — `./validate-submission.sh <space-url>` | ⬜ (run it once before the deadline) |
@@ -121,4 +120,3 @@ ENV_URL=https://swapnilpatil28-multi-agent-incident-command-center.hf.space pyth
121
  | Mini blog post (MD file in the repo, renders on both HF Space and GitHub) | [`docs/BLOG_POST.md`](./BLOG_POST.md) |
122
  | Reproducible training notebook | [Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing) |
123
  | Training evidence (all 4 plots + JSON metrics) | [`artifacts/`](../artifacts) folder |
124
- | 2-minute video script (optional bonus) | [`docs/VIDEO_SCRIPT.md`](./VIDEO_SCRIPT.md) |
 
11
  | 1 | **Use OpenEnv (latest release). Build on top of the framework; don't reinvent the wheel.** | ✅ | `requirements.txt` pins `openenv-core>=0.2.2`, `openenv.yaml` has `version: "3.0"`, `server/environment.py` extends `openenv.core.environment.Environment`, app built via `openenv.core.env_server.create_fastapi_app`. |
12
  | 2 | **Working training script (Unsloth / HF TRL / any RL framework), ideally as a Colab notebook so judges can re-run it.** | ✅ | [`train_trl.py`](../train_trl.py) uses HF TRL `SFTTrainer`. **[One-click Colab notebook ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** runs the whole pipeline end-to-end on a T4 in ~1 h 15 min. |
13
  | 3 | **Evidence that you actually trained: at minimum, loss and reward plots from a real run.** | ✅ | Four plots committed to [`artifacts/`](../artifacts): `training_curve.png` (loss + token accuracy), `reward_curve.png` (4-policy reward by tier), `reward_components.png` (per-component breakdown), plus the 0.5B ablation `reward_curve_qwen0p5b.png`. Full `training_log.json` + `summary_metrics.json` committed alongside. |
14
+ | 4 | **Short writeup or video: mini-blog on Hugging Face OR <2-min YouTube video, linked from README.** | ✅ | Mini-blog lives as [`docs/BLOG_POST.md`](./BLOG_POST.md) — shipped as part of the HF Space (rule 4 says "mini-blog on Hugging Face"; the Space is on HF and contains this file, so it renders at `huggingface.co/spaces/.../blob/main/docs/BLOG_POST.md`). All four training plots render inline via raw GitHub URLs. README and dashboard both link to it. (No separate video submission.) |
15
  | 5 | **Push your environment to a Hugging Face Space so it's discoverable and runnable.** | ✅ | **Live at [`swapnilpatil28-multi-agent-incident-command-center.hf.space`](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)** · Space page: [`huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center). |
16
  | 6 | **README motivates the problem, explains how the env works, and shows results.** | ✅ | [`README.md`](../README.md) — Part 1 ("Story in 2 minutes") opens with the problem in plain English, walks through the environment via role-permission tables, and shows all four plots + headline numbers. Part 2 is the full technical deep-dive (architecture, action/observation spaces, reward rubric, training pipeline, 0.5B ablation, ops/observability, testing, repo layout). |
17
+ | 7 | **README links to the HF Space + all additional materials (blog, slides, etc.).** | ✅ | "Live links" table inside Part 2 of the README lists every resource. Part 1 also has a "Try it in 30 seconds" CTA table. The dashboard header plus "Resources & documentation" grid surface the same links from the live Space itself. |
18
  | 8 | **Do not include big video files in the HF submission — only public URLs.** | ✅ | No video files committed. All assets in [`artifacts/`](../artifacts) are PNG plots (≤ 162 KB each) + JSON. Repo weight is dominated by text and small images. |
19
 
20
  ---
 
26
  - [x] Multi-role, multi-agent — `triage_agent`, `investigator_agent`, `ops_manager_agent` with **non-overlapping permissions** (`server/domain/roles.py`).
27
  - [x] Long-horizon — 3–5 sequential incidents per episode, 20–60 steps each, shared SLA + budget counters.
28
  - [x] Professional / enterprise task simulation — realistic logs, metrics, KB articles, customer-tier revenue impact, SLA timers.
29
+ - [x] 13 unique incident templates across easy / medium / hard (`server/domain/incidents.py`).
30
  - [x] Rich observation schema — customer tier, revenue impact, allowed actors per action, investigation targets grouped by tool, playbook hints, `reward_components`, `last_action_notes`.
31
  - [x] Composable reward rubric with **14+ named components** and anti-gaming safeguards (`server/domain/reward.py`).
32
  - [x] Tier-weighted business impact (`free ×0.6 · standard ×1.0 · premium ×1.4 · enterprise ×1.8`).
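The tier weights listed above can be checked with a few lines of arithmetic. The sketch below copies the four documented multipliers into a local dictionary (the name `DOC_TIER_MULTIPLIER` is hypothetical and is not the project's `TIER_MULTIPLIER` object) and assumes the multiplier is applied as a plain product with a base reward. The base values of +0.80 and −1.10 are back-derived from the enterprise figures quoted in the README (+1.44 and −1.98); they are an inference, not values read from `server/domain/reward.py`.

```python
# Multipliers as documented in the checklist; the dictionary name is hypothetical.
DOC_TIER_MULTIPLIER = {
    "free": 0.6,
    "standard": 1.0,
    "premium": 1.4,
    "enterprise": 1.8,
}


def tier_scaled(base_reward: float, tier: str) -> float:
    """Scale a base reward by the customer-tier multiplier (assumed to be a plain product)."""
    return base_reward * DOC_TIER_MULTIPLIER[tier]


def tier_ratio(tier_a: str, tier_b: str) -> float:
    """How many times heavier tier_a is weighted than tier_b."""
    return DOC_TIER_MULTIPLIER[tier_a] / DOC_TIER_MULTIPLIER[tier_b]


# Enterprise vs. free: 1.8 / 0.6, which is the "~3x" quoted in the README.
enterprise_vs_free = round(tier_ratio("enterprise", "free"), 2)

# Back-derived (assumed) base values reproduce the README's enterprise figures.
correct_closure = round(tier_scaled(0.80, "enterprise"), 2)
wrong_root_cause = round(tier_scaled(-1.10, "enterprise"), 2)
```

The results are rounded because binary floating point does not represent 0.6 or 1.8 exactly, so the raw ratio is only approximately 3.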
 
37
  - [x] README **Part 1 — The story in 2 minutes** written in plain English, readable by a non-technical judge in under 3 minutes.
38
  - [x] Every plot has a one-line caption explaining what it shows.
39
  - [x] Blog post [`docs/BLOG_POST.md`](./BLOG_POST.md) — eight labelled sections, four plots inline via raw GitHub URLs (render everywhere), 0.5B-vs-1.5B ablation narrative, explicit hackathon-theme mapping.
40
+ - [x] Live HF Space dashboard has a **"Story in 2 minutes"** hero panel at the top, a role-permission table, a three-card theme mapping, and a "Resources & documentation" grid with click-through links (README, blog, checklist, Colab, Space, etc.).
41
+ - [x] All documentation cross-links cleanly: README ↔ dashboard ↔ blog post ↔ checklist.
 
42
 
43
  ### Improvement in Rewards (20%)
44
 
 
82
  |---|---|---|
83
  | 1 | Final training run (Qwen2.5-1.5B, 8 eps/task, 3 epochs) → all artifacts committed | ✅ |
84
  | 2 | Commit artifacts (`reward_curve.png`, `training_curve.png`, `reward_components.png`, `reward_curve_qwen0p5b.png`, `training_log.json`, `summary_metrics.json`, `summary_metrics_qwen0p5b.json`) | ✅ |
85
+ | 3 | Update README with real numbers + real Space / Colab / GitHub / blog links | ✅ |
86
  | 4 | Deploy HF Space from the same commit | ✅ |
87
+ | 5 | Dashboard upgraded: hero story panel, 4 stacked plots, resources grid with README / blog / checklist links | ✅ |
88
  | 6 | Blog post updated (`docs/BLOG_POST.md`) with fixed image paths (raw GitHub URLs) and 0.5B ablation section | ✅ |
89
  | 7 | All 21 tests passing on latest commit | ✅ |
90
  | 8 | Run `openenv validate` remotely against the Space — `./validate-submission.sh <space-url>` | ⬜ (run it once before the deadline) |
 
120
  | Mini blog post (MD file in the repo, renders on both HF Space and GitHub) | [`docs/BLOG_POST.md`](./BLOG_POST.md) |
121
  | Reproducible training notebook | [Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing) |
122
  | Training evidence (all 4 plots + JSON metrics) | [`artifacts/`](../artifacts) folder |
 
scripts/before_after_demo.py CHANGED
@@ -2,7 +2,7 @@
2
 
3
  Runs both policies against the same task under the same seed, prints a clean
4
  side-by-side trace, and writes ``artifacts/before_after_demo.md`` which you
5
- can paste into the blog post or screen-record for the video.
6
 
7
  Usage (after ``train_trl.py`` has saved ``artifacts/sft_model``)::
8
 
 
2
 
3
  Runs both policies against the same task under the same seed, prints a clean
4
  side-by-side trace, and writes ``artifacts/before_after_demo.md`` which you
5
+ can paste into the blog post or other writeups.
6
 
7
  Usage (after ``train_trl.py`` has saved ``artifacts/sft_model``)::
8
 
server/app.py CHANGED
@@ -38,13 +38,8 @@ from server.domain.reward import (
38
  TIER_MULTIPLIER,
39
  )
40
  from server.environment import IncidentCommandCenterEnvironment
41
- from server import llm_remote
42
  from server.logging_utils import configure_logging
43
 
44
- import re as _re
45
-
46
- _JSON_RE = _re.compile(r"\{[\s\S]*\}")
47
-
48
  _LOG = logging.getLogger("icc.app")
49
  _CONFIG = EnvConfig.from_env()
50
  configure_logging(level=_CONFIG.log_level, structured=_CONFIG.structured_logging)
@@ -61,7 +56,6 @@ COLAB_URL = "https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI
61
  # root for it; the other three open the HF file browser.
62
  README_URL = f"{SPACE_PAGE_URL}/blob/main/README.md"
63
  BLOG_POST_URL = f"{SPACE_PAGE_URL}/blob/main/docs/BLOG_POST.md"
64
- VIDEO_SCRIPT_URL = f"{SPACE_PAGE_URL}/blob/main/docs/VIDEO_SCRIPT.md"
65
  SUBMISSION_CHECKLIST_URL = f"{SPACE_PAGE_URL}/blob/main/docs/SUBMISSION_CHECKLIST.md"
66
 
67
  app = create_fastapi_app(
@@ -178,154 +172,6 @@ async def env_info() -> JSONResponse:
178
  return JSONResponse(_metadata_payload())
179
 
180
 
181
- # ---------------------------------------------------------------------------
182
- # Live LLM inference demo (optional — only enabled when HF credentials set)
183
- # ---------------------------------------------------------------------------
184
-
185
-
186
- def _build_demo_prompt(obs: IncidentObservation) -> str:
187
- """Same prompt format the SFT model was fine-tuned on (train_trl.obs_to_prompt)."""
188
- targets = obs.investigation_targets or {}
189
- return (
190
- "You are operating a multi-agent incident command center. "
191
- "Pick the next action for the appropriate specialist role.\n\n"
192
- f"Incident ID: {obs.incident_id}\n"
193
- f"Title: {obs.incident_title}\n"
194
- f"Description: {obs.incident_description}\n"
195
- f"Customer tier: {obs.customer_tier} | "
196
- f"Affected users: {obs.affected_users_estimate} | "
197
- f"Revenue impact (USD/min): {obs.revenue_impact_usd_per_min}\n"
198
- f"Postmortem required: {obs.postmortem_required}\n"
199
- f"Visible signals: {', '.join(obs.visible_signals or [])}\n"
200
- f"Available log targets: {', '.join(targets.get('logs', []) or [])}\n"
201
- f"Available metric targets: {', '.join(targets.get('metrics', []) or [])}\n"
202
- f"Available KB articles: {', '.join(targets.get('kb', []) or [])}\n"
203
- f"Budget remaining: {obs.budget_remaining} actions | "
204
- f"SLA remaining: {obs.sla_minutes_remaining} min | "
205
- f"Clues found: {obs.clues_found} | "
206
- f"Mitigation applied: {obs.mitigation_applied}\n"
207
- f"Last terminal output: {obs.terminal_output}\n\n"
208
- "Respond with a JSON object containing exactly these keys: "
209
- "actor, action_type, target, root_cause, resolution_summary, "
210
- "postmortem_note, confidence, reason."
211
- )
212
-
213
-
214
- def _parse_llm_action(response_text: str) -> Dict[str, Any]:
215
- """Extract the first balanced JSON object from a model response."""
216
- match = _JSON_RE.search(response_text or "")
217
- if not match:
218
- return {}
219
- raw = match.group(0)
220
- last_close = raw.rfind("}")
221
- if last_close != -1:
222
- raw = raw[: last_close + 1]
223
- try:
224
- return json.loads(raw)
225
- except (json.JSONDecodeError, TypeError):
226
- return {}
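A note on the removed helper above: its docstring says "first balanced JSON object", but the greedy pattern `\{[\s\S]*\}` spans from the first `{` to the last `}` in the text, so a response containing two objects would not parse. For comparison, here is one way the first decodable object could be extracted using the standard library's `json.JSONDecoder.raw_decode`. This is an alternative sketch, not the code that was in the repository, and the function name `extract_first_json_object` is hypothetical.

```python
import json
from typing import Any, Dict, Optional


def extract_first_json_object(text: Optional[str]) -> Dict[str, Any]:
    """Return the first JSON object that decodes cleanly, or {} if there is none."""
    source = text or ""
    decoder = json.JSONDecoder()
    start = source.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever trailing text follows it.
            value, _end = decoder.raw_decode(source, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # Not valid JSON at this brace; try the next opening brace.
        start = source.find("{", start + 1)
    return {}
```

With this approach, `'note {"a": 1} and {"b": 2}'` yields `{"a": 1}`, whereas a greedy first-to-last-brace match would hand `json.loads` the whole span and fail.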
227
-
228
-
229
- @app.get("/llm-demo-status", response_class=JSONResponse)
230
- async def llm_demo_status() -> JSONResponse:
231
- """Report whether the live-inference panel is usable (credentials set)."""
232
- return JSONResponse(llm_remote.status_summary())
233
-
234
-
235
- @app.post("/llm-demo", response_class=JSONResponse)
236
- async def llm_demo(payload: Dict[str, Any]) -> JSONResponse:
237
- """Run one live step against the fine-tuned model behind an HF endpoint.
238
-
239
- Spins up a fresh isolated ``IncidentCommandCenterEnvironment`` for each
240
- call so the demo never disturbs the main environment instance that is
241
- answering ``/reset`` and ``/step`` for training clients. Returns the full
242
- trace (observation → prompt → raw LLM text → parsed action → reward) so
243
- judges can see exactly what the model produced.
244
- """
245
- if not llm_remote.is_configured():
246
- return JSONResponse(
247
- {
248
- "error": "Remote LLM not configured on this Space.",
249
- "status": llm_remote.status_summary(),
250
- },
251
- status_code=503,
252
- )
253
-
254
- task_name = str(payload.get("task_name") or "easy").strip()
255
- try:
256
- seed = int(payload.get("seed") or _CONFIG.default_seed)
257
- except (TypeError, ValueError):
258
- seed = _CONFIG.default_seed
259
-
260
- # Isolated env so the live demo never clobbers the shared state.
261
- env = IncidentCommandCenterEnvironment()
262
- obs = env.reset(task_name=task_name, seed=seed)
263
- prompt = _build_demo_prompt(obs)
264
-
265
- try:
266
- raw_response = llm_remote.generate(prompt)
267
- except Exception as exc: # pragma: no cover - network-dependent
268
- return JSONResponse(
269
- {
270
- "error": f"Remote LLM call failed: {exc}",
271
- "status": llm_remote.status_summary(),
272
- },
273
- status_code=502,
274
- )
275
-
276
- parsed_action_dict = _parse_llm_action(raw_response)
277
-
278
- try:
279
- action = IncidentAction(**parsed_action_dict)
280
- parsed_ok = True
281
- except Exception:
282
- logs = (obs.investigation_targets or {}).get("logs", []) or []
283
- fallback_target = logs[0] if logs else "payments-api"
284
- action = IncidentAction(
285
- actor="triage_agent",
286
- action_type="inspect_logs",
287
- target=fallback_target,
288
- reason="Fallback (LLM JSON invalid).",
289
- )
290
- parsed_ok = False
291
-
292
- step_obs = env.step(action)
293
- reward_components = dict(step_obs.reward_components or {})
294
- reward_total = sum(reward_components.values()) if reward_components else 0.0
295
-
296
- return JSONResponse(
297
- {
298
- "task_name": task_name,
299
- "seed": seed,
300
- "observation_before": {
301
- "incident_id": obs.incident_id,
302
- "incident_title": obs.incident_title,
303
- "customer_tier": obs.customer_tier,
304
- "affected_users_estimate": obs.affected_users_estimate,
305
- "revenue_impact_usd_per_min": obs.revenue_impact_usd_per_min,
306
- "visible_signals": obs.visible_signals,
307
- "investigation_targets": obs.investigation_targets,
308
- "budget_remaining": obs.budget_remaining,
309
- "sla_minutes_remaining": obs.sla_minutes_remaining,
310
- },
311
- "prompt": prompt,
312
- "raw_llm_response": raw_response,
313
- "parsed_action": parsed_action_dict,
314
- "validated_action": action.model_dump(exclude_none=True),
315
- "fallback_used": not parsed_ok,
316
- "step_result": {
317
- "reward_total": round(reward_total, 4),
318
- "reward_components": {
319
- k: round(v, 4) for k, v in reward_components.items()
320
- },
321
- "done": bool(step_obs.done),
322
- "terminal_output": step_obs.terminal_output,
323
- "last_action_notes": list(step_obs.last_action_notes or []),
324
- },
325
- }
326
- )
327
-
328
-
329
  @app.get("/metrics", response_class=PlainTextResponse)
330
  async def metrics() -> PlainTextResponse:
331
  env = _resolve_environment()
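Reviewer note on the removed `_parse_llm_action` helper: its docstring promises the first balanced JSON object, but the body relies on a regex match (`_JSON_RE`, defined outside this hunk) trimmed at the last `}`. Should the live panel ever be restored, one possible approach is to let the standard library's `json.JSONDecoder.raw_decode` find the first object that actually parses. This is only a sketch: the function name `extract_first_json_object` is hypothetical and not part of the repository.

```python
import json
from typing import Any, Dict, Optional


def extract_first_json_object(response_text: Optional[str]) -> Dict[str, Any]:
    """Return the first decodable JSON object in ``response_text``, or {}.

    Hypothetical helper (assumption, not repository code): tries every
    opening brace in order and returns the first one that parses cleanly.
    """
    text = response_text or ""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at ``pos`` and
            # ignores whatever trails after it.
            obj, _end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        pos = text.find("{", pos + 1)
    return {}
```

Because the decoder stops at the end of the first valid object, trailing prose or stray braces in the model output do not affect the result, and nested objects are handled without a hand-written brace counter.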
@@ -479,81 +325,6 @@ def _dashboard_html() -> str:
479
  # so the existing `{themes_html}` slot renders to nothing (no duplication).
480
  themes_html = ""
481
 
482
- # --- Live inference panel (only shown when HF credentials set) ----------
483
- llm_status = llm_remote.status_summary()
484
- if llm_status.get("configured"):
485
- live_panel_html = f"""
486
- <h2>Try the fine-tuned model live</h2>
487
- <div class='card'>
488
- <p class='sub'>
489
- Spin up an isolated episode and watch the <strong>fine-tuned SFT model</strong>
490
- pick the next action in real time. The prompt below is the exact format
491
- used during training, so you can see how the model transforms a raw
492
- observation into a typed <code>IncidentAction</code> — and the
493
- environment's reward response.
494
- </p>
495
- <div class='live-controls'>
496
- <label>Task
497
- <select id='live-task'>
498
- <option value='easy'>easy</option>
499
- <option value='medium'>medium</option>
500
- <option value='hard' selected>hard</option>
501
- </select>
502
- </label>
503
- <label>Seed
504
- <input id='live-seed' type='number' value='42' min='0' step='1' />
505
- </label>
506
- <button id='live-run' class='pill cta'>▶ Run one step</button>
507
- <span id='live-status' class='sub'>Endpoint: {llm_status.get('host', '—')} · mode: {llm_status.get('mode', 'chat')}</span>
508
- </div>
509
- <div id='live-output' class='live-output' hidden>
510
- <div class='live-grid'>
511
- <div>
512
- <h4>Observation (before)</h4>
513
- <pre id='live-obs-before'></pre>
514
- </div>
515
- <div>
516
- <h4>Prompt sent to model</h4>
517
- <pre id='live-prompt'></pre>
518
- </div>
519
- <div>
520
- <h4>Raw LLM response</h4>
521
- <pre id='live-raw'></pre>
522
- </div>
523
- <div>
524
- <h4>Parsed &amp; validated action</h4>
525
- <pre id='live-action'></pre>
526
- </div>
527
- <div class='live-grid-full'>
528
- <h4>Environment step result</h4>
529
- <pre id='live-step'></pre>
530
- </div>
531
- </div>
532
- </div>
533
- <div id='live-error' class='live-error' hidden></div>
534
- </div>
535
- """
536
- else:
537
- live_panel_html = f"""
538
- <h2>Try the fine-tuned model live</h2>
539
- <div class='card'>
540
- <p class='sub'>
541
- <strong>Optional bonus panel.</strong> This Space can stream the
542
- fine-tuned SFT model's decisions in real time when a Hugging Face
543
- Inference Endpoint is attached. {llm_status.get('reason', '')}
544
- </p>
545
- <details>
546
- <summary class='sub'>How the owner enables it</summary>
547
- <ol>
548
- <li>Upload the SFT checkpoint from <code>artifacts/sft_model/</code> to a model repo on the Hub.</li>
549
- <li>Create a dedicated <a href='https://huggingface.co/inference-endpoints' target='_blank' rel='noopener'>Inference Endpoint</a> (T4 small is enough).</li>
550
- <li>Set <code>LLM_ENDPOINT_URL</code> and <code>HF_TOKEN</code> as secrets on this Space.</li>
551
- <li>Restart the Space — this panel turns on automatically.</li>
552
- </ol>
553
- </details>
554
- </div>
555
- """
556
-
557
  # --- Reward-rubric details ----------------------------------------------
558
  reward_rubric_rows = "".join(
559
  f"<tr><td><code>{name}</code></td><td>{value}</td></tr>"
@@ -630,40 +401,6 @@ def _dashboard_html() -> str:
630
  td.delta.good {{ color: var(--good); }}
631
  .links {{ display:flex; flex-wrap:wrap; gap:0.5rem; }}
632
 
633
- /* Live-inference panel (fine-tuned SFT model behind HF Inference Endpoint). */
634
- .live-controls {{
635
- display:flex; flex-wrap:wrap; gap:1rem; align-items:center;
636
- margin:0.75rem 0 1rem;
637
- }}
638
- .live-controls label {{
639
- display:flex; flex-direction:column; gap:0.2rem;
640
- font-size:0.8rem; color:var(--muted);
641
- }}
642
- .live-controls select, .live-controls input {{
643
- background:#0b1225; border:1px solid #1f2a44; color:var(--text);
644
- border-radius:8px; padding:0.35rem 0.55rem; font-size:0.9rem; min-width:110px;
645
- }}
646
- .live-controls button.pill.cta {{ cursor:pointer; border:0; }}
647
- .live-controls button.pill.cta:disabled {{ opacity:0.6; cursor:wait; }}
648
- .live-grid {{
649
- display:grid; grid-template-columns: repeat(auto-fit, minmax(360px, 1fr));
650
- gap:0.9rem; margin-top:0.5rem;
651
- }}
652
- .live-grid h4 {{
653
- margin:0 0 0.3rem; font-size:0.85rem; color:#cbd5e1;
654
- text-transform:uppercase; letter-spacing:0.04em;
655
- }}
656
- .live-grid .live-grid-full {{ grid-column: 1 / -1; }}
657
- .live-grid pre {{
658
- background:#0b1225; border:1px solid #1f2a44; border-radius:10px;
659
- padding:0.75rem; margin:0; font-size:0.82rem; line-height:1.45;
660
- max-height:320px; overflow:auto; white-space:pre-wrap; word-wrap:break-word;
661
- }}
662
- .live-error {{
663
- background:#2a1418; border:1px solid #ef444455; color:#fca5a5;
664
- border-radius:10px; padding:0.75rem; margin-top:0.75rem; font-size:0.9rem;
665
- }}
666
-
667
  /* "Story in 2 minutes" hero panel — plain-English summary for judges. */
668
  .hero-card {{
669
  background: linear-gradient(135deg, #0f2647 0%, #172a4a 60%, #1f2a44 100%);
@@ -739,8 +476,7 @@ def _dashboard_html() -> str:
739
  <h3 style='margin-top:1.25rem'>What is the environment?</h3>
740
  <p class='sub' style='margin:0 0 0.75rem'>
741
  Three specialist agents with <strong>different permissions</strong> resolve
742
- a live queue drawn from <strong>30 realistic tech incident templates</strong>
743
- across 3 difficulty tiers.
744
  </p>
745
  <div class='table-wrap'>
746
  <table>
@@ -847,11 +583,6 @@ def _dashboard_html() -> str:
847
  <div class='res-title'>Mini blog post</div>
848
  <div class='sub'>The short writeup — MD file on the HF Space + GitHub</div>
849
  </a>
850
- <a class='res-card' href='{VIDEO_SCRIPT_URL}' target='_blank' rel='noopener'>
851
- <div class='res-icon'>🎬</div>
852
- <div class='res-title'>2-minute video script</div>
853
- <div class='sub'>Optional bonus — shot list + narration</div>
854
- </a>
855
  <a class='res-card' href='{SUBMISSION_CHECKLIST_URL}' target='_blank' rel='noopener'>
856
  <div class='res-icon'>✅</div>
857
  <div class='res-title'>Submission checklist</div>
@@ -947,8 +678,6 @@ def _dashboard_html() -> str:
947
 
948
  {ablation_html}
949
 
950
- {live_panel_html}
951
-
952
  {themes_html}
953
 
954
  <h2>Endpoints</h2>
@@ -1017,7 +746,6 @@ def _dashboard_html() -> str:
1017
  <a href='{COLAB_URL}' target='_blank' rel='noopener'>Colab</a> ·
1018
  <a href='{README_URL}' target='_blank' rel='noopener'>README</a> ·
1019
  <a href='{BLOG_POST_URL}' target='_blank' rel='noopener'>Blog post</a> ·
1020
- <a href='{VIDEO_SCRIPT_URL}' target='_blank' rel='noopener'>Video script</a> ·
1021
  <a href='{SUBMISSION_CHECKLIST_URL}' target='_blank' rel='noopener'>Submission checklist</a>
1022
  </div>
1023
  </footer>
@@ -1028,68 +756,6 @@ def _dashboard_html() -> str:
1028
  const total = Object.values(data.incidents_per_task || {{}}).reduce((a,b)=>a+b,0);
1029
  document.getElementById('kpi-inc').textContent = total;
1030
  }} catch (e) {{}}
1031
-
1032
- // Live fine-tuned-model demo. Only runs if the panel is rendered.
1033
- (function() {{
1034
- const runBtn = document.getElementById('live-run');
1035
- if (!runBtn) return;
1036
-
1037
- const taskSel = document.getElementById('live-task');
1038
- const seedInp = document.getElementById('live-seed');
1039
- const out = document.getElementById('live-output');
1040
- const err = document.getElementById('live-error');
1041
- const obsPre = document.getElementById('live-obs-before');
1042
- const promptPre = document.getElementById('live-prompt');
1043
- const rawPre = document.getElementById('live-raw');
1044
- const actPre = document.getElementById('live-action');
1045
- const stepPre = document.getElementById('live-step');
1046
-
1047
- function showError(msg) {{
1048
- err.textContent = msg;
1049
- err.hidden = false;
1050
- out.hidden = true;
1051
- }}
1052
-
1053
- function renderOutput(data) {{
1054
- err.hidden = true;
1055
- obsPre.textContent = JSON.stringify(data.observation_before || {{}}, null, 2);
1056
- promptPre.textContent = data.prompt || '';
1057
- rawPre.textContent = data.raw_llm_response || '(empty response)';
1058
- const fallbackTag = data.fallback_used
1059
- ? '// NOTE: LLM JSON was invalid — safe fallback action was used instead.\\n'
1060
- : '';
1061
- actPre.textContent = fallbackTag + JSON.stringify(data.validated_action || {{}}, null, 2);
1062
- stepPre.textContent = JSON.stringify(data.step_result || {{}}, null, 2);
1063
- out.hidden = false;
1064
- }}
1065
-
1066
- runBtn.addEventListener('click', async () => {{
1067
- runBtn.disabled = true;
1068
- const label = runBtn.textContent;
1069
- runBtn.textContent = '⏳ Calling model…';
1070
- try {{
1071
- const resp = await fetch('/llm-demo', {{
1072
- method: 'POST',
1073
- headers: {{'Content-Type': 'application/json'}},
1074
- body: JSON.stringify({{
1075
- task_name: taskSel.value,
1076
- seed: Number(seedInp.value) || 0
1077
- }})
1078
- }});
1079
- const data = await resp.json();
1080
- if (!resp.ok) {{
1081
- showError((data && data.error) ? data.error : ('HTTP ' + resp.status));
1082
- }} else {{
1083
- renderOutput(data);
1084
- }}
1085
- }} catch (e) {{
1086
- showError('Network error: ' + e.message);
1087
- }} finally {{
1088
- runBtn.disabled = false;
1089
- runBtn.textContent = label;
1090
- }}
1091
- }});
1092
- }})();
1093
  </script>
1094
  </body>
1095
  </html>
 
38
  TIER_MULTIPLIER,
39
  )
40
  from server.environment import IncidentCommandCenterEnvironment
 
41
  from server.logging_utils import configure_logging
42

43
  _LOG = logging.getLogger("icc.app")
44
  _CONFIG = EnvConfig.from_env()
45
  configure_logging(level=_CONFIG.log_level, structured=_CONFIG.structured_logging)
 
56
  # root for it; the other three open the HF file browser.
57
  README_URL = f"{SPACE_PAGE_URL}/blob/main/README.md"
58
  BLOG_POST_URL = f"{SPACE_PAGE_URL}/blob/main/docs/BLOG_POST.md"
 
59
  SUBMISSION_CHECKLIST_URL = f"{SPACE_PAGE_URL}/blob/main/docs/SUBMISSION_CHECKLIST.md"
60
 
61
  app = create_fastapi_app(
 
172
  return JSONResponse(_metadata_payload())
173
 
174

175
  @app.get("/metrics", response_class=PlainTextResponse)
176
  async def metrics() -> PlainTextResponse:
177
  env = _resolve_environment()
 
325
  # so the existing `{themes_html}` slot renders to nothing (no duplication).
326
  themes_html = ""
327

328
  # --- Reward-rubric details ----------------------------------------------
329
  reward_rubric_rows = "".join(
330
  f"<tr><td><code>{name}</code></td><td>{value}</td></tr>"
 
401
  td.delta.good {{ color: var(--good); }}
402
  .links {{ display:flex; flex-wrap:wrap; gap:0.5rem; }}
403

404
  /* "Story in 2 minutes" hero panel — plain-English summary for judges. */
405
  .hero-card {{
406
  background: linear-gradient(135deg, #0f2647 0%, #172a4a 60%, #1f2a44 100%);
 
476
  <h3 style='margin-top:1.25rem'>What is the environment?</h3>
477
  <p class='sub' style='margin:0 0 0.75rem'>
478
  Three specialist agents with <strong>different permissions</strong> resolve
479
+ a live queue of 13 realistic tech incidents across 3 difficulty tiers.
 
480
  </p>
481
  <div class='table-wrap'>
482
  <table>
 
583
  <div class='res-title'>Mini blog post</div>
584
  <div class='sub'>The short writeup — MD file on the HF Space + GitHub</div>
585
  </a>
586
  <a class='res-card' href='{SUBMISSION_CHECKLIST_URL}' target='_blank' rel='noopener'>
587
  <div class='res-icon'>✅</div>
588
  <div class='res-title'>Submission checklist</div>
 
678
 
679
  {ablation_html}
680
 
 
 
681
  {themes_html}
682
 
683
  <h2>Endpoints</h2>
 
746
  <a href='{COLAB_URL}' target='_blank' rel='noopener'>Colab</a> ·
747
  <a href='{README_URL}' target='_blank' rel='noopener'>README</a> ·
748
  <a href='{BLOG_POST_URL}' target='_blank' rel='noopener'>Blog post</a> ·
 
749
  <a href='{SUBMISSION_CHECKLIST_URL}' target='_blank' rel='noopener'>Submission checklist</a>
750
  </div>
751
  </footer>
 
756
  const total = Object.values(data.incidents_per_task || {{}}).reduce((a,b)=>a+b,0);
757
  document.getElementById('kpi-inc').textContent = total;
758
  }} catch (e) {{}}
759
  </script>
760
  </body>
761
  </html>
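Reviewer note on the doubled braces in the hunks above: `_dashboard_html()` appears to build the page as a Python f-string, which is why literal CSS and JavaScript braces are written as `{{` and `}}` while single braces such as `{themes_html}` are interpolation slots. A minimal sketch of that pattern follows, assuming a hypothetical `render_card` helper that is not part of the repository.

```python
from html import escape


def render_card(title: str, incident_count: int) -> str:
    """Render a tiny HTML fragment from an f-string (illustrative only)."""
    # Escape interpolated text so it cannot break out of the markup.
    safe_title = escape(title)
    # Doubled braces emit literal braces; single braces interpolate values.
    return f"""
<style>
  .card {{ padding: 1rem; border-radius: 10px; }}
</style>
<div class='card'>
  <h2>{safe_title}</h2>
  <script>
    const meta = {{ incidents: {incident_count} }};
  </script>
</div>
"""
```

Forgetting to double a brace in such a template raises an error or silently interpolates a name, so any block removed from the template (like the live panel here) needs its matching `{slot}` reference removed as well, as this commit does for `{live_panel_html}`.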
server/domain/incidents.py CHANGED
@@ -850,885 +850,17 @@ def _deadlock_database() -> IncidentTemplate:
850
  )
851
 
852
 
853
- # ---------------------------------------------------------------------------
854
- # Extended catalog (round-2 polish)
855
- #
856
- # 17 additional templates balance the tier mix (free / standard / premium /
857
- # enterprise), add new service dimensions (DNS, CDN, ML inference, storage,
858
- # message queue, config distribution) and new failure modes (GPU memory leaks,
859
- # replication saturation, cache key collisions, firmware regressions, DST
860
- # bugs). Each template follows the same pattern as INC-E1..H5 so the reward
861
- # rubric, environment plumbing and training scripts require no changes.
862
- # ---------------------------------------------------------------------------
863
-
864
-
865
- def _dns_ttl_stale() -> IncidentTemplate:
866
- return IncidentTemplate(
867
- id="INC-E4",
868
- title="Stale DNS routes free-tier API traffic to drained region",
869
- description=(
870
- "Free-tier API callers keep hitting a drained region even after "
871
- "a planned failover because DNS TTLs have not expired."
872
- ),
873
- category="networking",
874
- difficulty="easy",
875
- root_cause="dns_ttl_stale_after_failover",
876
- root_cause_synonyms=(
877
- "dns ttl stale after failover",
878
- "stale dns record",
879
- "long ttl blocking failover",
880
- ),
881
- clue_keywords=("dns", "ttl", "failover", "drain"),
882
- signals=(
883
- "Traffic ratio to drained region stays above 30% 30 minutes post-failover",
884
- "Only free-tier resolvers (no Anycast) are affected",
885
- ),
886
- logs={
887
- "dns-edge": "A record TTL=3600s still cached at regional resolvers",
888
- "traffic-router": "Residual traffic observed on drained region us-west-2b",
889
- },
890
- red_herring_logs={
891
- "payments-api": "steady 2xx",
892
- },
893
- metrics={
894
- "dash-dns": "ttl_expired_ratio 0.71 (expected >0.95)",
895
- "dash-router": "drained_region_share 34%",
896
- },
897
- red_herring_metrics={
898
- "dash-cdn": "hit_ratio 95%",
899
- },
900
- kb={
901
- "kb-dns-ttl": "Pre-lower TTL to 60s at least 2 TTLs before planned failovers.",
902
- },
903
- good_handoff="triage_agent",
904
- accepted_fix_keywords=(
905
- ("shorten", "dns", "ttl"),
906
- ("force", "resolver", "refresh"),
907
- ("rollback", "region", "drain"),
908
- ),
909
- required_investigations=1,
910
- customer_tier="free",
911
- affected_users_estimate=2_500,
912
- revenue_impact_usd_per_min=15,
913
- requires_mitigation=True,
914
- )
915
-
916
-
917
- def _cdn_purge_scope() -> IncidentTemplate:
918
- return IncidentTemplate(
919
- id="INC-E5",
920
- title="CDN purge missed a hot asset after release",
921
- description=(
922
- "A marketing banner refresh missed a subset of CDN edges, so a "
923
- "fraction of standard-tier users see the old creative."
924
- ),
925
- category="cdn",
926
- difficulty="easy",
927
- root_cause="cdn_purge_scope_mismatch",
928
- root_cause_synonyms=(
929
- "cdn purge scope mismatch",
930
- "edge purge partial",
931
- "shield purge missed",
932
- ),
933
- clue_keywords=("cdn", "purge", "edge", "shield"),
934
- signals=(
935
- "Small but persistent share of stale banner impressions",
936
- "Affected edges cluster on a single PoP provider",
937
- ),
938
- logs={
939
- "cdn-control-plane": "Purge job completed with 14 edges skipped (policy=legacy)",
940
- "edge-pop-bom-1": "Serving banner_v12 while origin is on banner_v13",
941
- },
942
- metrics={
943
- "dash-cdn": "stale_object_rate 1.4%, edge_sync_lag_s 312",
944
- },
945
- red_herring_metrics={
946
- "dash-auth": "401_rate 0.2%",
947
- },
948
- kb={
949
- "kb-cdn-purge": "Always use wildcard purge with full edge fanout for visual assets.",
950
- },
951
- good_handoff="investigator_agent",
952
- accepted_fix_keywords=(
953
- ("reissue", "cdn", "purge"),
954
- ("fanout", "edge", "invalidation"),
955
- ("rotate", "asset", "hash"),
956
- ),
957
- required_investigations=1,
958
- customer_tier="standard",
959
- affected_users_estimate=11_000,
960
- revenue_impact_usd_per_min=60,
961
- requires_mitigation=True,
962
- )
963
-
964
-
965
- def _autocomplete_stale() -> IncidentTemplate:
966
- return IncidentTemplate(
967
- id="INC-E6",
968
- title="Search autocomplete missing this week's products",
969
- description=(
970
- "Free-tier shoppers see a stale autocomplete list that does not "
971
- "surface new SKUs released this Monday."
972
- ),
973
- category="search",
974
- difficulty="easy",
975
- root_cause="autocomplete_index_rebuild_skipped",
976
- root_cause_synonyms=(
977
- "autocomplete index rebuild skipped",
978
- "suggestion index stale",
979
- "nightly reindex missed",
980
- ),
981
- clue_keywords=("autocomplete", "index", "reindex", "suggestion"),
982
- signals=(
983
- "New SKUs launched Monday never appear in suggest responses",
984
- "Full text search returns them correctly",
985
- ),
986
- logs={
987
- "suggest-indexer": "Scheduled rebuild skipped (upstream lock held)",
988
- "suggest-api": "Serving snapshot v88 (expected v91)",
989
- },
990
- red_herring_logs={
991
- "payments-api": "steady 2xx",
992
- },
993
- metrics={
994
- "dash-suggest": "index_version 88, target_version 91",
995
- "dash-search": "full_text_recall 99%, autocomplete_recall 71%",
996
- },
997
- kb={
998
- "kb-autocomplete": "Reindex lock must release on job exit and alert on missed window.",
999
- },
1000
- good_handoff="ops_manager_agent",
1001
- accepted_fix_keywords=(
1002
- ("force", "index", "rebuild"),
1003
- ("release", "reindex", "lock"),
1004
- ("promote", "suggestion", "snapshot"),
1005
- ),
1006
- required_investigations=1,
1007
- customer_tier="free",
1008
- affected_users_estimate=18_000,
1009
- revenue_impact_usd_per_min=30,
1010
- requires_mitigation=True,
1011
- )
1012
-
1013
-
1014
- def _webhook_retry_budget() -> IncidentTemplate:
1015
- return IncidentTemplate(
1016
- id="INC-E7",
1017
- title="Partner webhooks silently dropping",
1018
- description=(
1019
- "A handful of partner integrations stopped receiving webhook "
1020
- "deliveries after a downstream 429 spike."
1021
- ),
1022
- category="integrations",
1023
- difficulty="easy",
1024
- root_cause="webhook_retry_budget_exhausted",
1025
- root_cause_synonyms=(
1026
- "webhook retry budget exhausted",
1027
- "partner webhook giving up",
1028
- "429 retry exhaustion",
1029
- ),
1030
- clue_keywords=("webhook", "retry", "429", "budget"),
1031
- signals=(
1032
- "Deliveries succeed for some partners and silently fail for others",
1033
- "Affected partners all share a single rate-limit bucket",
1034
- ),
1035
- logs={
1036
- "webhook-dispatcher": "Retry budget exhausted for partner_bucket=bucket-7",
1037
- "partner-gateway": "HTTP 429 for 22 consecutive attempts on bucket-7",
1038
- },
1039
- red_herring_logs={
1040
- "catalog-api": "steady 2xx",
1041
- },
1042
- metrics={
1043
- "dash-webhooks": "delivery_success_bucket7 34%, retry_budget_remaining 0",
1044
- },
1045
- kb={
1046
- "kb-webhook-retry": "Split rate-limit buckets per partner and reset retry budgets on recovery.",
1047
- },
1048
- good_handoff="ops_manager_agent",
1049
- accepted_fix_keywords=(
1050
- ("split", "retry", "bucket"),
1051
- ("reset", "retry", "budget"),
1052
- ("pause", "partner", "bucket"),
1053
- ),
1054
- required_investigations=2,
1055
- customer_tier="standard",
1056
- affected_users_estimate=1_400,
1057
- revenue_impact_usd_per_min=80,
1058
- requires_mitigation=True,
1059
- )
1060
-
1061
-
1062
- def _thumbnail_worker_oom() -> IncidentTemplate:
1063
- return IncidentTemplate(
1064
- id="INC-E8",
1065
- title="User profile thumbnails render blank on mobile",
1066
- description=(
1067
- "Free-tier mobile users see empty circles where their profile "
1068
- "photo should appear, intermittently."
1069
- ),
1070
- category="media",
1071
- difficulty="easy",
1072
- root_cause="thumbnail_worker_oom_killed",
1073
- root_cause_synonyms=(
1074
- "thumbnail worker oom killed",
1075
- "image worker out of memory",
1076
- "thumbnailer oom loop",
1077
- ),
1078
- clue_keywords=("thumbnail", "oom", "memory", "worker"),
1079
- signals=(
1080
- "Missing thumbnails correlate with HEIC uploads from newer devices",
1081
- "CPU is normal but worker restart count is spiking",
1082
- ),
1083
- logs={
1084
- "thumbnail-worker": "SIGKILL received (oom_score_adj=500)",
1085
- "image-pipeline": "HEIC decoder peak rss 1.9GB on large uploads",
1086
- },
1087
- metrics={
1088
- "dash-thumbnails": "render_success 82%, worker_restarts 240/hr",
1089
- "dash-k8s": "pod_oom_kill_count 42",
1090
- },
1091
- kb={
1092
- "kb-thumbnail": "Cap HEIC decode memory or reject above 30MP at the edge.",
1093
- },
1094
- good_handoff="triage_agent",
1095
- accepted_fix_keywords=(
1096
- ("raise", "memory", "limit"),
1097
- ("reject", "oversized", "heic"),
1098
- ("downscale", "before", "decode"),
1099
- ),
1100
- required_investigations=2,
1101
- customer_tier="free",
1102
- affected_users_estimate=55_000,
1103
- revenue_impact_usd_per_min=20,
1104
- requires_mitigation=True,
1105
- )
1106
-
1107
-
1108
- def _recommender_heap_leak() -> IncidentTemplate:
1109
- return IncidentTemplate(
1110
- id="INC-M6",
1111
- title="Recommender latency drifts up after model swap",
1112
- description=(
1113
- "Homepage recommendation latency is drifting up over six hours "
1114
- "since this morning's model swap. p99 is now 2.1s."
1115
- ),
1116
- category="recommendations",
1117
- difficulty="medium",
1118
- root_cause="recommender_heap_leak_after_model_swap",
1119
- root_cause_synonyms=(
1120
- "recommender heap leak after model swap",
1121
- "embedding cache not released",
1122
- "old model tensors pinned",
1123
- ),
1124
- clue_keywords=("heap", "leak", "embedding", "model", "swap"),
1125
- signals=(
1126
- "Heap utilisation climbs 2% / hour since deploy",
1127
- "Full GC frequency doubled but does not recover memory",
1128
- ),
1129
- logs={
1130
- "recommender-service": "Loaded model v42; previous tensors not released",
1131
- "jvm-gc": "Old gen occupancy 88% after full GC",
1132
- },
1133
- red_herring_logs={
1134
- "catalog-api": "steady 2xx",
1135
- },
1136
- metrics={
1137
- "dash-recommender": "p99_latency_ms 2100, heap_used_pct 88",
1138
- "dash-jvm": "full_gc_per_min 4, reclaimed_bytes_low",
1139
- },
1140
- red_herring_metrics={
1141
- "dash-search": "ctr steady",
1142
- },
1143
- kb={
1144
- "kb-model-swap": "Release previous model tensors explicitly before binding the new one.",
1145
- },
1146
- good_handoff="investigator_agent",
1147
- accepted_fix_keywords=(
1148
- ("release", "previous", "model"),
1149
- ("unload", "embedding", "cache"),
1150
- ("rollback", "model", "swap"),
1151
- ),
1152
- required_investigations=2,
1153
- customer_tier="premium",
1154
- affected_users_estimate=95_000,
1155
- revenue_impact_usd_per_min=410,
1156
- requires_mitigation=True,
1157
- )
1158
-
1159
-
1160
- def _consumer_group_rebalance() -> IncidentTemplate:
1161
- return IncidentTemplate(
1162
- id="INC-M7",
1163
- title="Order events stuck behind consumer rebalance storm",
1164
- description=(
1165
- "Order processing lag spiked after a rolling restart and has not "
1166
- "recovered; fresh orders are 90s behind real time."
1167
- ),
1168
- category="messaging",
1169
- difficulty="medium",
1170
- root_cause="consumer_group_rebalance_storm",
1171
- root_cause_synonyms=(
1172
- "consumer group rebalance storm",
1173
- "kafka consumer thrashing",
1174
- "repeated partition reassignment",
1175
- ),
1176
- clue_keywords=("kafka", "consumer", "rebalance", "partition"),
1177
- signals=(
1178
- "Consumer group rebalanced 11 times in 5 minutes",
1179
- "Lag stuck even though CPU is at 30%",
1180
- ),
1181
- logs={
1182
- "order-consumer": "Rebalance triggered: member id rotated, session timeout=10s",
1183
- "kafka-coordinator": "Generation 412 -> 423 in 5m, partitions churning",
1184
- },
1185
- red_herring_logs={
1186
- "auth-service": "normal 2xx",
1187
- },
1188
- metrics={
1189
- "dash-orders": "consumer_lag 90s, rebalance_count_5m 11",
1190
- "dash-kafka": "generation_rotations 2.2/min",
1191
- },
1192
- kb={
1193
- "kb-consumer-tuning": "Raise session.timeout.ms and heartbeat.interval.ms to avoid false expulsion.",
1194
- },
1195
- good_handoff="ops_manager_agent",
1196
- accepted_fix_keywords=(
1197
- ("raise", "session", "timeout"),
1198
- ("pin", "static", "membership"),
1199
- ("stabilise", "consumer", "group"),
1200
- ),
1201
- required_investigations=2,
1202
- customer_tier="premium",
1203
- affected_users_estimate=48_000,
1204
- revenue_impact_usd_per_min=520,
1205
- requires_mitigation=True,
1206
- )
1207
-
1208
-
1209
- def _config_push_skipped_canary() -> IncidentTemplate:
1210
- return IncidentTemplate(
1211
- id="INC-M8",
1212
- title="Enterprise tenants hit TLS verify failures after config push",
1213
- description=(
1214
- "A global config change flipped a TLS verification flag in "
1215
- "production without going through canary."
1216
- ),
1217
- category="platform",
1218
- difficulty="medium",
1219
- root_cause="config_push_skipped_canary",
1220
- root_cause_synonyms=(
1221
- "config push skipped canary",
1222
- "global config bypassed stage",
1223
- "bulk config rollout regression",
1224
- ),
1225
- clue_keywords=("config", "canary", "push", "rollout"),
1226
- signals=(
1227
- "Enterprise tenants see TLS verify errors 3 minutes after deploy",
1228
- "Canary stage shows zero traffic for this change",
1229
- ),
1230
- logs={
1231
- "config-service": "Changeset CR-8812 applied globally (stages=[])",
1232
- "api-gateway": "TLS verify flag=strict caused downstream handshake failures",
1233
- },
1234
- red_herring_logs={
1235
- "email-service": "no anomalies",
1236
- },
1237
- metrics={
1238
- "dash-config": "canary_coverage 0%, rollout_surface 100%",
1239
- "dash-gateway": "tls_verify_failures 8.3%",
1240
- },
1241
- kb={
1242
- "kb-config-rollout": "Require canary + 15 minutes bake before promoting config changes.",
1243
- },
1244
- good_handoff="ops_manager_agent",
1245
- accepted_fix_keywords=(
1246
- ("rollback", "config", "change"),
1247
- ("re-enable", "canary", "stage"),
1248
- ("revert", "tls", "flag"),
1249
- ),
1250
- required_investigations=2,
1251
- customer_tier="enterprise",
1252
- affected_users_estimate=2_100,
1253
- revenue_impact_usd_per_min=640,
1254
- requires_mitigation=True,
1255
- postmortem_required=True,
1256
- )
1257
-
1258
-
-def _health_check_flapping() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-M9",
-        title="Autoscaler thrashing under brief latency blips",
-        description=(
-            "Autoscaler is adding and removing pods every 2 minutes in "
-            "response to very short latency blips."
-        ),
-        category="platform",
-        difficulty="medium",
-        root_cause="health_check_timeout_too_aggressive",
-        root_cause_synonyms=(
-            "health check timeout too aggressive",
-            "liveness probe too tight",
-            "autoscaler oscillating",
-        ),
-        clue_keywords=("health", "check", "liveness", "autoscaler"),
-        signals=(
-            "Pod churn 6x baseline with no underlying load change",
-            "Brief p99 blips align with scale events, not incidents",
-        ),
-        logs={
-            "kubelet": "Liveness probe failed: HTTP 500 after 800ms",
-            "autoscaler": "Scale up triggered; 3 pods added, 2 removed within 2m",
-        },
-        red_herring_logs={
-            "payments-api": "steady 2xx",
-        },
-        metrics={
-            "dash-k8s": "pod_churn_per_min 9, cpu_avg 42%",
-            "dash-slo": "p99_latency_ms spikes tied to scale events",
-        },
-        kb={
-            "kb-health-probe": "Raise liveness timeout and stagger readiness to avoid flap-driven scale events.",
-        },
-        good_handoff="triage_agent",
-        accepted_fix_keywords=(
-            ("raise", "probe", "timeout"),
-            ("dampen", "autoscaler", "cooldown"),
-            ("relax", "liveness", "threshold"),
-        ),
-        required_investigations=2,
-        customer_tier="standard",
-        affected_users_estimate=31_000,
-        revenue_impact_usd_per_min=210,
-        requires_mitigation=True,
-    )
-
-
-def _payment_webhook_dedupe() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-M10",
-        title="Payment confirmations delivered twice to enterprise partners",
-        description=(
-            "Two enterprise payment partners received the same confirmation "
-            "webhook twice for a subset of transactions."
-        ),
-        category="payments",
-        difficulty="medium",
-        root_cause="webhook_dedupe_window_too_narrow",
-        root_cause_synonyms=(
-            "webhook dedupe window too narrow",
-            "payment webhook duplicate delivery",
-            "idempotency window clock drift",
-        ),
-        clue_keywords=("webhook", "dedupe", "idempotency", "window"),
-        signals=(
-            "Duplicates concentrated on retries across failover boundary",
-            "Dedupe cache TTL is shorter than retry backoff",
-        ),
-        logs={
-            "payments-webhook": "Duplicate delivery for txn T-332a after dedupe cache eviction",
-            "scheduler": "Retry backoff 90s; dedupe ttl=60s",
-        },
-        red_herring_logs={
-            "email-service": "steady",
-        },
-        metrics={
-            "dash-payments": "duplicate_webhook_rate 0.9%, dedupe_hit_rate 88%",
-        },
-        kb={
-            "kb-webhook-dedupe": "Dedupe TTL must exceed the maximum retry backoff window.",
-        },
-        good_handoff="investigator_agent",
-        accepted_fix_keywords=(
-            ("extend", "dedupe", "ttl"),
-            ("shrink", "retry", "backoff"),
-            ("persist", "dedupe", "store"),
-        ),
-        required_investigations=2,
-        customer_tier="enterprise",
-        affected_users_estimate=620,
-        revenue_impact_usd_per_min=480,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
-
-
-def _origin_shield_bypass() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-M11",
-        title="Origin overloaded after CDN policy change",
-        description=(
-            "Origin servers are seeing 5x normal traffic because a CDN "
-            "policy change disabled origin shield for a large segment."
-        ),
-        category="cdn",
-        difficulty="medium",
-        root_cause="origin_shield_bypass_after_policy_change",
-        root_cause_synonyms=(
-            "origin shield bypass after policy change",
-            "shield disabled for segment",
-            "cache hierarchy collapsed",
-        ),
-        clue_keywords=("origin", "shield", "cdn", "policy"),
-        signals=(
-            "Origin 5xx rate climbs as CDN hit ratio collapses",
-            "New CDN policy rolled out exactly at fault onset",
-        ),
-        logs={
-            "cdn-policy": "Policy v5 removed shield targeting for premium segment",
-            "origin-lb": "Connection queue depth spiking 5x baseline",
-        },
-        red_herring_logs={
-            "dns-resolver": "no anomalies",
-        },
-        metrics={
-            "dash-cdn": "hit_ratio 67% (baseline 94%)",
-            "dash-origin": "rps 5.2x baseline, 5xx_rate 7.1%",
-        },
-        kb={
-            "kb-origin-shield": "Changes to shield routing must go through shadow traffic before promotion.",
-        },
-        good_handoff="investigator_agent",
-        accepted_fix_keywords=(
-            ("rollback", "cdn", "policy"),
-            ("re-enable", "origin", "shield"),
-            ("route", "through", "shield"),
-        ),
-        required_investigations=3,
-        customer_tier="premium",
-        affected_users_estimate=240_000,
-        revenue_impact_usd_per_min=1_300,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
-
-
-def _gpu_memory_fragmentation() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-H6",
-        title="LLM inference latency drifts up on production A100 pool",
-        description=(
-            "Enterprise API latency for the inference gateway has drifted "
-            "from 420ms to 1.4s over 36 hours, with OOMs on larger prompts."
-        ),
-        category="ml_inference",
-        difficulty="hard",
-        root_cause="gpu_memory_fragmentation_after_prompt_schema_change",
-        root_cause_synonyms=(
-            "gpu memory fragmentation after prompt schema change",
-            "kv cache fragmentation",
-            "inference pool memory fragmentation",
-        ),
-        clue_keywords=("gpu", "memory", "fragmentation", "kv", "cache"),
-        signals=(
-            "Free VRAM fragmented into small blocks even though total free > 18GB",
-            "OOM errors concentrate on prompts >2k tokens",
-        ),
-        logs={
-            "inference-gateway": "CUDA OOM despite torch reports 18GB free; fragmentation detected",
-            "model-runner": "Prompt schema v3 increased variable sequence lengths",
-        },
-        red_herring_logs={
-            "auth-service": "steady",
-        },
-        metrics={
-            "dash-inference": "p99_latency_ms 1400, oom_rate 3.2%",
-            "dash-gpu": "vram_fragmentation_score 0.74",
-        },
-        kb={
-            "kb-vram": "Recycle inference workers daily and pad sequences to bucketed lengths.",
-        },
-        good_handoff="investigator_agent",
-        accepted_fix_keywords=(
-            ("recycle", "inference", "workers"),
-            ("bucket", "prompt", "lengths"),
-            ("rollback", "prompt", "schema"),
-        ),
-        required_investigations=3,
-        customer_tier="enterprise",
-        affected_users_estimate=5_200,
-        revenue_impact_usd_per_min=1_850,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
-
-
-def _replication_saturation() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-H7",
-        title="Cross-region replication lag blocks disaster-recovery RPO",
-        description=(
-            "Replication lag from the primary region to DR has exceeded "
-            "five minutes for the last hour, violating RPO=60s."
-        ),
-        category="data",
-        difficulty="hard",
-        root_cause="replication_saturation_during_backup_window",
-        root_cause_synonyms=(
-            "replication saturation during backup window",
-            "wal shipping backpressure",
-            "replica network saturation",
-        ),
-        clue_keywords=("replication", "lag", "wal", "rpo", "backup"),
-        signals=(
-            "Lag correlates exactly with nightly backup window",
-            "Network egress saturated on primary -> DR link",
-        ),
-        logs={
-            "db-primary": "WAL shipping backpressure; replica slot lagging 6.2m",
-            "backup-job": "Base backup in progress; 4.1 GB/s read rate",
-        },
-        red_herring_logs={
-            "notification-gateway": "steady delivery",
-        },
-        metrics={
-            "dash-replication": "lag_seconds 372 (rpo=60)",
-            "dash-network": "egress_primary_to_dr 9.8 Gbps (cap=10)",
-        },
-        kb={
-            "kb-replication-backup": "Throttle backup or move it off hours of peak replication traffic.",
-        },
-        good_handoff="ops_manager_agent",
-        accepted_fix_keywords=(
-            ("throttle", "backup", "rate"),
-            ("shift", "backup", "window"),
-            ("raise", "replication", "bandwidth"),
-        ),
-        required_investigations=3,
-        customer_tier="enterprise",
-        affected_users_estimate=8_900,
-        revenue_impact_usd_per_min=1_400,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
-
-
-def _cache_key_collision() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-H8",
-        title="Cross-tenant data bleed from cache key collision",
-        description=(
-            "A rare cache key collision is briefly returning one enterprise "
-            "tenant's data to another. This is a data-isolation incident."
-        ),
-        category="security",
-        difficulty="hard",
-        root_cause="cache_key_collision_across_tenants",
-        root_cause_synonyms=(
-            "cache key collision across tenants",
-            "shared cache tenant bleed",
-            "tenant id missing from cache key",
-        ),
-        clue_keywords=("cache", "key", "collision", "tenant"),
-        signals=(
-            "Two enterprise tenants report seeing each other's dashboard metadata",
-            "Cache key construction omits tenant-id under a specific code path",
-        ),
-        logs={
-            "api-gateway": "Cache HIT for key=/v2/workspace/42 served to tenant=91",
-            "cache-layer": "Collision detected between tenants 42 and 91 on key prefix /v2/workspace",
-        },
-        red_herring_logs={
-            "email-service": "steady",
-        },
-        metrics={
-            "dash-cache": "collision_count 14 in last 2h",
-            "dash-security": "isolation_violations 2",
-        },
-        kb={
-            "kb-cache-tenant": "Prefix every cache key with tenant_id and enforce via lint check.",
-        },
-        good_handoff="ops_manager_agent",
-        accepted_fix_keywords=(
-            ("prefix", "tenant", "cache"),
-            ("invalidate", "shared", "cache"),
-            ("quarantine", "cache", "segment"),
-        ),
-        required_investigations=3,
-        customer_tier="enterprise",
-        affected_users_estimate=320,
-        revenue_impact_usd_per_min=2_100,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
-
-
-def _cron_dst_double_trigger() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-H9",
-        title="Scheduled jobs fire twice at DST rollover",
-        description=(
-            "Key premium billing jobs executed twice at the daylight-saving "
-            "transition, causing premium charge duplicates."
-        ),
-        category="scheduling",
-        difficulty="hard",
-        root_cause="cron_dst_transition_double_trigger",
-        root_cause_synonyms=(
-            "cron dst transition double trigger",
-            "scheduler timezone ambiguity",
-            "dst fallback replay",
-        ),
-        clue_keywords=("cron", "dst", "timezone", "scheduler"),
-        signals=(
-            "Job history shows two runs at 01:00 and 01:00 local time",
-            "Billing duplicates concentrate on a single geographic region",
-        ),
-        logs={
-            "scheduler": "Fired job billing.nightly at 2026-03-29 01:00 (GMT+1 and GMT+0)",
-            "billing-worker": "Second invocation completed 12 minutes after first",
-        },
-        red_herring_logs={
-            "catalog-api": "steady 2xx",
-        },
-        metrics={
-            "dash-scheduler": "double_fire_count 3 (expected 0)",
-            "dash-billing": "duplicate_charge_rate 2.1%",
-        },
-        kb={
-            "kb-dst-schedule": "Anchor scheduled jobs on UTC and convert to local time at display only.",
-        },
-        good_handoff="investigator_agent",
-        accepted_fix_keywords=(
-            ("anchor", "schedule", "utc"),
-            ("deduplicate", "scheduled", "runs"),
-            ("reconcile", "duplicate", "charges"),
-        ),
-        required_investigations=3,
-        customer_tier="premium",
-        affected_users_estimate=6_400,
-        revenue_impact_usd_per_min=1_100,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
-
-
-def _partial_publish_feed() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-H10",
-        title="Real-time feed gaps during partial publish",
-        description=(
-            "Premium trading-floor customers see gaps in the realtime price "
-            "feed after a publisher restart; some updates never arrived."
-        ),
-        category="realtime",
-        difficulty="hard",
-        root_cause="partial_publish_without_transaction_boundary",
-        root_cause_synonyms=(
-            "partial publish without transaction boundary",
-            "publisher crash mid batch",
-            "realtime feed gap",
-        ),
-        clue_keywords=("publish", "transaction", "feed", "partial"),
-        signals=(
-            "Sequence numbers skip in a bounded window around the publisher restart",
-            "Replay API can fill the gap but live subscribers missed it",
-        ),
-        logs={
-            "price-publisher": "Process restarted mid-batch, seq=88230 not flushed",
-            "realtime-bus": "Detected sequence gap 88230-88236 on channel=prices.us",
-        },
-        red_herring_logs={
-            "auth-service": "steady",
-        },
-        metrics={
-            "dash-realtime": "gap_count 6 in 30s, subscriber_reconcile_lag_s 48",
-        },
-        kb={
-            "kb-publish-txn": "Wrap each batch in a transactional publish so crashes never leave gaps.",
-        },
-        good_handoff="investigator_agent",
-        accepted_fix_keywords=(
-            ("enable", "transactional", "publish"),
-            ("replay", "sequence", "gap"),
-            ("force", "subscriber", "reconcile"),
-        ),
-        required_investigations=3,
-        customer_tier="premium",
-        affected_users_estimate=3_900,
-        revenue_impact_usd_per_min=1_750,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
-
-
-def _ssd_firmware_regression() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-H11",
-        title="Storage checksum failures on upgraded SSD fleet",
-        description=(
-            "Enterprise object storage is returning checksum-mismatch errors "
-            "on a subset of volumes after a firmware roll-forward."
-        ),
-        category="storage",
-        difficulty="hard",
-        root_cause="ssd_firmware_checksum_regression",
-        root_cause_synonyms=(
-            "ssd firmware checksum regression",
-            "storage firmware corruption",
-            "nvme firmware crc bug",
-        ),
-        clue_keywords=("firmware", "ssd", "checksum", "storage"),
-        signals=(
-            "Checksum failures concentrate on volumes upgraded in the last 72 hours",
-            "Vendor advisory mentions similar symptoms after firmware F2.14",
-        ),
-        logs={
-            "storage-agent": "CRC mismatch on volume vol-221 firmware=F2.14",
-            "fleet-manager": "Upgrade batch included F2.14 for 18 volumes",
-        },
-        red_herring_logs={
-            "email-service": "steady",
-        },
-        metrics={
-            "dash-storage": "checksum_error_rate 0.8%",
-            "dash-fleet": "volumes_on_F2.14 18, volumes_healthy 402",
-        },
-        kb={
-            "kb-ssd-firmware": "Quarantine affected firmware and roll back to the last known-good version.",
-        },
-        good_handoff="ops_manager_agent",
-        accepted_fix_keywords=(
-            ("rollback", "ssd", "firmware"),
-            ("quarantine", "affected", "volumes"),
-            ("reseed", "checksum", "index"),
-        ),
-        required_investigations=3,
-        customer_tier="enterprise",
-        affected_users_estimate=1_800,
-        revenue_impact_usd_per_min=1_950,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
-
-
 def build_incident_library() -> IncidentLibrary:
-    """Return the built-in enterprise incident library (30 templates)."""
     return IncidentLibrary(
         templates_by_task={
-            "easy": [
-                _redis_pool(),
-                _jwt_clock_skew(),
-                _email_spam_false_positive(),
-                _dns_ttl_stale(),
-                _cdn_purge_scope(),
-                _autocomplete_stale(),
-                _webhook_retry_budget(),
-                _thumbnail_worker_oom(),
-            ],
             "medium": [
                 _cache_invalidation_lag(),
                 _tz_normalization(),
                 _invoice_idempotency(),
                 _tls_expiry(),
                 _feature_flag_rollout(),
-                _recommender_heap_leak(),
-                _consumer_group_rebalance(),
-                _config_push_skipped_canary(),
-                _health_check_flapping(),
-                _payment_webhook_dedupe(),
-                _origin_shield_bypass(),
             ],
             "hard": [
                 _promo_rate_cascade(),
@@ -1736,12 +868,6 @@ def build_incident_library() -> IncidentLibrary:
                 _alert_storm(),
                 _inventory_race(),
                 _deadlock_database(),
-                _gpu_memory_fragmentation(),
-                _replication_saturation(),
-                _cache_key_collision(),
-                _cron_dst_double_trigger(),
-                _partial_publish_feed(),
-                _ssd_firmware_regression(),
             ],
         }
     )
 
     )


 def build_incident_library() -> IncidentLibrary:
+    """Return the built-in enterprise incident library."""
     return IncidentLibrary(
         templates_by_task={
+            "easy": [_redis_pool(), _jwt_clock_skew(), _email_spam_false_positive()],
             "medium": [
                 _cache_invalidation_lag(),
                 _tz_normalization(),
                 _invoice_idempotency(),
                 _tls_expiry(),
                 _feature_flag_rollout(),
             ],
             "hard": [
                 _promo_rate_cascade(),
@@ -1736,12 +868,6 @@ def build_incident_library() -> IncidentLibrary:
                 _alert_storm(),
                 _inventory_race(),
                 _deadlock_database(),
             ],
         }
     )