Spaces: Torchflow1/Multi-Agent-Incident-Command-Center

SwapnilPatil28 committed 28 days ago

Commit 8cbdbde · verified · 1 Parent(s): 540b82c

Final Update

Files changed (6)

README.md +3 -7
docs/BLOG_POST.md +2 -5
docs/SUBMISSION_CHECKLIST.md +7 -9
scripts/before_after_demo.py +1 -1
server/app.py +1 -335
server/domain/incidents.py +2 -876

README.md CHANGED

@@ -43,7 +43,7 @@ A **virtual war room** where three specialist agents resolve a live queue of rea
 | 🧪 **Investigator** | Apply a fix · roll back a deploy | Escalate or file a post-mortem |
 | 👷 **Ops Manager** | Escalate · file post-mortem · **close the ticket** | Apply a code fix |
-**30 unique incident templates** · **3 difficulty tiers** (8 easy / 11 medium / 11 hard) · **14+ named reward signals** · **customer-tier weighting** (enterprise outages cost ~3× a free-tier outage)
 > Wrong actor → **−0.08**. Wrong root-cause on an enterprise ticket → **−1.98**. Correct closure on an enterprise ticket → **+1.44**. The rules matter — and every step tells you *why* it was scored.
@@ -113,7 +113,6 @@ Same pipeline, same data recipe, smaller backbone:
 | 💻 **Source code** | **[GitHub repo ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
 | 🎓 **Reproduce the training** | **[One-click Colab notebook ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
 | 📝 **Mini blog post** (the required short writeup) | **[`docs/BLOG_POST.md`](./docs/BLOG_POST.md)** |
-| 🎬 **2-minute video script** (optional bonus) | **[`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md)** |
 > Want the rubric math, architecture, full numbers, configuration, and the hackathon checklist? Keep scrolling — **Part 2** is the full technical README.
@@ -130,7 +129,6 @@ Same pipeline, same data recipe, smaller backbone:
 | GitHub repository | **[`github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
 | Training notebook (Colab T4, one-click reproducible) | **[Open in Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
 | Mini blog post (the required short writeup) | [`docs/BLOG_POST.md`](./docs/BLOG_POST.md) |
-| 2-minute video script (optional bonus) | [`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md) |
 | Submission checklist | [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md) |
 | Training script (Python) | [`train_trl.py`](./train_trl.py) |
@@ -638,7 +636,6 @@ Two scripts judges (or you) can run without a local IDE:
 │
 ├── docs/
 │   ├── BLOG_POST.md                   # The short writeup (rule 4) — renders on HF Space + GitHub
-│   ├── VIDEO_SCRIPT.md                # Optional 2-minute walkthrough script
 │   └── SUBMISSION_CHECKLIST.md        # Judging-criteria status + smoke tests
 │
 ├── artifacts/                         # All committed training evidence
@@ -661,7 +658,7 @@ Two scripts judges (or you) can run without a local IDE:
 │   ├── Dockerfile                     # Production image (HEALTHCHECK included)
 │   └── domain/
 │       ├── __init__.py
-│       ├── incidents.py               # 30 enterprise incident templates + factory
 │       ├── reward.py                  # Composable rubric engine (20+ components)
 │       ├── roles.py                   # Role-based permission policy
 │       └── rng.py                     # Deterministic per-episode RNG
@@ -697,7 +694,7 @@ ENV_LOG_LEVEL: "INFO"
 Full checklist with pre-submission smoke tests → [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md).
 - [x] **OpenEnv latest runtime** and `openenv validate` passing — [Space live](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)
-- [x] **Multi-agent, long-horizon environment** with role-gated action space (3 roles × 9 actions, **30 unique incident templates**)
 - [x] **Composable, transparent, anti-gaming reward rubric** (14+ named components, tier-scaled)
 - [x] **Business-impact-aware scoring** (customer tier, revenue impact, SLA countdown)
 - [x] **End-to-end TRL SFT pipeline** that saves a checkpoint and re-evaluates it in the environment ([`train_trl.py`](./train_trl.py))
@@ -708,7 +705,6 @@ Full checklist with pre-submission smoke tests → [`docs/SUBMISSION_CHECKLIST.m
 - [x] **Structured JSON logging** + 12-factor configuration
 - [x] **One-click Colab training notebook** → [Open ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)
 - [x] **Mini blog post** published as an MD file on both the HF Space and GitHub: [`docs/BLOG_POST.md`](./docs/BLOG_POST.md)
-- [x] **2-minute video script** (optional bonus): [`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md)
 - [x] **Full submission checklist** mapping every rule → evidence: [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md)
 ---

 | 🧪 **Investigator** | Apply a fix · roll back a deploy | Escalate or file a post-mortem |
 | 👷 **Ops Manager** | Escalate · file post-mortem · **close the ticket** | Apply a code fix |
+**13 real incidents** · **3 difficulty tiers** (easy / medium / hard) · **14+ named reward signals** · **customer-tier weighting** (enterprise outages cost ~3× a free-tier outage)
 > Wrong actor → **−0.08**. Wrong root-cause on an enterprise ticket → **−1.98**. Correct closure on an enterprise ticket → **+1.44**. The rules matter — and every step tells you *why* it was scored.
 | 💻 **Source code** | **[GitHub repo ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
 | 🎓 **Reproduce the training** | **[One-click Colab notebook ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
 | 📝 **Mini blog post** (the required short writeup) | **[`docs/BLOG_POST.md`](./docs/BLOG_POST.md)** |
 > Want the rubric math, architecture, full numbers, configuration, and the hackathon checklist? Keep scrolling — **Part 2** is the full technical README.
 | GitHub repository | **[`github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
 | Training notebook (Colab T4, one-click reproducible) | **[Open in Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
 | Mini blog post (the required short writeup) | [`docs/BLOG_POST.md`](./docs/BLOG_POST.md) |
 | Submission checklist | [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md) |
 | Training script (Python) | [`train_trl.py`](./train_trl.py) |
 │
 ├── docs/
 │   ├── BLOG_POST.md                   # The short writeup (rule 4) — renders on HF Space + GitHub
 │   └── SUBMISSION_CHECKLIST.md        # Judging-criteria status + smoke tests
 │
 ├── artifacts/                         # All committed training evidence
 │   ├── Dockerfile                     # Production image (HEALTHCHECK included)
 │   └── domain/
 │       ├── __init__.py
+│       ├── incidents.py               # 13 enterprise incident templates + factory
 │       ├── reward.py                  # Composable rubric engine (20+ components)
 │       ├── roles.py                   # Role-based permission policy
 │       └── rng.py                     # Deterministic per-episode RNG
 Full checklist with pre-submission smoke tests → [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md).
 - [x] **OpenEnv latest runtime** and `openenv validate` passing — [Space live](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)
+- [x] **Multi-agent, long-horizon environment** with role-gated action space (3 roles × 9 actions, 13 incidents)
 - [x] **Composable, transparent, anti-gaming reward rubric** (14+ named components, tier-scaled)
 - [x] **Business-impact-aware scoring** (customer tier, revenue impact, SLA countdown)
 - [x] **End-to-end TRL SFT pipeline** that saves a checkpoint and re-evaluates it in the environment ([`train_trl.py`](./train_trl.py))
 - [x] **Structured JSON logging** + 12-factor configuration
 - [x] **One-click Colab training notebook** → [Open ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)
 - [x] **Mini blog post** published as an MD file on both the HF Space and GitHub: [`docs/BLOG_POST.md`](./docs/BLOG_POST.md)
 - [x] **Full submission checklist** mapping every rule → evidence: [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md)
 ---
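
The role-permission rows and the "Wrong actor → −0.08" note in the README hunk above describe a role-gated action space. A minimal sketch of such a gate follows, under stated assumptions: the three actor names and `inspect_logs` appear elsewhere in this commit, but every other action name, the `ALLOWED_ACTORS` mapping, and the `actor_penalty` helper are hypothetical illustrations, not the contents of `server/domain/roles.py`.

```python
# Hypothetical permission table. Only the actor names and "inspect_logs"
# come from the commit; the other action names are invented for illustration.
ALLOWED_ACTORS = {
    "inspect_logs": frozenset({"triage_agent"}),        # assumption: triage owns log inspection
    "apply_fix": frozenset({"investigator_agent"}),
    "rollback_deploy": frozenset({"investigator_agent"}),
    "escalate": frozenset({"ops_manager_agent"}),
    "file_postmortem": frozenset({"ops_manager_agent"}),
    "close_incident": frozenset({"ops_manager_agent"}),
}

# Penalty quoted in the README for an action taken by the wrong role.
WRONG_ACTOR_PENALTY = -0.08


def actor_penalty(actor: str, action_type: str) -> float:
    """Return 0.0 when the actor may perform the action, else the penalty.

    Unknown action types are treated as not permitted in this sketch.
    """
    allowed = ALLOWED_ACTORS.get(action_type, frozenset())
    return 0.0 if actor in allowed else WRONG_ACTOR_PENALTY
```

With this table, `actor_penalty("ops_manager_agent", "apply_fix")` yields the penalty, matching the README row that says the Ops Manager cannot apply a code fix.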

docs/BLOG_POST.md CHANGED

@@ -13,7 +13,6 @@
 | 💻 **GitHub source code** | **[github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
 | 🎓 **Reproducible training (Colab T4)** | **[Open in Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
 | 📖 **Full README** (story + technical deep-dive) | **[github.com/.../README.md ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center#readme)** |
-| 🎬 **2-min video walkthrough script** (optional bonus) | [`docs/VIDEO_SCRIPT.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/VIDEO_SCRIPT.md) |
 | ✅ **Submission checklist** | [`docs/SUBMISSION_CHECKLIST.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/SUBMISSION_CHECKLIST.md) |
 ---
@@ -24,7 +23,7 @@
 Each role has **different permissions**, **different information needs**, and a **different clock to beat**. Get it wrong and you bleed budget, bust the SLA, and — if the customer is on an enterprise contract — lose serious money (~3× what a free-tier outage costs).
-I built a simulator of that war room — an **OpenEnv-compatible** environment with **30 realistic incident templates**, 3 specialist roles, and 14+ named reward signals — and fine-tuned an LLM to run it.
 | Role | Can do | Cannot do |
 |---|---|---|
@@ -234,7 +233,6 @@ I ran the exact same pipeline with the smaller **Qwen2.5-0.5B-Instruct** backbon
 | **Source + tests** | [GitHub repo (21 passing tests, Dockerfile with HEALTHCHECK)](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center) |
 | **Full docs** | [README — Part 1 story + Part 2 technical deep-dive](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center#readme) |
 | **Committed evidence** | [`artifacts/`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/tree/main/artifacts) — all 4 PNGs + both JSON metric files |
-| **2-min video script** (optional bonus) | [`docs/VIDEO_SCRIPT.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/VIDEO_SCRIPT.md) |
 | **Submission checklist** | [`docs/SUBMISSION_CHECKLIST.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/SUBMISSION_CHECKLIST.md) |
 ---
@@ -242,9 +240,8 @@ I ran the exact same pipeline with the smaller **Qwen2.5-0.5B-Instruct** backbon
 ## 8. What's next
 - **Replace SFT with GRPO or PPO** using the environment's native reward signal — no heuristic teacher, let the rubric itself shape the policy and push past the imitation ceiling.
-- **Grow the incident catalog further** (now at 30 templates — next stop 50+ via JSON-defined scenarios).
 - **Add a second "adversarial" agent** that injects misleading signals to test robustness.
-- **Record the 2-minute walkthrough** from [`docs/VIDEO_SCRIPT.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/VIDEO_SCRIPT.md) as a bonus companion to this writeup.
 If you want to run it yourself, the Space and the repo are fully self-contained — `docker run` the image and point any OpenEnv-compatible client at it. Or just hit `/reset` and `/step` yourself from any language that can speak HTTP JSON.

 | 💻 **GitHub source code** | **[github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
 | 🎓 **Reproducible training (Colab T4)** | **[Open in Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
 | 📖 **Full README** (story + technical deep-dive) | **[github.com/.../README.md ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center#readme)** |
 | ✅ **Submission checklist** | [`docs/SUBMISSION_CHECKLIST.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/SUBMISSION_CHECKLIST.md) |
 ---
 Each role has **different permissions**, **different information needs**, and a **different clock to beat**. Get it wrong and you bleed budget, bust the SLA, and — if the customer is on an enterprise contract — lose serious money (~3× what a free-tier outage costs).
+I built a simulator of that war room — an **OpenEnv-compatible** environment with 13 realistic incidents, 3 specialist roles, and 14+ named reward signals — and fine-tuned an LLM to run it.
 | Role | Can do | Cannot do |
 |---|---|---|
 | **Source + tests** | [GitHub repo (21 passing tests, Dockerfile with HEALTHCHECK)](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center) |
 | **Full docs** | [README — Part 1 story + Part 2 technical deep-dive](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center#readme) |
 | **Committed evidence** | [`artifacts/`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/tree/main/artifacts) — all 4 PNGs + both JSON metric files |
 | **Submission checklist** | [`docs/SUBMISSION_CHECKLIST.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/SUBMISSION_CHECKLIST.md) |
 ---
 ## 8. What's next
 - **Replace SFT with GRPO or PPO** using the environment's native reward signal — no heuristic teacher, let the rubric itself shape the policy and push past the imitation ceiling.
+- **Scale the incident catalog** from 13 templates to 50+ (drop in JSON-defined scenarios).
 - **Add a second "adversarial" agent** that injects misleading signals to test robustness.
 If you want to run it yourself, the Space and the repo are fully self-contained — `docker run` the image and point any OpenEnv-compatible client at it. Or just hit `/reset` and `/step` yourself from any language that can speak HTTP JSON.
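
The blog post's closing line says the environment can be driven by plain HTTP JSON calls to `/reset` and `/step`. A rough standard-library sketch of such a client is below. The request shapes are assumptions: the `task_name` and `seed` fields mirror the arguments of `env.reset(...)` seen in `server/app.py`, the action fields mirror the `IncidentAction` keys used there, and wrapping the action under an `"action"` key is a guess at the wire format that should be checked against the OpenEnv server before relying on it. The helper names are hypothetical.

```python
import json
import urllib.request

# Space URL as listed in the README; swap in a local `docker run` address if preferred.
BASE_URL = "https://swapnilpatil28-multi-agent-incident-command-center.hf.space"


def build_reset_payload(task_name: str = "easy", seed: int = 0) -> dict:
    """Body for POST /reset (field names assumed from env.reset's arguments)."""
    return {"task_name": task_name, "seed": seed}


def build_step_payload(actor: str, action_type: str, target: str, reason: str = "") -> dict:
    """Body for POST /step (the "action" wrapper is an assumption)."""
    return {
        "action": {
            "actor": actor,
            "action_type": action_type,
            "target": target,
            "reason": reason,
        }
    }


def post_json(path: str, payload: dict, base_url: str = BASE_URL, timeout: float = 30.0) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def demo() -> None:
    """Reset an easy episode, then take one log-inspection step (needs network)."""
    print(post_json("/reset", build_reset_payload("easy", seed=0)))
    step = build_step_payload("triage_agent", "inspect_logs", "payments-api", "first look")
    print(post_json("/step", step))
```

Calling `demo()` performs the two requests; the payload builders can be inspected offline without contacting the Space.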

docs/SUBMISSION_CHECKLIST.md CHANGED

@@ -11,10 +11,10 @@ Status against every hard gate in the official judging rules, plus every polish
 | 1 | **Use OpenEnv (latest release). Build on top of the framework; don't reinvent the wheel.** | ✅ | `requirements.txt` pins `openenv-core>=0.2.2`, `openenv.yaml` has `version: "3.0"`, `server/environment.py` extends `openenv.core.environment.Environment`, app built via `openenv.core.env_server.create_fastapi_app`. |
 | 2 | **Working training script (Unsloth / HF TRL / any RL framework), ideally as a Colab notebook so judges can re-run it.** | ✅ | [`train_trl.py`](../train_trl.py) uses HF TRL `SFTTrainer`. **[One-click Colab notebook ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** runs the whole pipeline end-to-end on a T4 in ~1 h 15 min. |
 | 3 | **Evidence that you actually trained: at minimum, loss and reward plots from a real run.** | ✅ | Four plots committed to [`artifacts/`](../artifacts): `training_curve.png` (loss + token accuracy), `reward_curve.png` (4-policy reward by tier), `reward_components.png` (per-component breakdown), plus the 0.5B ablation `reward_curve_qwen0p5b.png`. Full `training_log.json` + `summary_metrics.json` committed alongside. |
-| 4 | **Short writeup or video: mini-blog on Hugging Face OR <2-min YouTube video, linked from README.** | ✅ | Mini-blog lives as [`docs/BLOG_POST.md`](./BLOG_POST.md) — shipped as part of the HF Space (rule 4 says "mini-blog on Hugging Face"; the Space is on HF and contains this file, so it renders at `huggingface.co/spaces/.../blob/main/docs/BLOG_POST.md`). All four training plots render inline via raw GitHub URLs. README and dashboard both link to it. A 2-minute walkthrough script is also committed at [`docs/VIDEO_SCRIPT.md`](./VIDEO_SCRIPT.md) as a bonus. |
 | 5 | **Push your environment to a Hugging Face Space so it's discoverable and runnable.** | ✅ | **Live at [`swapnilpatil28-multi-agent-incident-command-center.hf.space`](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)** · Space page: [`huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center). |
 | 6 | **README motivates the problem, explains how the env works, and shows results.** | ✅ | [`README.md`](../README.md) — Part 1 ("Story in 2 minutes") opens with the problem in plain English, walks through the environment via role-permission tables, and shows all four plots + headline numbers. Part 2 is the full technical deep-dive (architecture, action/observation spaces, reward rubric, training pipeline, 0.5B ablation, ops/observability, testing, repo layout). |
-| 7 | **README links to the HF Space + all additional materials (video, blog, slides, etc.).** | ✅ | "Live links" table inside Part 2 of the README lists every resource. Part 1 also has a "Try it in 30 seconds" CTA table. The dashboard header plus "Resources & documentation" grid surface the same links from the live Space itself. |
 | 8 | **Do not include big video files in the HF submission — only public URLs.** | ✅ | No video files committed. All assets in [`artifacts/`](../artifacts) are PNG plots (≤ 162 KB each) + JSON. Repo weight is dominated by text and small images. |
 ---
@@ -26,7 +26,7 @@ Status against every hard gate in the official judging rules, plus every polish
 - [x] Multi-role, multi-agent — `triage_agent`, `investigator_agent`, `ops_manager_agent` with **non-overlapping permissions** (`server/domain/roles.py`).
 - [x] Long-horizon — 3–5 sequential incidents per episode, 20–60 steps each, shared SLA + budget counters.
 - [x] Professional / enterprise task simulation — realistic logs, metrics, KB articles, customer-tier revenue impact, SLA timers.
-- [x] **30 unique incident templates** across easy / medium / hard (`server/domain/incidents.py`) — 8 easy, 11 medium, 11 hard, covering services (payments, auth, CDN, search, DNS, ML inference, storage, scheduling, messaging, config distribution) and failure modes (OOM, cert expiry, config drift, DNS TTL staleness, rate-limit cascades, GPU fragmentation, cross-region replication lag, DST scheduler bugs, firmware regressions, cache-key tenant collisions).
 - [x] Rich observation schema — customer tier, revenue impact, allowed actors per action, investigation targets grouped by tool, playbook hints, `reward_components`, `last_action_notes`.
 - [x] Composable reward rubric with **14+ named components** and anti-gaming safeguards (`server/domain/reward.py`).
 - [x] Tier-weighted business impact (`free ×0.6 · standard ×1.0 · premium ×1.4 · enterprise ×1.8`).
@@ -37,9 +37,8 @@ Status against every hard gate in the official judging rules, plus every polish
 - [x] README **Part 1 — The story in 2 minutes** written in plain English, readable by a non-technical judge in under 3 minutes.
 - [x] Every plot has a one-line caption explaining what it shows.
 - [x] Blog post [`docs/BLOG_POST.md`](./BLOG_POST.md) — eight labelled sections, four plots inline via raw GitHub URLs (render everywhere), 0.5B-vs-1.5B ablation narrative, explicit hackathon-theme mapping.
-- [x] Live HF Space dashboard has a **"Story in 2 minutes"** hero panel at the top, a role-permission table, a three-card theme mapping, and a "Resources & documentation" grid with 8 click-through links.
-- [x] Video script [`docs/VIDEO_SCRIPT.md`](./VIDEO_SCRIPT.md) committed (optional bonus; the blog satisfies the writeup rule by itself).
-- [x] All documentation cross-links cleanly — README ↔ dashboard ↔ blog post ↔ video script ↔ checklist.
 ### Improvement in Rewards (20%)
@@ -83,9 +82,9 @@ Status against every hard gate in the official judging rules, plus every polish
 |---|---|---|
 | 1 | Final training run (Qwen2.5-1.5B, 8 eps/task, 3 epochs) → all artifacts committed | ✅ |
 | 2 | Commit artifacts (`reward_curve.png`, `training_curve.png`, `reward_components.png`, `reward_curve_qwen0p5b.png`, `training_log.json`, `summary_metrics.json`, `summary_metrics_qwen0p5b.json`) | ✅ |
-| 3 | Update README with real numbers + real Space / Colab / GitHub / blog / video-script links | ✅ |
 | 4 | Deploy HF Space from the same commit | ✅ |
-| 5 | Dashboard upgraded: hero story panel, 4 stacked plots, resources grid with README / blog / video-script / checklist links | ✅ |
 | 6 | Blog post updated (`docs/BLOG_POST.md`) with fixed image paths (raw GitHub URLs) and 0.5B ablation section | ✅ |
 | 7 | All 21 tests passing on latest commit | ✅ |
 | 8 | Run `openenv validate` remotely against the Space — `./validate-submission.sh <space-url>` | ⬜ (run it once before the deadline) |
@@ -121,4 +120,3 @@ ENV_URL=https://swapnilpatil28-multi-agent-incident-command-center.hf.space pyth
 | Mini blog post (MD file in the repo, renders on both HF Space and GitHub) | [`docs/BLOG_POST.md`](./BLOG_POST.md) |
 | Reproducible training notebook | [Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing) |
 | Training evidence (all 4 plots + JSON metrics) | [`artifacts/`](../artifacts) folder |
-| 2-minute video script (optional bonus) | [`docs/VIDEO_SCRIPT.md`](./VIDEO_SCRIPT.md) |

 | 1 | **Use OpenEnv (latest release). Build on top of the framework; don't reinvent the wheel.** | ✅ | `requirements.txt` pins `openenv-core>=0.2.2`, `openenv.yaml` has `version: "3.0"`, `server/environment.py` extends `openenv.core.environment.Environment`, app built via `openenv.core.env_server.create_fastapi_app`. |
 | 2 | **Working training script (Unsloth / HF TRL / any RL framework), ideally as a Colab notebook so judges can re-run it.** | ✅ | [`train_trl.py`](../train_trl.py) uses HF TRL `SFTTrainer`. **[One-click Colab notebook ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** runs the whole pipeline end-to-end on a T4 in ~1 h 15 min. |
 | 3 | **Evidence that you actually trained: at minimum, loss and reward plots from a real run.** | ✅ | Four plots committed to [`artifacts/`](../artifacts): `training_curve.png` (loss + token accuracy), `reward_curve.png` (4-policy reward by tier), `reward_components.png` (per-component breakdown), plus the 0.5B ablation `reward_curve_qwen0p5b.png`. Full `training_log.json` + `summary_metrics.json` committed alongside. |
+| 4 | **Short writeup or video: mini-blog on Hugging Face OR <2-min YouTube video, linked from README.** | ✅ | Mini-blog lives as [`docs/BLOG_POST.md`](./BLOG_POST.md) — shipped as part of the HF Space (rule 4 says "mini-blog on Hugging Face"; the Space is on HF and contains this file, so it renders at `huggingface.co/spaces/.../blob/main/docs/BLOG_POST.md`). All four training plots render inline via raw GitHub URLs. README and dashboard both link to it. (No separate video submission.) |
 | 5 | **Push your environment to a Hugging Face Space so it's discoverable and runnable.** | ✅ | **Live at [`swapnilpatil28-multi-agent-incident-command-center.hf.space`](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)** · Space page: [`huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center). |
 | 6 | **README motivates the problem, explains how the env works, and shows results.** | ✅ | [`README.md`](../README.md) — Part 1 ("Story in 2 minutes") opens with the problem in plain English, walks through the environment via role-permission tables, and shows all four plots + headline numbers. Part 2 is the full technical deep-dive (architecture, action/observation spaces, reward rubric, training pipeline, 0.5B ablation, ops/observability, testing, repo layout). |
+| 7 | **README links to the HF Space + all additional materials (blog, slides, etc.).** | ✅ | "Live links" table inside Part 2 of the README lists every resource. Part 1 also has a "Try it in 30 seconds" CTA table. The dashboard header plus "Resources & documentation" grid surface the same links from the live Space itself. |
 | 8 | **Do not include big video files in the HF submission — only public URLs.** | ✅ | No video files committed. All assets in [`artifacts/`](../artifacts) are PNG plots (≤ 162 KB each) + JSON. Repo weight is dominated by text and small images. |
 ---
 - [x] Multi-role, multi-agent — `triage_agent`, `investigator_agent`, `ops_manager_agent` with **non-overlapping permissions** (`server/domain/roles.py`).
 - [x] Long-horizon — 3–5 sequential incidents per episode, 20–60 steps each, shared SLA + budget counters.
 - [x] Professional / enterprise task simulation — realistic logs, metrics, KB articles, customer-tier revenue impact, SLA timers.
+- [x] 13 unique incident templates across easy / medium / hard (`server/domain/incidents.py`).
 - [x] Rich observation schema — customer tier, revenue impact, allowed actors per action, investigation targets grouped by tool, playbook hints, `reward_components`, `last_action_notes`.
 - [x] Composable reward rubric with **14+ named components** and anti-gaming safeguards (`server/domain/reward.py`).
 - [x] Tier-weighted business impact (`free ×0.6 · standard ×1.0 · premium ×1.4 · enterprise ×1.8`).
 - [x] README **Part 1 — The story in 2 minutes** written in plain English, readable by a non-technical judge in under 3 minutes.
 - [x] Every plot has a one-line caption explaining what it shows.
 - [x] Blog post [`docs/BLOG_POST.md`](./BLOG_POST.md) — eight labelled sections, four plots inline via raw GitHub URLs (render everywhere), 0.5B-vs-1.5B ablation narrative, explicit hackathon-theme mapping.
+- [x] Live HF Space dashboard has a **"Story in 2 minutes"** hero panel at the top, a role-permission table, a three-card theme mapping, and a "Resources & documentation" grid with click-through links (README, blog, checklist, Colab, Space, etc.).
+- [x] All documentation cross-links cleanly — README ↔ dashboard ↔ blog post ↔ checklist.
 ### Improvement in Rewards (20%)
 |---|---|---|
 | 1 | Final training run (Qwen2.5-1.5B, 8 eps/task, 3 epochs) → all artifacts committed | ✅ |
 | 2 | Commit artifacts (`reward_curve.png`, `training_curve.png`, `reward_components.png`, `reward_curve_qwen0p5b.png`, `training_log.json`, `summary_metrics.json`, `summary_metrics_qwen0p5b.json`) | ✅ |
+| 3 | Update README with real numbers + real Space / Colab / GitHub / blog links | ✅ |
 | 4 | Deploy HF Space from the same commit | ✅ |
+| 5 | Dashboard upgraded: hero story panel, 4 stacked plots, resources grid with README / blog / checklist links | ✅ |
 | 6 | Blog post updated (`docs/BLOG_POST.md`) with fixed image paths (raw GitHub URLs) and 0.5B ablation section | ✅ |
 | 7 | All 21 tests passing on latest commit | ✅ |
 | 8 | Run `openenv validate` remotely against the Space — `./validate-submission.sh <space-url>` | ⬜ (run it once before the deadline) |
 | Mini blog post (MD file in the repo, renders on both HF Space and GitHub) | [`docs/BLOG_POST.md`](./BLOG_POST.md) |
 | Reproducible training notebook | [Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing) |
 | Training evidence (all 4 plots + JSON metrics) | [`artifacts/`](../artifacts) folder |
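
The checklist above lists tier weights of `free ×0.6 · standard ×1.0 · premium ×1.4 · enterprise ×1.8`, and the README says an enterprise outage costs about 3× a free-tier one. The arithmetic can be checked with a short sketch. `server/app.py` imports a `TIER_MULTIPLIER` name from `server.domain.reward`, but its actual structure is not shown in this commit, so the dictionary layout and the `scale_by_tier` helper below are assumptions for illustration only.

```python
import math

# Weights as quoted in the checklist; the dict layout itself is an assumption.
TIER_MULTIPLIER = {
    "free": 0.6,
    "standard": 1.0,
    "premium": 1.4,
    "enterprise": 1.8,
}


def scale_by_tier(base_reward: float, customer_tier: str) -> float:
    """Scale a base reward or penalty by the customer tier's weight."""
    return base_reward * TIER_MULTIPLIER[customer_tier]


# Ratio between the enterprise and free-tier weights for the same base value.
enterprise_vs_free = scale_by_tier(1.0, "enterprise") / scale_by_tier(1.0, "free")
assert math.isclose(enterprise_vs_free, 3.0)
```

The ratio 1.8 / 0.6 is where the "~3×" figure comes from: the same mistake on an enterprise ticket is weighted three times as heavily as on a free-tier one.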

scripts/before_after_demo.py CHANGED

@@ -2,7 +2,7 @@
 Runs both policies against the same task under the same seed, prints a clean
 side-by-side trace, and writes ``artifacts/before_after_demo.md`` which you
-can paste into the blog post or screen-record for the video.
 Usage (after ``train_trl.py`` has saved ``artifacts/sft_model``)::

 Runs both policies against the same task under the same seed, prints a clean
 side-by-side trace, and writes ``artifacts/before_after_demo.md`` which you
+can paste into the blog post or other writeups.
 Usage (after ``train_trl.py`` has saved ``artifacts/sft_model``)::

server/app.py CHANGED

@@ -38,13 +38,8 @@ from server.domain.reward import (
     TIER_MULTIPLIER,
 )
 from server.environment import IncidentCommandCenterEnvironment
-from server import llm_remote
 from server.logging_utils import configure_logging
-import re as _re
-_JSON_RE = _re.compile(r"\{[\s\S]*\}")
 _LOG = logging.getLogger("icc.app")
 _CONFIG = EnvConfig.from_env()
 configure_logging(level=_CONFIG.log_level, structured=_CONFIG.structured_logging)
@@ -61,7 +56,6 @@ COLAB_URL = "https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI
 # root for it; the other three open the HF file browser.
 README_URL = f"{SPACE_PAGE_URL}/blob/main/README.md"
 BLOG_POST_URL = f"{SPACE_PAGE_URL}/blob/main/docs/BLOG_POST.md"
-VIDEO_SCRIPT_URL = f"{SPACE_PAGE_URL}/blob/main/docs/VIDEO_SCRIPT.md"
 SUBMISSION_CHECKLIST_URL = f"{SPACE_PAGE_URL}/blob/main/docs/SUBMISSION_CHECKLIST.md"
 app = create_fastapi_app(
@@ -178,154 +172,6 @@ async def env_info() -> JSONResponse:
     return JSONResponse(_metadata_payload())
-# ---------------------------------------------------------------------------
-# Live LLM inference demo (optional — only enabled when HF credentials set)
-# ---------------------------------------------------------------------------
-def _build_demo_prompt(obs: IncidentObservation) -> str:
-    """Same prompt format the SFT model was fine-tuned on (train_trl.obs_to_prompt)."""
-    targets = obs.investigation_targets or {}
-    return (
-        "You are operating a multi-agent incident command center. "
-        "Pick the next action for the appropriate specialist role.\n\n"
-        f"Incident ID: {obs.incident_id}\n"
-        f"Title: {obs.incident_title}\n"
-        f"Description: {obs.incident_description}\n"
-        f"Customer tier: {obs.customer_tier} | "
-        f"Affected users: {obs.affected_users_estimate} | "
-        f"Revenue impact (USD/min): {obs.revenue_impact_usd_per_min}\n"
-        f"Postmortem required: {obs.postmortem_required}\n"
-        f"Visible signals: {', '.join(obs.visible_signals or [])}\n"
-        f"Available log targets: {', '.join(targets.get('logs', []) or [])}\n"
-        f"Available metric targets: {', '.join(targets.get('metrics', []) or [])}\n"
-        f"Available KB articles: {', '.join(targets.get('kb', []) or [])}\n"
-        f"Budget remaining: {obs.budget_remaining} actions | "
-        f"SLA remaining: {obs.sla_minutes_remaining} min | "
-        f"Clues found: {obs.clues_found} | "
-        f"Mitigation applied: {obs.mitigation_applied}\n"
-        f"Last terminal output: {obs.terminal_output}\n\n"
-        "Respond with a JSON object containing exactly these keys: "
-        "actor, action_type, target, root_cause, resolution_summary, "
-        "postmortem_note, confidence, reason."
-    )
-def _parse_llm_action(response_text: str) -> Dict[str, Any]:
-    """Extract the first balanced JSON object from a model response."""
-    match = _JSON_RE.search(response_text or "")
-    if not match:
-        return {}
-    raw = match.group(0)
-    last_close = raw.rfind("}")
-    if last_close != -1:
-        raw = raw[: last_close + 1]
-    try:
-        return json.loads(raw)
-    except (json.JSONDecodeError, TypeError):
-        return {}
-@app.get("/llm-demo-status", response_class=JSONResponse)
-async def llm_demo_status() -> JSONResponse:
-    """Report whether the live-inference panel is usable (credentials set)."""
-    return JSONResponse(llm_remote.status_summary())
-@app.post("/llm-demo", response_class=JSONResponse)
-async def llm_demo(payload: Dict[str, Any]) -> JSONResponse:
-    """Run one live step against the fine-tuned model behind an HF endpoint.
-    Spins up a fresh isolated ``IncidentCommandCenterEnvironment`` for each
-    call so the demo never disturbs the main environment instance that is
-    answering ``/reset`` and ``/step`` for training clients. Returns the full
-    trace (observation → prompt → raw LLM text → parsed action → reward) so
-    judges can see exactly what the model produced.
-    """
-    if not llm_remote.is_configured():
-        return JSONResponse(
-            {
-                "error": "Remote LLM not configured on this Space.",
-                "status": llm_remote.status_summary(),
-            },
-            status_code=503,
-        )
-    task_name = str(payload.get("task_name") or "easy").strip()
-    try:
-        seed = int(payload.get("seed") or _CONFIG.default_seed)
-    except (TypeError, ValueError):
-        seed = _CONFIG.default_seed
-    # Isolated env so the live demo never clobbers the shared state.
-    env = IncidentCommandCenterEnvironment()
-    obs = env.reset(task_name=task_name, seed=seed)
-    prompt = _build_demo_prompt(obs)
-    try:
-        raw_response = llm_remote.generate(prompt)
-    except Exception as exc:  # pragma: no cover - network-dependent
-        return JSONResponse(
-            {
-                "error": f"Remote LLM call failed: {exc}",
-                "status": llm_remote.status_summary(),
-            },
-            status_code=502,
-        )
-    parsed_action_dict = _parse_llm_action(raw_response)
-    try:
-        action = IncidentAction(**parsed_action_dict)
-        parsed_ok = True
-    except Exception:
-        logs = (obs.investigation_targets or {}).get("logs", []) or []
-        fallback_target = logs[0] if logs else "payments-api"
-        action = IncidentAction(
-            actor="triage_agent",
-            action_type="inspect_logs",
-            target=fallback_target,
-            reason="Fallback (LLM JSON invalid).",
-        )
-        parsed_ok = False
-    step_obs = env.step(action)
-    reward_components = dict(step_obs.reward_components or {})
-    reward_total = sum(reward_components.values()) if reward_components else 0.0
-    return JSONResponse(
-        {
-            "task_name": task_name,
-            "seed": seed,
-            "observation_before": {
-                "incident_id": obs.incident_id,
-                "incident_title": obs.incident_title,
-                "customer_tier": obs.customer_tier,
-                "affected_users_estimate": obs.affected_users_estimate,
-                "revenue_impact_usd_per_min": obs.revenue_impact_usd_per_min,
-                "visible_signals": obs.visible_signals,
-                "investigation_targets": obs.investigation_targets,
-                "budget_remaining": obs.budget_remaining,
-                "sla_minutes_remaining": obs.sla_minutes_remaining,
-            },
-            "prompt": prompt,
-            "raw_llm_response": raw_response,
-            "parsed_action": parsed_action_dict,
-            "validated_action": action.model_dump(exclude_none=True),
-            "fallback_used": not parsed_ok,
-            "step_result": {
-                "reward_total": round(reward_total, 4),
-                "reward_components": {
-                    k: round(v, 4) for k, v in reward_components.items()
-                },
-                "done": bool(step_obs.done),
-                "terminal_output": step_obs.terminal_output,
-                "last_action_notes": list(step_obs.last_action_notes or []),
-            },
-        }
-    )
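One detail in the removed handler is worth noting: `int(payload.get("seed") or _CONFIG.default_seed)` treats a seed of `0` as missing, because `0` is falsy in Python, and the dashboard script sends `Number(seedInp.value) || 0`. The following is a minimal sketch of a parser that keeps an explicit zero. The name `parse_seed` and the default value used in the usage note are illustrative assumptions, not names or values from the repo.

```python
from typing import Any, Mapping


def parse_seed(payload: Mapping[str, Any], default: int) -> int:
    """Read `seed` from a request payload, falling back only when it is absent or invalid."""
    raw = payload.get("seed")
    # Treat a missing key, null, or a boolean as "no seed supplied".
    if raw is None or isinstance(raw, bool):
        return default
    try:
        return int(raw)
    except (TypeError, ValueError, OverflowError):
        return default
```

With a hypothetical default of `42`, `parse_seed({"seed": 0}, 42)` keeps the zero, while `parse_seed({}, 42)` and `parse_seed({"seed": "abc"}, 42)` both fall back to the default.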
 @app.get("/metrics", response_class=PlainTextResponse)
 async def metrics() -> PlainTextResponse:
     env = _resolve_environment()
@@ -479,81 +325,6 @@ def _dashboard_html() -> str:
     # so the existing `{themes_html}` slot renders to nothing (no duplication).
     themes_html = ""
-    # --- Live inference panel (only shown when HF credentials set) ----------
-    llm_status = llm_remote.status_summary()
-    if llm_status.get("configured"):
-        live_panel_html = f"""
-    <h2>Try the fine-tuned model live</h2>
-    <div class='card'>
-      <p class='sub'>
-        Spin up an isolated episode and watch the <strong>fine-tuned SFT model</strong>
-        pick the next action in real time. The prompt below is the exact format
-        used during training, so you can see how the model transforms a raw
-        observation into a typed <code>IncidentAction</code> — and the
-        environment's reward response.
-      </p>
-      <div class='live-controls'>
-        <label>Task
-          <select id='live-task'>
-            <option value='easy'>easy</option>
-            <option value='medium'>medium</option>
-            <option value='hard' selected>hard</option>
-          </select>
-        </label>
-        <label>Seed
-          <input id='live-seed' type='number' value='42' min='0' step='1' />
-        </label>
-        <button id='live-run' class='pill cta'>▶ Run one step</button>
-        <span id='live-status' class='sub'>Endpoint: {llm_status.get('host', '—')} · mode: {llm_status.get('mode', 'chat')}</span>
-      </div>
-      <div id='live-output' class='live-output' hidden>
-        <div class='live-grid'>
-          <div>
-            <h4>Observation (before)</h4>
-            <pre id='live-obs-before'></pre>
-          </div>
-          <div>
-            <h4>Prompt sent to model</h4>
-            <pre id='live-prompt'></pre>
-          </div>
-          <div>
-            <h4>Raw LLM response</h4>
-            <pre id='live-raw'></pre>
-          </div>
-          <div>
-            <h4>Parsed &amp; validated action</h4>
-            <pre id='live-action'></pre>
-          </div>
-          <div class='live-grid-full'>
-            <h4>Environment step result</h4>
-            <pre id='live-step'></pre>
-          </div>
-        </div>
-      </div>
-      <div id='live-error' class='live-error' hidden></div>
-    </div>
-"""
-    else:
-        live_panel_html = f"""
-    <h2>Try the fine-tuned model live</h2>
-    <div class='card'>
-      <p class='sub'>
-        <strong>Optional bonus panel.</strong> This Space can stream the
-        fine-tuned SFT model's decisions in real time when a Hugging Face
-        Inference Endpoint is attached. {llm_status.get('reason', '')}
-      </p>
-      <details>
-        <summary class='sub'>How the owner enables it</summary>
-        <ol>
-          <li>Upload the SFT checkpoint from <code>artifacts/sft_model/</code> to a model repo on the Hub.</li>
-          <li>Create a dedicated <a href='https://huggingface.co/inference-endpoints' target='_blank' rel='noopener'>Inference Endpoint</a> (T4 small is enough).</li>
-          <li>Set <code>LLM_ENDPOINT_URL</code> and <code>HF_TOKEN</code> as secrets on this Space.</li>
-          <li>Restart the Space — this panel turns on automatically.</li>
-        </ol>
-      </details>
-    </div>
-"""
     # --- Reward-rubric details ----------------------------------------------
     reward_rubric_rows = "".join(
         f"<tr><td><code>{name}</code></td><td>{value}</td></tr>"
@@ -630,40 +401,6 @@ def _dashboard_html() -> str:
     td.delta.good {{ color: var(--good); }}
     .links {{ display:flex; flex-wrap:wrap; gap:0.5rem; }}
-    /* Live-inference panel (fine-tuned SFT model behind HF Inference Endpoint). */
-    .live-controls {{
-      display:flex; flex-wrap:wrap; gap:1rem; align-items:center;
-      margin:0.75rem 0 1rem;
-    }}
-    .live-controls label {{
-      display:flex; flex-direction:column; gap:0.2rem;
-      font-size:0.8rem; color:var(--muted);
-    }}
-    .live-controls select, .live-controls input {{
-      background:#0b1225; border:1px solid #1f2a44; color:var(--text);
-      border-radius:8px; padding:0.35rem 0.55rem; font-size:0.9rem; min-width:110px;
-    }}
-    .live-controls button.pill.cta {{ cursor:pointer; border:0; }}
-    .live-controls button.pill.cta:disabled {{ opacity:0.6; cursor:wait; }}
-    .live-grid {{
-      display:grid; grid-template-columns: repeat(auto-fit, minmax(360px, 1fr));
-      gap:0.9rem; margin-top:0.5rem;
-    }}
-    .live-grid h4 {{
-      margin:0 0 0.3rem; font-size:0.85rem; color:#cbd5e1;
-      text-transform:uppercase; letter-spacing:0.04em;
-    }}
-    .live-grid .live-grid-full {{ grid-column: 1 / -1; }}
-    .live-grid pre {{
-      background:#0b1225; border:1px solid #1f2a44; border-radius:10px;
-      padding:0.75rem; margin:0; font-size:0.82rem; line-height:1.45;
-      max-height:320px; overflow:auto; white-space:pre-wrap; word-wrap:break-word;
-    }}
-    .live-error {{
-      background:#2a1418; border:1px solid #ef444455; color:#fca5a5;
-      border-radius:10px; padding:0.75rem; margin-top:0.75rem; font-size:0.9rem;
-    }}
     /* "Story in 2 minutes" hero panel — plain-English summary for judges. */
     .hero-card {{
       background: linear-gradient(135deg, #0f2647 0%, #172a4a 60%, #1f2a44 100%);
@@ -739,8 +476,7 @@ def _dashboard_html() -> str:
       <h3 style='margin-top:1.25rem'>What is the environment?</h3>
       <p class='sub' style='margin:0 0 0.75rem'>
         Three specialist agents with <strong>different permissions</strong> resolve
-        a live queue drawn from <strong>30 realistic tech incident templates</strong>
-        across 3 difficulty tiers.
       </p>
       <div class='table-wrap'>
         <table>
@@ -847,11 +583,6 @@ def _dashboard_html() -> str:
         <div class='res-title'>Mini blog post</div>
         <div class='sub'>The short writeup — MD file on the HF Space + GitHub</div>
       </a>
-      <a class='res-card' href='{VIDEO_SCRIPT_URL}' target='_blank' rel='noopener'>
-        <div class='res-icon'>🎬</div>
-        <div class='res-title'>2-minute video script</div>
-        <div class='sub'>Optional bonus — shot list + narration</div>
-      </a>
       <a class='res-card' href='{SUBMISSION_CHECKLIST_URL}' target='_blank' rel='noopener'>
         <div class='res-icon'>✅</div>
         <div class='res-title'>Submission checklist</div>
@@ -947,8 +678,6 @@ def _dashboard_html() -> str:
     {ablation_html}
-    {live_panel_html}
     {themes_html}
     <h2>Endpoints</h2>
@@ -1017,7 +746,6 @@ def _dashboard_html() -> str:
       <a href='{COLAB_URL}' target='_blank' rel='noopener'>Colab</a> ·
       <a href='{README_URL}' target='_blank' rel='noopener'>README</a> ·
       <a href='{BLOG_POST_URL}' target='_blank' rel='noopener'>Blog post</a> ·
-      <a href='{VIDEO_SCRIPT_URL}' target='_blank' rel='noopener'>Video script</a> ·
       <a href='{SUBMISSION_CHECKLIST_URL}' target='_blank' rel='noopener'>Submission checklist</a>
     </div>
   </footer>
@@ -1028,68 +756,6 @@ def _dashboard_html() -> str:
       const total = Object.values(data.incidents_per_task || {{}}).reduce((a,b)=>a+b,0);
       document.getElementById('kpi-inc').textContent = total;
     }} catch (e) {{}}
-    // Live fine-tuned-model demo. Only runs if the panel is rendered.
-    (function() {{
-      const runBtn = document.getElementById('live-run');
-      if (!runBtn) return;
-      const taskSel = document.getElementById('live-task');
-      const seedInp = document.getElementById('live-seed');
-      const out     = document.getElementById('live-output');
-      const err     = document.getElementById('live-error');
-      const obsPre  = document.getElementById('live-obs-before');
-      const promptPre = document.getElementById('live-prompt');
-      const rawPre  = document.getElementById('live-raw');
-      const actPre  = document.getElementById('live-action');
-      const stepPre = document.getElementById('live-step');
-      function showError(msg) {{
-        err.textContent = msg;
-        err.hidden = false;
-        out.hidden = true;
-      }}
-      function renderOutput(data) {{
-        err.hidden = true;
-        obsPre.textContent = JSON.stringify(data.observation_before || {{}}, null, 2);
-        promptPre.textContent = data.prompt || '';
-        rawPre.textContent = data.raw_llm_response || '(empty response)';
-        const fallbackTag = data.fallback_used
-          ? '// NOTE: LLM JSON was invalid — safe fallback action was used instead.\\n'
-          : '';
-        actPre.textContent = fallbackTag + JSON.stringify(data.validated_action || {{}}, null, 2);
-        stepPre.textContent = JSON.stringify(data.step_result || {{}}, null, 2);
-        out.hidden = false;
-      }}
-      runBtn.addEventListener('click', async () => {{
-        runBtn.disabled = true;
-        const label = runBtn.textContent;
-        runBtn.textContent = '⏳ Calling model…';
-        try {{
-          const resp = await fetch('/llm-demo', {{
-            method: 'POST',
-            headers: {{'Content-Type': 'application/json'}},
-            body: JSON.stringify({{
-              task_name: taskSel.value,
-              seed: Number(seedInp.value) || 0
-            }})
-          }});
-          const data = await resp.json();
-          if (!resp.ok) {{
-            showError((data && data.error) ? data.error : ('HTTP ' + resp.status));
-          }} else {{
-            renderOutput(data);
-          }}
-        }} catch (e) {{
-          showError('Network error: ' + e.message);
-        }} finally {{
-          runBtn.disabled = false;
-          runBtn.textContent = label;
-        }}
-      }});
-    }})();
   </script>
 </body>
 </html>

       <h3 style='margin-top:1.25rem'>What is the environment?</h3>
       <p class='sub' style='margin:0 0 0.75rem'>
         Three specialist agents with <strong>different permissions</strong> resolve
+        a live queue of 13 realistic tech incidents across 3 difficulty tiers.
       </p>
       <div class='table-wrap'>
         <table>

server/domain/incidents.py CHANGED Viewed

@@ -850,885 +850,17 @@ def _deadlock_database() -> IncidentTemplate:
     )
-# ---------------------------------------------------------------------------
-# Extended catalog (round-2 polish)
-#
-# 17 additional templates balance the tier mix (free / standard / premium /
-# enterprise), add new service dimensions (DNS, CDN, ML inference, storage,
-# message queue, config distribution) and new failure modes (GPU memory leaks,
-# replication saturation, cache key collisions, firmware regressions, DST
-# bugs). Each template follows the same pattern as INC-E1..H5 so the reward
-# rubric, environment plumbing and training scripts require no changes.
-# ---------------------------------------------------------------------------
-def _dns_ttl_stale() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-E4",
-        title="Stale DNS routes free-tier API traffic to drained region",
-        description=(
-            "Free-tier API callers keep hitting a drained region even after "
-            "a planned failover because DNS TTLs have not expired."
-        ),
-        category="networking",
-        difficulty="easy",
-        root_cause="dns_ttl_stale_after_failover",
-        root_cause_synonyms=(
-            "dns ttl stale after failover",
-            "stale dns record",
-            "long ttl blocking failover",
-        ),
-        clue_keywords=("dns", "ttl", "failover", "drain"),
-        signals=(
-            "Traffic ratio to drained region stays above 30% 30 minutes post-failover",
-            "Only free-tier resolvers (no Anycast) are affected",
-        ),
-        logs={
-            "dns-edge": "A record TTL=3600s still cached at regional resolvers",
-            "traffic-router": "Residual traffic observed on drained region us-west-2b",
-        },
-        red_herring_logs={
-            "payments-api": "steady 2xx",
-        },
-        metrics={
-            "dash-dns": "ttl_expired_ratio 0.71 (expected >0.95)",
-            "dash-router": "drained_region_share 34%",
-        },
-        red_herring_metrics={
-            "dash-cdn": "hit_ratio 95%",
-        },
-        kb={
-            "kb-dns-ttl": "Pre-lower TTL to 60s at least 2 TTLs before planned failovers.",
-        },
-        good_handoff="triage_agent",
-        accepted_fix_keywords=(
-            ("shorten", "dns", "ttl"),
-            ("force", "resolver", "refresh"),
-            ("rollback", "region", "drain"),
-        ),
-        required_investigations=1,
-        customer_tier="free",
-        affected_users_estimate=2_500,
-        revenue_impact_usd_per_min=15,
-        requires_mitigation=True,
-    )
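The `accepted_fix_keywords` field groups keywords into tuples, but the matching logic is not part of this diff. As a sketch only, under the assumption that a fix counts when every keyword in at least one tuple appears in the agent's resolution summary, the check could look like the following. The function `fix_is_accepted` is hypothetical; the keyword groups are copied from the `INC-E4` template above.

```python
from typing import Sequence, Tuple

# Keyword groups copied from the INC-E4 template in this diff.
ACCEPTED_FIX_KEYWORDS: Tuple[Tuple[str, ...], ...] = (
    ("shorten", "dns", "ttl"),
    ("force", "resolver", "refresh"),
    ("rollback", "region", "drain"),
)


def fix_is_accepted(summary: str, groups: Sequence[Sequence[str]]) -> bool:
    """ASSUMPTION: a fix is accepted if any one group has all its keywords in the summary."""
    text = summary.lower()
    return any(all(word in text for word in group) for group in groups)
```

Under this assumption, "Shorten the DNS TTL to 60s" would match the first group, while "force a refresh" would not, because the word "resolver" is missing. The real scorer in the repo may use a different rule.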
-def _cdn_purge_scope() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-E5",
-        title="CDN purge missed a hot asset after release",
-        description=(
-            "A marketing banner refresh missed a subset of CDN edges, so a "
-            "fraction of standard-tier users see the old creative."
-        ),
-        category="cdn",
-        difficulty="easy",
-        root_cause="cdn_purge_scope_mismatch",
-        root_cause_synonyms=(
-            "cdn purge scope mismatch",
-            "edge purge partial",
-            "shield purge missed",
-        ),
-        clue_keywords=("cdn", "purge", "edge", "shield"),
-        signals=(
-            "Small but persistent share of stale banner impressions",
-            "Affected edges cluster on a single PoP provider",
-        ),
-        logs={
-            "cdn-control-plane": "Purge job completed with 14 edges skipped (policy=legacy)",
-            "edge-pop-bom-1": "Serving banner_v12 while origin is on banner_v13",
-        },
-        metrics={
-            "dash-cdn": "stale_object_rate 1.4%, edge_sync_lag_s 312",
-        },
-        red_herring_metrics={
-            "dash-auth": "401_rate 0.2%",
-        },
-        kb={
-            "kb-cdn-purge": "Always use wildcard purge with full edge fanout for visual assets.",
-        },
-        good_handoff="investigator_agent",
-        accepted_fix_keywords=(
-            ("reissue", "cdn", "purge"),
-            ("fanout", "edge", "invalidation"),
-            ("rotate", "asset", "hash"),
-        ),
-        required_investigations=1,
-        customer_tier="standard",
-        affected_users_estimate=11_000,
-        revenue_impact_usd_per_min=60,
-        requires_mitigation=True,
-    )
-def _autocomplete_stale() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-E6",
-        title="Search autocomplete missing this week's products",
-        description=(
-            "Free-tier shoppers see a stale autocomplete list that does not "
-            "surface new SKUs released this Monday."
-        ),
-        category="search",
-        difficulty="easy",
-        root_cause="autocomplete_index_rebuild_skipped",
-        root_cause_synonyms=(
-            "autocomplete index rebuild skipped",
-            "suggestion index stale",
-            "nightly reindex missed",
-        ),
-        clue_keywords=("autocomplete", "index", "reindex", "suggestion"),
-        signals=(
-            "New SKUs launched Monday never appear in suggest responses",
-            "Full text search returns them correctly",
-        ),
-        logs={
-            "suggest-indexer": "Scheduled rebuild skipped (upstream lock held)",
-            "suggest-api": "Serving snapshot v88 (expected v91)",
-        },
-        red_herring_logs={
-            "payments-api": "steady 2xx",
-        },
-        metrics={
-            "dash-suggest": "index_version 88, target_version 91",
-            "dash-search": "full_text_recall 99%, autocomplete_recall 71%",
-        },
-        kb={
-            "kb-autocomplete": "Reindex lock must release on job exit and alert on missed window.",
-        },
-        good_handoff="ops_manager_agent",
-        accepted_fix_keywords=(
-            ("force", "index", "rebuild"),
-            ("release", "reindex", "lock"),
-            ("promote", "suggestion", "snapshot"),
-        ),
-        required_investigations=1,
-        customer_tier="free",
-        affected_users_estimate=18_000,
-        revenue_impact_usd_per_min=30,
-        requires_mitigation=True,
-    )
-def _webhook_retry_budget() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-E7",
-        title="Partner webhooks silently dropping",
-        description=(
-            "A handful of partner integrations stopped receiving webhook "
-            "deliveries after a downstream 429 spike."
-        ),
-        category="integrations",
-        difficulty="easy",
-        root_cause="webhook_retry_budget_exhausted",
-        root_cause_synonyms=(
-            "webhook retry budget exhausted",
-            "partner webhook giving up",
-            "429 retry exhaustion",
-        ),
-        clue_keywords=("webhook", "retry", "429", "budget"),
-        signals=(
-            "Deliveries succeed for some partners and silently fail for others",
-            "Affected partners all share a single rate-limit bucket",
-        ),
-        logs={
-            "webhook-dispatcher": "Retry budget exhausted for partner_bucket=bucket-7",
-            "partner-gateway": "HTTP 429 for 22 consecutive attempts on bucket-7",
-        },
-        red_herring_logs={
-            "catalog-api": "steady 2xx",
-        },
-        metrics={
-            "dash-webhooks": "delivery_success_bucket7 34%, retry_budget_remaining 0",
-        },
-        kb={
-            "kb-webhook-retry": "Split rate-limit buckets per partner and reset retry budgets on recovery.",
-        },
-        good_handoff="ops_manager_agent",
-        accepted_fix_keywords=(
-            ("split", "retry", "bucket"),
-            ("reset", "retry", "budget"),
-            ("pause", "partner", "bucket"),
-        ),
-        required_investigations=2,
-        customer_tier="standard",
-        affected_users_estimate=1_400,
-        revenue_impact_usd_per_min=80,
-        requires_mitigation=True,
-    )
-def _thumbnail_worker_oom() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-E8",
-        title="User profile thumbnails render blank on mobile",
-        description=(
-            "Free-tier mobile users see empty circles where their profile "
-            "photo should appear, intermittently."
-        ),
-        category="media",
-        difficulty="easy",
-        root_cause="thumbnail_worker_oom_killed",
-        root_cause_synonyms=(
-            "thumbnail worker oom killed",
-            "image worker out of memory",
-            "thumbnailer oom loop",
-        ),
-        clue_keywords=("thumbnail", "oom", "memory", "worker"),
-        signals=(
-            "Missing thumbnails correlate with HEIC uploads from newer devices",
-            "CPU is normal but worker restart count is spiking",
-        ),
-        logs={
-            "thumbnail-worker": "SIGKILL received (oom_score_adj=500)",
-            "image-pipeline": "HEIC decoder peak rss 1.9GB on large uploads",
-        },
-        metrics={
-            "dash-thumbnails": "render_success 82%, worker_restarts 240/hr",
-            "dash-k8s": "pod_oom_kill_count 42",
-        },
-        kb={
-            "kb-thumbnail": "Cap HEIC decode memory or reject above 30MP at the edge.",
-        },
-        good_handoff="triage_agent",
-        accepted_fix_keywords=(
-            ("raise", "memory", "limit"),
-            ("reject", "oversized", "heic"),
-            ("downscale", "before", "decode"),
-        ),
-        required_investigations=2,
-        customer_tier="free",
-        affected_users_estimate=55_000,
-        revenue_impact_usd_per_min=20,
-        requires_mitigation=True,
-    )
-def _recommender_heap_leak() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-M6",
-        title="Recommender latency drifts up after model swap",
-        description=(
-            "Homepage recommendation latency is drifting up over six hours "
-            "since this morning's model swap. p99 is now 2.1s."
-        ),
-        category="recommendations",
-        difficulty="medium",
-        root_cause="recommender_heap_leak_after_model_swap",
-        root_cause_synonyms=(
-            "recommender heap leak after model swap",
-            "embedding cache not released",
-            "old model tensors pinned",
-        ),
-        clue_keywords=("heap", "leak", "embedding", "model", "swap"),
-        signals=(
-            "Heap utilisation climbs 2% / hour since deploy",
-            "Full GC frequency doubled but does not recover memory",
-        ),
-        logs={
-            "recommender-service": "Loaded model v42; previous tensors not released",
-            "jvm-gc": "Old gen occupancy 88% after full GC",
-        },
-        red_herring_logs={
-            "catalog-api": "steady 2xx",
-        },
-        metrics={
-            "dash-recommender": "p99_latency_ms 2100, heap_used_pct 88",
-            "dash-jvm": "full_gc_per_min 4, reclaimed_bytes_low",
-        },
-        red_herring_metrics={
-            "dash-search": "ctr steady",
-        },
-        kb={
-            "kb-model-swap": "Release previous model tensors explicitly before binding the new one.",
-        },
-        good_handoff="investigator_agent",
-        accepted_fix_keywords=(
-            ("release", "previous", "model"),
-            ("unload", "embedding", "cache"),
-            ("rollback", "model", "swap"),
-        ),
-        required_investigations=2,
-        customer_tier="premium",
-        affected_users_estimate=95_000,
-        revenue_impact_usd_per_min=410,
-        requires_mitigation=True,
-    )
-def _consumer_group_rebalance() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-M7",
-        title="Order events stuck behind consumer rebalance storm",
-        description=(
-            "Order processing lag spiked after a rolling restart and has not "
-            "recovered; fresh orders are 90s behind real time."
-        ),
-        category="messaging",
-        difficulty="medium",
-        root_cause="consumer_group_rebalance_storm",
-        root_cause_synonyms=(
-            "consumer group rebalance storm",
-            "kafka consumer thrashing",
-            "repeated partition reassignment",
-        ),
-        clue_keywords=("kafka", "consumer", "rebalance", "partition"),
-        signals=(
-            "Consumer group rebalanced 11 times in 5 minutes",
-            "Lag stuck even though CPU is at 30%",
-        ),
-        logs={
-            "order-consumer": "Rebalance triggered: member id rotated, session timeout=10s",
-            "kafka-coordinator": "Generation 412 -> 423 in 5m, partitions churning",
-        },
-        red_herring_logs={
-            "auth-service": "normal 2xx",
-        },
-        metrics={
-            "dash-orders": "consumer_lag 90s, rebalance_count_5m 11",
-            "dash-kafka": "generation_rotations 2.2/min",
-        },
-        kb={
-            "kb-consumer-tuning": "Raise session.timeout.ms and heartbeat.interval.ms to avoid false expulsion.",
-        },
-        good_handoff="ops_manager_agent",
-        accepted_fix_keywords=(
-            ("raise", "session", "timeout"),
-            ("pin", "static", "membership"),
-            ("stabilise", "consumer", "group"),
-        ),
-        required_investigations=2,
-        customer_tier="premium",
-        affected_users_estimate=48_000,
-        revenue_impact_usd_per_min=520,
-        requires_mitigation=True,
-    )
-def _config_push_skipped_canary() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-M8",
-        title="Enterprise tenants hit TLS verify failures after config push",
-        description=(
-            "A global config change flipped a TLS verification flag in "
-            "production without going through canary."
-        ),
-        category="platform",
-        difficulty="medium",
-        root_cause="config_push_skipped_canary",
-        root_cause_synonyms=(
-            "config push skipped canary",
-            "global config bypassed stage",
-            "bulk config rollout regression",
-        ),
-        clue_keywords=("config", "canary", "push", "rollout"),
-        signals=(
-            "Enterprise tenants see TLS verify errors 3 minutes after deploy",
-            "Canary stage shows zero traffic for this change",
-        ),
-        logs={
-            "config-service": "Changeset CR-8812 applied globally (stages=[])",
-            "api-gateway": "TLS verify flag=strict caused downstream handshake failures",
-        },
-        red_herring_logs={
-            "email-service": "no anomalies",
-        },
-        metrics={
-            "dash-config": "canary_coverage 0%, rollout_surface 100%",
-            "dash-gateway": "tls_verify_failures 8.3%",
-        },
-        kb={
-            "kb-config-rollout": "Require canary + 15 minutes bake before promoting config changes.",
-        },
-        good_handoff="ops_manager_agent",
-        accepted_fix_keywords=(
-            ("rollback", "config", "change"),
-            ("re-enable", "canary", "stage"),
-            ("revert", "tls", "flag"),
-        ),
-        required_investigations=2,
-        customer_tier="enterprise",
-        affected_users_estimate=2_100,
-        revenue_impact_usd_per_min=640,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
-def _health_check_flapping() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-M9",
-        title="Autoscaler thrashing under brief latency blips",
-        description=(
-            "Autoscaler is adding and removing pods every 2 minutes in "
-            "response to very short latency blips."
-        ),
-        category="platform",
-        difficulty="medium",
-        root_cause="health_check_timeout_too_aggressive",
-        root_cause_synonyms=(
-            "health check timeout too aggressive",
-            "liveness probe too tight",
-            "autoscaler oscillating",
-        ),
-        clue_keywords=("health", "check", "liveness", "autoscaler"),
-        signals=(
-            "Pod churn 6x baseline with no underlying load change",
-            "Brief p99 blips align with scale events, not incidents",
-        ),
-        logs={
-            "kubelet": "Liveness probe failed: HTTP 500 after 800ms",
-            "autoscaler": "Scale up triggered; 3 pods added, 2 removed within 2m",
-        },
-        red_herring_logs={
-            "payments-api": "steady 2xx",
-        },
-        metrics={
-            "dash-k8s": "pod_churn_per_min 9, cpu_avg 42%",
-            "dash-slo": "p99_latency_ms spikes tied to scale events",
-        },
-        kb={
-            "kb-health-probe": "Raise liveness timeout and stagger readiness to avoid flap-driven scale events.",
-        },
-        good_handoff="triage_agent",
-        accepted_fix_keywords=(
-            ("raise", "probe", "timeout"),
-            ("dampen", "autoscaler", "cooldown"),
-            ("relax", "liveness", "threshold"),
-        ),
-        required_investigations=2,
-        customer_tier="standard",
-        affected_users_estimate=31_000,
-        revenue_impact_usd_per_min=210,
-        requires_mitigation=True,
-    )
-def _payment_webhook_dedupe() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-M10",
-        title="Payment confirmations delivered twice to enterprise partners",
-        description=(
-            "Two enterprise payment partners received the same confirmation "
-            "webhook twice for a subset of transactions."
-        ),
-        category="payments",
-        difficulty="medium",
-        root_cause="webhook_dedupe_window_too_narrow",
-        root_cause_synonyms=(
-            "webhook dedupe window too narrow",
-            "payment webhook duplicate delivery",
-            "idempotency window clock drift",
-        ),
-        clue_keywords=("webhook", "dedupe", "idempotency", "window"),
-        signals=(
-            "Duplicates concentrated on retries across failover boundary",
-            "Dedupe cache TTL is shorter than retry backoff",
-        ),
-        logs={
-            "payments-webhook": "Duplicate delivery for txn T-332a after dedupe cache eviction",
-            "scheduler": "Retry backoff 90s; dedupe ttl=60s",
-        },
-        red_herring_logs={
-            "email-service": "steady",
-        },
-        metrics={
-            "dash-payments": "duplicate_webhook_rate 0.9%, dedupe_hit_rate 88%",
-        },
-        kb={
-            "kb-webhook-dedupe": "Dedupe TTL must exceed the maximum retry backoff window.",
-        },
-        good_handoff="investigator_agent",
-        accepted_fix_keywords=(
-            ("extend", "dedupe", "ttl"),
-            ("shrink", "retry", "backoff"),
-            ("persist", "dedupe", "store"),
-        ),
-        required_investigations=2,
-        customer_tier="enterprise",
-        affected_users_estimate=620,
-        revenue_impact_usd_per_min=480,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
-def _origin_shield_bypass() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-M11",
-        title="Origin overloaded after CDN policy change",
-        description=(
-            "Origin servers are seeing 5x normal traffic because a CDN "
-            "policy change disabled origin shield for a large segment."
-        ),
-        category="cdn",
-        difficulty="medium",
-        root_cause="origin_shield_bypass_after_policy_change",
-        root_cause_synonyms=(
-            "origin shield bypass after policy change",
-            "shield disabled for segment",
-            "cache hierarchy collapsed",
-        ),
-        clue_keywords=("origin", "shield", "cdn", "policy"),
-        signals=(
-            "Origin 5xx rate climbs as CDN hit ratio collapses",
-            "New CDN policy rolled out exactly at fault onset",
-        ),
-        logs={
-            "cdn-policy": "Policy v5 removed shield targeting for premium segment",
-            "origin-lb": "Connection queue depth spiking 5x baseline",
-        },
-        red_herring_logs={
-            "dns-resolver": "no anomalies",
-        },
-        metrics={
-            "dash-cdn": "hit_ratio 67% (baseline 94%)",
-            "dash-origin": "rps 5.2x baseline, 5xx_rate 7.1%",
-        },
-        kb={
-            "kb-origin-shield": "Changes to shield routing must go through shadow traffic before promotion.",
-        },
-        good_handoff="investigator_agent",
-        accepted_fix_keywords=(
-            ("rollback", "cdn", "policy"),
-            ("re-enable", "origin", "shield"),
-            ("route", "through", "shield"),
-        ),
-        required_investigations=3,
-        customer_tier="premium",
-        affected_users_estimate=240_000,
-        revenue_impact_usd_per_min=1_300,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
-def _gpu_memory_fragmentation() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-H6",
-        title="LLM inference latency drifts up on production A100 pool",
-        description=(
-            "Enterprise API latency for the inference gateway has drifted "
-            "from 420ms to 1.4s over 36 hours, with OOMs on larger prompts."
-        ),
-        category="ml_inference",
-        difficulty="hard",
-        root_cause="gpu_memory_fragmentation_after_prompt_schema_change",
-        root_cause_synonyms=(
-            "gpu memory fragmentation after prompt schema change",
-            "kv cache fragmentation",
-            "inference pool memory fragmentation",
-        ),
-        clue_keywords=("gpu", "memory", "fragmentation", "kv", "cache"),
-        signals=(
-            "Free VRAM fragmented into small blocks even though total free > 18GB",
-            "OOM errors concentrate on prompts >2k tokens",
-        ),
-        logs={
-            "inference-gateway": "CUDA OOM despite torch reports 18GB free; fragmentation detected",
-            "model-runner": "Prompt schema v3 increased variable sequence lengths",
-        },
-        red_herring_logs={
-            "auth-service": "steady",
-        },
-        metrics={
-            "dash-inference": "p99_latency_ms 1400, oom_rate 3.2%",
-            "dash-gpu": "vram_fragmentation_score 0.74",
-        },
-        kb={
-            "kb-vram": "Recycle inference workers daily and pad sequences to bucketed lengths.",
-        },
-        good_handoff="investigator_agent",
-        accepted_fix_keywords=(
-            ("recycle", "inference", "workers"),
-            ("bucket", "prompt", "lengths"),
-            ("rollback", "prompt", "schema"),
-        ),
-        required_investigations=3,
-        customer_tier="enterprise",
-        affected_users_estimate=5_200,
-        revenue_impact_usd_per_min=1_850,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
-def _replication_saturation() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-H7",
-        title="Cross-region replication lag blocks disaster-recovery RPO",
-        description=(
-            "Replication lag from the primary region to DR has exceeded "
-            "five minutes for the last hour, violating RPO=60s."
-        ),
-        category="data",
-        difficulty="hard",
-        root_cause="replication_saturation_during_backup_window",
-        root_cause_synonyms=(
-            "replication saturation during backup window",
-            "wal shipping backpressure",
-            "replica network saturation",
-        ),
-        clue_keywords=("replication", "lag", "wal", "rpo", "backup"),
-        signals=(
-            "Lag correlates exactly with nightly backup window",
-            "Network egress saturated on primary -> DR link",
-        ),
-        logs={
-            "db-primary": "WAL shipping backpressure; replica slot lagging 6.2m",
-            "backup-job": "Base backup in progress; 4.1 GB/s read rate",
-        },
-        red_herring_logs={
-            "notification-gateway": "steady delivery",
-        },
-        metrics={
-            "dash-replication": "lag_seconds 372 (rpo=60)",
-            "dash-network": "egress_primary_to_dr 9.8 Gbps (cap=10)",
-        },
-        kb={
-            "kb-replication-backup": "Throttle backup or move it off hours of peak replication traffic.",
-        },
-        good_handoff="ops_manager_agent",
-        accepted_fix_keywords=(
-            ("throttle", "backup", "rate"),
-            ("shift", "backup", "window"),
-            ("raise", "replication", "bandwidth"),
-        ),
-        required_investigations=3,
-        customer_tier="enterprise",
-        affected_users_estimate=8_900,
-        revenue_impact_usd_per_min=1_400,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
-def _cache_key_collision() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-H8",
-        title="Cross-tenant data bleed from cache key collision",
-        description=(
-            "A rare cache key collision is briefly returning one enterprise "
-            "tenant's data to another. This is a data-isolation incident."
-        ),
-        category="security",
-        difficulty="hard",
-        root_cause="cache_key_collision_across_tenants",
-        root_cause_synonyms=(
-            "cache key collision across tenants",
-            "shared cache tenant bleed",
-            "tenant id missing from cache key",
-        ),
-        clue_keywords=("cache", "key", "collision", "tenant"),
-        signals=(
-            "Two enterprise tenants report seeing each other's dashboard metadata",
-            "Cache key construction omits tenant-id under a specific code path",
-        ),
-        logs={
-            "api-gateway": "Cache HIT for key=/v2/workspace/42 served to tenant=91",
-            "cache-layer": "Collision detected between tenants 42 and 91 on key prefix /v2/workspace",
-        },
-        red_herring_logs={
-            "email-service": "steady",
-        },
-        metrics={
-            "dash-cache": "collision_count 14 in last 2h",
-            "dash-security": "isolation_violations 2",
-        },
-        kb={
-            "kb-cache-tenant": "Prefix every cache key with tenant_id and enforce via lint check.",
-        },
-        good_handoff="ops_manager_agent",
-        accepted_fix_keywords=(
-            ("prefix", "tenant", "cache"),
-            ("invalidate", "shared", "cache"),
-            ("quarantine", "cache", "segment"),
-        ),
-        required_investigations=3,
-        customer_tier="enterprise",
-        affected_users_estimate=320,
-        revenue_impact_usd_per_min=2_100,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
-def _cron_dst_double_trigger() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-H9",
-        title="Scheduled jobs fire twice at DST rollover",
-        description=(
-            "Key premium billing jobs executed twice at the daylight-saving "
-            "transition, causing premium charge duplicates."
-        ),
-        category="scheduling",
-        difficulty="hard",
-        root_cause="cron_dst_transition_double_trigger",
-        root_cause_synonyms=(
-            "cron dst transition double trigger",
-            "scheduler timezone ambiguity",
-            "dst fallback replay",
-        ),
-        clue_keywords=("cron", "dst", "timezone", "scheduler"),
-        signals=(
-            "Job history shows two runs at 01:00 and 01:00 local time",
-            "Billing duplicates concentrate on a single geographic region",
-        ),
-        logs={
-            "scheduler": "Fired job billing.nightly at 2026-03-29 01:00 (GMT+1 and GMT+0)",
-            "billing-worker": "Second invocation completed 12 minutes after first",
-        },
-        red_herring_logs={
-            "catalog-api": "steady 2xx",
-        },
-        metrics={
-            "dash-scheduler": "double_fire_count 3 (expected 0)",
-            "dash-billing": "duplicate_charge_rate 2.1%",
-        },
-        kb={
-            "kb-dst-schedule": "Anchor scheduled jobs on UTC and convert to local time at display only.",
-        },
-        good_handoff="investigator_agent",
-        accepted_fix_keywords=(
-            ("anchor", "schedule", "utc"),
-            ("deduplicate", "scheduled", "runs"),
-            ("reconcile", "duplicate", "charges"),
-        ),
-        required_investigations=3,
-        customer_tier="premium",
-        affected_users_estimate=6_400,
-        revenue_impact_usd_per_min=1_100,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
-def _partial_publish_feed() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-H10",
-        title="Real-time feed gaps during partial publish",
-        description=(
-            "Premium trading-floor customers see gaps in the realtime price "
-            "feed after a publisher restart; some updates never arrived."
-        ),
-        category="realtime",
-        difficulty="hard",
-        root_cause="partial_publish_without_transaction_boundary",
-        root_cause_synonyms=(
-            "partial publish without transaction boundary",
-            "publisher crash mid batch",
-            "realtime feed gap",
-        ),
-        clue_keywords=("publish", "transaction", "feed", "partial"),
-        signals=(
-            "Sequence numbers skip in a bounded window around the publisher restart",
-            "Replay API can fill the gap but live subscribers missed it",
-        ),
-        logs={
-            "price-publisher": "Process restarted mid-batch, seq=88230 not flushed",
-            "realtime-bus": "Detected sequence gap 88230-88236 on channel=prices.us",
-        },
-        red_herring_logs={
-            "auth-service": "steady",
-        },
-        metrics={
-            "dash-realtime": "gap_count 6 in 30s, subscriber_reconcile_lag_s 48",
-        },
-        kb={
-            "kb-publish-txn": "Wrap each batch in a transactional publish so crashes never leave gaps.",
-        },
-        good_handoff="investigator_agent",
-        accepted_fix_keywords=(
-            ("enable", "transactional", "publish"),
-            ("replay", "sequence", "gap"),
-            ("force", "subscriber", "reconcile"),
-        ),
-        required_investigations=3,
-        customer_tier="premium",
-        affected_users_estimate=3_900,
-        revenue_impact_usd_per_min=1_750,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
-def _ssd_firmware_regression() -> IncidentTemplate:
-    return IncidentTemplate(
-        id="INC-H11",
-        title="Storage checksum failures on upgraded SSD fleet",
-        description=(
-            "Enterprise object storage is returning checksum-mismatch errors "
-            "on a subset of volumes after a firmware roll-forward."
-        ),
-        category="storage",
-        difficulty="hard",
-        root_cause="ssd_firmware_checksum_regression",
-        root_cause_synonyms=(
-            "ssd firmware checksum regression",
-            "storage firmware corruption",
-            "nvme firmware crc bug",
-        ),
-        clue_keywords=("firmware", "ssd", "checksum", "storage"),
-        signals=(
-            "Checksum failures concentrate on volumes upgraded in the last 72 hours",
-            "Vendor advisory mentions similar symptoms after firmware F2.14",
-        ),
-        logs={
-            "storage-agent": "CRC mismatch on volume vol-221 firmware=F2.14",
-            "fleet-manager": "Upgrade batch included F2.14 for 18 volumes",
-        },
-        red_herring_logs={
-            "email-service": "steady",
-        },
-        metrics={
-            "dash-storage": "checksum_error_rate 0.8%",
-            "dash-fleet": "volumes_on_F2.14 18, volumes_healthy 402",
-        },
-        kb={
-            "kb-ssd-firmware": "Quarantine affected firmware and roll back to the last known-good version.",
-        },
-        good_handoff="ops_manager_agent",
-        accepted_fix_keywords=(
-            ("rollback", "ssd", "firmware"),
-            ("quarantine", "affected", "volumes"),
-            ("reseed", "checksum", "index"),
-        ),
-        required_investigations=3,
-        customer_tier="enterprise",
-        affected_users_estimate=1_800,
-        revenue_impact_usd_per_min=1_950,
-        requires_mitigation=True,
-        postmortem_required=True,
-    )
 def build_incident_library() -> IncidentLibrary:
-    """Return the built-in enterprise incident library (30 templates)."""
     return IncidentLibrary(
         templates_by_task={
-            "easy": [
-                _redis_pool(),
-                _jwt_clock_skew(),
-                _email_spam_false_positive(),
-                _dns_ttl_stale(),
-                _cdn_purge_scope(),
-                _autocomplete_stale(),
-                _webhook_retry_budget(),
-                _thumbnail_worker_oom(),
-            ],
             "medium": [
                 _cache_invalidation_lag(),
                 _tz_normalization(),
                 _invoice_idempotency(),
                 _tls_expiry(),
                 _feature_flag_rollout(),
-                _recommender_heap_leak(),
-                _consumer_group_rebalance(),
-                _config_push_skipped_canary(),
-                _health_check_flapping(),
-                _payment_webhook_dedupe(),
-                _origin_shield_bypass(),
             ],
             "hard": [
                 _promo_rate_cascade(),
@@ -1736,12 +868,6 @@ def build_incident_library() -> IncidentLibrary:
                 _alert_storm(),
                 _inventory_race(),
                 _deadlock_database(),
-                _gpu_memory_fragmentation(),
-                _replication_saturation(),
-                _cache_key_collision(),
-                _cron_dst_double_trigger(),
-                _partial_publish_feed(),
-                _ssd_firmware_regression(),
             ],
         }
     )

     )
 def build_incident_library() -> IncidentLibrary:
+    """Return the built-in enterprise incident library."""
     return IncidentLibrary(
         templates_by_task={
+            "easy": [_redis_pool(), _jwt_clock_skew(), _email_spam_false_positive()],
             "medium": [
                 _cache_invalidation_lag(),
                 _tz_normalization(),
                 _invoice_idempotency(),
                 _tls_expiry(),
                 _feature_flag_rollout(),
             ],
             "hard": [
                 _promo_rate_cascade(),
                 _alert_storm(),
                 _inventory_race(),
                 _deadlock_database(),
             ],
         }
     )
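The templates in this diff all carry an `accepted_fix_keywords` field shaped as a tuple of keyword groups. The diff does not show how the environment scores a proposed fix against those groups, so the following is only a minimal sketch under one assumption: a fix counts as accepted when every keyword of at least one group appears in the fix text (case-insensitive substring match). The helper name `matches_accepted_fix` is hypothetical and not part of the repository; the keyword groups are copied from the removed `INC-M10` template.

```python
from typing import Iterable, Tuple

KeywordGroup = Tuple[str, ...]


def matches_accepted_fix(
    fix_text: str, accepted_fix_keywords: Iterable[KeywordGroup]
) -> bool:
    """Hypothetical matcher (an assumption, not the repository's scorer).

    Returns True when all keywords of at least one group occur in
    fix_text, compared case-insensitively as substrings.
    """
    normalized = fix_text.lower()
    return any(
        all(keyword in normalized for keyword in group)
        for group in accepted_fix_keywords
    )


# Keyword groups copied from the removed INC-M10 template in this diff.
WEBHOOK_DEDUPE_FIXES = (
    ("extend", "dedupe", "ttl"),
    ("shrink", "retry", "backoff"),
    ("persist", "dedupe", "store"),
)

if __name__ == "__main__":
    print(matches_accepted_fix("Extend the dedupe TTL beyond 90s", WEBHOOK_DEDUPE_FIXES))
    print(matches_accepted_fix("Restart the email service", WEBHOOK_DEDUPE_FIXES))
```

Substring matching is deliberately loose here; a real scorer might tokenize the text or require word boundaries, and that choice would change which fixes pass.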