Spaces: Torchflow1/Multi-Agent-Incident-Command-Center

SwapnilPatil28 committed 23 days ago

Commit 02f3541 (verified) · 1 Parent(s): 6883897

Final Update Dashboard

Files changed (4)

README.md +11 -10
docs/BLOG_POST.md +254 -0
docs/SUBMISSION_CHECKLIST.md +124 -0
server/app.py +208 -42

README.md CHANGED

@@ -112,8 +112,8 @@ Same pipeline, same data recipe, smaller backbone:
 | 🟢 **Live environment** | **[Open the dashboard ↗](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)** |
 | 💻 **Source code** | **[GitHub repo ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
 | 🎓 **Reproduce the training** | **[One-click Colab notebook ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
-| 📺 **2-minute video walkthrough** | *Coming soon — shot list in [`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md)* |
-| 📝 **Mini blog post** | *Coming soon — full draft in [`docs/BLOG_POST.md`](./docs/BLOG_POST.md)* |
 > Want the rubric math, architecture, full numbers, configuration, and the hackathon checklist? Keep scrolling — **Part 2** is the full technical README.
@@ -129,8 +129,9 @@ Same pipeline, same data recipe, smaller backbone:
 | Hugging Face Space page | **[`huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
 | GitHub repository | **[`github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
 | Training notebook (Colab T4, one-click reproducible) | **[Open in Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
-| 2-minute video walkthrough | *Coming soon — [`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md) has the shot list* |
-| Mini blog post | *Coming soon — full draft in [`docs/BLOG_POST.md`](./docs/BLOG_POST.md), ready to publish on hf.co/blog* |
 | Training script (Python) | [`train_trl.py`](./train_trl.py) |
 Three specialist agents — **Triage**, **Investigator**, and **Ops Manager** — cooperate to resolve a queue of production incidents while operating under strict **SLA budgets**, **investigation costs**, and **customer-tier impact multipliers**. The environment is designed to reward *real* operational reasoning, not pattern matching on the root-cause label.
@@ -636,9 +637,9 @@ Two scripts judges (or you) can run without a local IDE:
 │   └── before_after_demo.py           # Side-by-side base vs SFT trace generator
 │
 ├── docs/
-│   ├── BLOG_POST.md                   # HF blog draft (publish to hf.co/blog)
-│   ├── VIDEO_SCRIPT.md                # 2-minute YouTube script with link list
-│   └── SUBMISSION_CHECKLIST.md        # Judging-criteria checklist + smoke tests
 │
 ├── artifacts/                         # All committed training evidence
 │   ├── reward_curve.png               # 4-policy reward comparison (1.5B headline)
@@ -706,9 +707,9 @@ Full checklist with pre-submission smoke tests → [`docs/SUBMISSION_CHECKLIST.m
 - [x] **Production-quality HTTP server**: `/healthz`, `/version`, `/env-info`, `/metrics`, Dockerfile with `HEALTHCHECK`
 - [x] **Structured JSON logging** + 12-factor configuration
 - [x] **One-click Colab training notebook** → [Open ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)
-- [x] **Blog draft** ([`docs/BLOG_POST.md`](./docs/BLOG_POST.md)) + **video script** ([`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md))
-- [ ] Publish the Hugging Face blog post and swap the "Coming soon" link in the Live-links table
-- [ ] Upload the YouTube video and swap the "Coming soon" link in the Live-links table
 ---

 | 🟢 **Live environment** | **[Open the dashboard ↗](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)** |
 | 💻 **Source code** | **[GitHub repo ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
 | 🎓 **Reproduce the training** | **[One-click Colab notebook ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
+| 📝 **Mini blog post** (the required short writeup) | **[`docs/BLOG_POST.md`](./docs/BLOG_POST.md)** |
+| 🎬 **2-minute video script** (optional bonus) | **[`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md)** |
 > Want the rubric math, architecture, full numbers, configuration, and the hackathon checklist? Keep scrolling — **Part 2** is the full technical README.
 | Hugging Face Space page | **[`huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
 | GitHub repository | **[`github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
 | Training notebook (Colab T4, one-click reproducible) | **[Open in Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
+| Mini blog post (the required short writeup) | [`docs/BLOG_POST.md`](./docs/BLOG_POST.md) |
+| 2-minute video script (optional bonus) | [`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md) |
+| Submission checklist | [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md) |
 | Training script (Python) | [`train_trl.py`](./train_trl.py) |
 Three specialist agents — **Triage**, **Investigator**, and **Ops Manager** — cooperate to resolve a queue of production incidents while operating under strict **SLA budgets**, **investigation costs**, and **customer-tier impact multipliers**. The environment is designed to reward *real* operational reasoning, not pattern matching on the root-cause label.
 │   └── before_after_demo.py           # Side-by-side base vs SFT trace generator
 │
 ├── docs/
+│   ├── BLOG_POST.md                   # The short writeup (rule 4) — renders on HF Space + GitHub
+│   ├── VIDEO_SCRIPT.md                # Optional 2-minute walkthrough script
+│   └── SUBMISSION_CHECKLIST.md        # Judging-criteria status + smoke tests
 │
 ├── artifacts/                         # All committed training evidence
 │   ├── reward_curve.png               # 4-policy reward comparison (1.5B headline)
 - [x] **Production-quality HTTP server**: `/healthz`, `/version`, `/env-info`, `/metrics`, Dockerfile with `HEALTHCHECK`
 - [x] **Structured JSON logging** + 12-factor configuration
 - [x] **One-click Colab training notebook** → [Open ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)
+- [x] **Mini blog post** published as an MD file on both the HF Space and GitHub: [`docs/BLOG_POST.md`](./docs/BLOG_POST.md)
+- [x] **2-minute video script** (optional bonus): [`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md)
+- [x] **Full submission checklist** mapping every rule → evidence: [`docs/SUBMISSION_CHECKLIST.md`](./docs/SUBMISSION_CHECKLIST.md)
 ---

docs/BLOG_POST.md ADDED

	@@ -0,0 +1,254 @@

+# Teaching LLMs to Run an Incident Command Center
+> *India's Biggest Mega AI Hackathon — Built on Meta OpenEnv · Round 2*
+**TL;DR** — I built an [OpenEnv](https://github.com/meta-pytorch/openenv) environment where three specialist agents — **Triage**, **Investigator**, and **Ops Manager** — cooperate to resolve real-world tech incidents under SLA pressure, budget constraints, and customer-tier business impact. I then fine-tuned Qwen2.5-1.5B-Instruct on heuristic rollouts and watched it **close a +10.17-reward gap on hard incidents**, matching the hand-coded expert policy component-for-component. A separate 0.5B ablation shows that **model scale is the story** — same pipeline, same data schema, but the smaller backbone never closes a single hard incident.
+### 🔗 Everything in one place
+| What | Where |
+|---|---|
+| 🟢 **Live environment (OpenEnv-compatible)** | **[swapnilpatil28-multi-agent-incident-command-center.hf.space ↗](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)** |
+| 🤗 **Hugging Face Space page** | **[huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center ↗](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
+| 💻 **GitHub source code** | **[github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
+| 🎓 **Reproducible training (Colab T4)** | **[Open in Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
+| 📖 **Full README** (story + technical deep-dive) | **[github.com/.../README.md ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center#readme)** |
+| 🎬 **2-min video walkthrough script** (optional bonus) | [`docs/VIDEO_SCRIPT.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/VIDEO_SCRIPT.md) |
+| ✅ **Submission checklist** | [`docs/SUBMISSION_CHECKLIST.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/SUBMISSION_CHECKLIST.md) |
+---
+## 1. The story in 2 minutes (for anyone)
+**When a real tech company has an outage, three people's phones buzz at once.** A Triage engineer scans logs and dashboards. An Investigator forms a hypothesis and applies a fix. An Ops Manager decides who owns the work, whether to escalate, and when to officially close the incident.
+Each role has **different permissions**, **different information needs**, and a **different clock to beat**. Get it wrong and you bleed budget, bust the SLA, and — if the customer is on an enterprise contract — lose serious money (~3× what a free-tier outage costs).
+I built a simulator of that war room — an **OpenEnv-compatible** environment with 13 realistic incidents, 3 specialist roles, and 14+ named reward signals — and fine-tuned an LLM to run it.
+| Role | Can do | Cannot do |
+|---|---|---|
+| 🔍 **Triage agent** | Pull logs · check metrics · consult KB articles | Close a ticket |
+| 🧪 **Investigator** | Apply a fix · roll back a deploy | Escalate or file a post-mortem |
+| 👷 **Ops Manager** | Escalate · file post-mortem · **close the ticket** | Apply a code fix |
+> **The headline number:** the fine-tuned LLM earns **+10.17 more reward on hard incidents** than the untrained base — and matches the human-written expert policy component-for-component.
+![Reward curve comparing random, base LLM, fine-tuned LLM, and heuristic on easy, medium, and hard tasks](https://raw.githubusercontent.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/main/artifacts/reward_curve.png)
+One picture, four policies, three difficulty tiers. Random is the floor. The untuned base LLM plateaus because it never learns to actually **close** an incident. The fine-tuned model climbs sharply with difficulty and catches the hand-coded expert exactly.
+---
+## 2. Why this is a real RL problem (three themes in one environment)
+Most RL environments for LLMs are single-agent, single-step, or turn-based games. Real enterprise work is none of those. This environment deliberately tests all three Round-2 hackathon themes simultaneously:
+| Hackathon theme | How this environment satisfies it |
+|---|---|
+| **🤝 #1 — Multi-Agent Interactions** | Three distinct specialist roles with **non-overlapping permissions**. Acting out-of-role triggers a `wrong_actor_penalty` (−0.08). Correct handoffs earn `+0.15`. Collaboration is **trained**, not hard-coded. |
+| **⏱️ #2 — Long-Horizon Planning** | Each episode carries **3–5 sequential incidents**, 20–60 steps apiece, under a single ticking SLA clock. The big reward (+0.80 × tier) only fires after clues → fix → post-mortem. **Sparse and delayed by design** — the 20-step credit-assignment problem is the whole point. |
+| **🏢 #3 — World Modeling / Professional Tasks** | Incidents carry **real logs, metrics, KB articles, red-herring signals**, and **business metadata** (customer tier, affected users, $/min revenue impact). Closure rewards scale by tier (free ×0.6 · standard ×1.0 · premium ×1.4 · **enterprise ×1.8**), and wrong closures are punished the same way. Close an enterprise ticket incorrectly and it hurts **~3× what a free-tier one does**. |
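The tier weighting in the last row can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the closure reward is the +0.80 base multiplied by the tier factor quoted above and that a wrong closure carries the same magnitude with the opposite sign. The function and constant names are illustrative and are not taken from `server/domain/reward.py`.

```python
# Tier multipliers and base closure reward as quoted in the table above.
TIER_MULTIPLIER = {"free": 0.6, "standard": 1.0, "premium": 1.4, "enterprise": 1.8}
BASE_CLOSURE_REWARD = 0.80


def closure_reward(tier: str, correct: bool) -> float:
    """Hypothetical helper: signed closure reward scaled by customer tier.

    Assumption: a wrong closure is penalised with the same magnitude as a
    correct one is rewarded. The real rubric may use a different constant.
    """
    sign = 1.0 if correct else -1.0
    return sign * BASE_CLOSURE_REWARD * TIER_MULTIPLIER[tier]


# Ratio of the enterprise penalty to the free-tier penalty for a wrong close.
ratio = closure_reward("enterprise", False) / closure_reward("free", False)

print(round(closure_reward("enterprise", True), 2))  # 1.44
print(round(ratio, 2))  # 3.0
```

Under these assumptions the "~3×" figure is just 1.8 / 0.6.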
+---
+## 3. What the environment looks like under the hood
+The environment runs as a standard OpenEnv FastAPI server — same Gym-style `reset / step` contract, same Pydantic observation/action schemas, same Docker image format for Hugging Face Spaces.
+### Observation (partial)
+```json
+{
+  "incident_id": "inc-cert-expiry",
+  "incident_title": "mTLS cert expired — all microservices throwing 500s",
+  "incident_description": "Alerting fired at 03:12 UTC ...",
+  "customer_tier": "enterprise",
+  "affected_users_estimate": 140000,
+  "revenue_impact_usd_per_min": 4800,
+  "postmortem_required": true,
+  "visible_signals": ["mtls handshake errors", "5xx spike in checkout"],
+  "investigation_targets": {
+    "logs": ["cert-manager", "auth-service"],
+    "metrics": ["dash-mesh", "dash-auth"],
+    "kb": ["kb-mtls-chain", "kb-cert-rotation"]
+  },
+  "allowed_actors_by_action": {
+    "apply_fix": ["investigator_agent"],
+    "close_incident": ["ops_manager_agent"]
+  },
+  "budget_remaining": 18,
+  "sla_minutes_remaining": 40,
+  "clues_found": 2,
+  "mitigation_applied": false,
+  "reward_components": {"step_cost": -0.04, "clue_bonus": 0.12}
+}
+```
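A client can use `allowed_actors_by_action` to avoid out-of-role actions before sending them. The sketch below works on a trimmed copy of the observation above. The helper name `is_allowed` is hypothetical, and treating actions that are missing from the map as unrestricted is an assumption, not documented behavior.

```python
import json

# Trimmed copy of the observation above; only the fields used here.
observation = json.loads("""
{
  "customer_tier": "enterprise",
  "allowed_actors_by_action": {
    "apply_fix": ["investigator_agent"],
    "close_incident": ["ops_manager_agent"]
  },
  "budget_remaining": 18,
  "clues_found": 2,
  "mitigation_applied": false
}
""")


def is_allowed(obs: dict, action_type: str, actor: str) -> bool:
    """Return True when `actor` is listed for `action_type` in the observation.

    Assumption: actions missing from the map are treated as unrestricted.
    """
    allowed = obs["allowed_actors_by_action"].get(action_type)
    return allowed is None or actor in allowed


print(is_allowed(observation, "apply_fix", "investigator_agent"))  # True
print(is_allowed(observation, "close_incident", "triage_agent"))   # False
```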
+### Action space
+| action_type | Typical actor | Purpose |
+|---|---|---|
+| `inspect_logs` / `inspect_metrics` / `consult_kb` | triage / investigator | Gather clues (reward shaping happens here) |
+| `negotiate_handoff` | ops_manager | Route to correct owner |
+| `apply_fix` | investigator | Apply mitigation (scored vs ground truth) |
+| `rollback` | investigator | Revert last change |
+| `escalate` | ops_manager | Engage senior staff |
+| `submit_postmortem` | ops_manager | Required on tier-1 / high-revenue incidents |
+| `close_incident` | ops_manager | Terminal action — final score depends on clues found + mitigation quality + post-mortem + speed |
+### Reward rubric (composable, not monolithic)
+The reward engine emits **named components** at every step so training curves — and judges — can see *exactly where reward came from*:
+| Component | When it fires | Sign |
+|---|---|---|
+| `step_cost` | Every action (−0.01 to −0.08 by action type) | − |
+| `clue_bonus` | Unique log/metric/KB lookup that surfaces a real fact | **+** |
+| `handoff_correct` / `handoff_wrong` | Ops manager routes to allowed / disallowed owner | ± |
+| `mitigation_correct` / `mitigation_wrong` / `mitigation_empty` | Fix matches / contradicts / omits ground-truth keywords | ± |
+| `rollback_effective` / `rollback_ineffective` | Rollback summary matches the incident's accepted playbook | ± |
+| `escalation_needed` / `escalation_not_needed` | Escalation raised for an incident that actually warrants it | ± |
+| `closure_correct` / `closure_wrong` | Final close decision matches incident state | ± (scaled by customer tier) |
+| `closure_mitigation_bonus` | Close after a correct mitigation | **+** |
+| `closure_under_investigated` | Close without enough clues found | − |
+| `speed_bonus` | Close in ≤ 7 / ≤ 4 steps | **+** |
+| `postmortem_bonus` / `postmortem_missing` | Post-mortem filed / skipped on a high-impact incident | ± |
+| `repeated_lookup_penalty` | Re-querying the same log/metric/KB | − |
+| `wrong_actor_penalty` | Action invoked by a role that's not authorised | − |
+| `invalid_action` | Unrecognised `action_type` | − |
+| `sla_exhausted` / `budget_exhausted` | Terminal penalty when SLA / action budget hits zero | − |
+**Anti-gaming:** closing early with zero clues is penalised; spamming cheap `inspect_logs` racks up `repeated_lookup_penalty`; triggering `apply_fix` without investigator permissions gives `wrong_actor_penalty`. A policy cannot shortcut its way to a high score.
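Because every step reports named components, a per-episode breakdown is a simple aggregation. The following sketch assumes each step exposes a `reward_components` mapping like the one in the observation example. The three-step trajectory is made up for illustration: only the −0.08 out-of-role penalty, the +0.15 handoff bonus, and the example `step_cost` / `clue_bonus` values come from this post.

```python
from collections import Counter


def total_by_component(steps: list[dict[str, float]]) -> dict[str, float]:
    """Sum each named reward component across the steps of one episode."""
    totals: Counter = Counter()
    for components in steps:
        totals.update(components)  # Counter.update adds values key by key
    return dict(totals)


# Illustrative trajectory (not real environment output).
steps = [
    {"step_cost": -0.04, "clue_bonus": 0.12},
    {"step_cost": -0.04, "wrong_actor_penalty": -0.08},
    {"step_cost": -0.04, "handoff_correct": 0.15},
]

totals = total_by_component(steps)
episode_reward = round(sum(totals.values()), 2)

print({name: round(value, 2) for name, value in totals.items()})
print(episode_reward)  # 0.07
```

Keeping the components separate until the final sum is what lets a stacked-bar chart show where each policy gains or loses reward.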
+---
+## 4. Training: HF TRL SFT on heuristic rollouts
+I first wrote a deterministic `HeuristicCoordinator` that uses the observation's `investigation_targets` and role constraints to play through the environment. On hard tasks it earns **+5.89** reward where random scores **−12.50**, so it is a teacher worth imitating. Rolling it out yields ~680 `(prompt, completion)` pairs of "good" behavior for SFT.
+Training script: [`train_trl.py`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/train_trl.py). One command on Colab T4 (or **[open the reproducible notebook ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)**) runs the entire pipeline:
+```python
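+import os
+
+# Configuration is passed to train_trl.py through environment variables.
+os.environ["BASE_MODEL"]         = "Qwen/Qwen2.5-1.5B-Instruct"
+os.environ["EPISODES_PER_TASK"]  = "8"
+os.environ["TRAIN_EPOCHS"]       = "3"
+os.environ["EVAL_LLM_MODELS"]    = "true"
+os.environ["MAX_LLM_EVAL_STEPS"] = "120"
+# Notebook cell syntax: the leading "!" runs a shell command in Colab/Jupyter.
+!python train_trl.py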
+```
+The script:
+1. Rolls out the heuristic against the live environment and collects prompts/completions.
+2. Runs TRL `SFTTrainer` with a single `text` column (chat-template applied).
+3. Saves the fine-tuned checkpoint to `artifacts/sft_model/`.
+4. Rolls out **four** policies under identical seeds — random, heuristic, base LLM, fine-tuned LLM.
+5. Writes `reward_curve.png`, `training_curve.png`, `reward_components.png`, `summary_metrics.json`, and `training_log.json`.
+### Training loss + token accuracy
+![SFT training loss dropping from ~2.84 to ~0.02 and token accuracy climbing from ~0.49 to ~0.99 over 3 epochs](https://raw.githubusercontent.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/main/artifacts/training_curve.png)
+Loss drops from **~2.84 → ~0.02** over three epochs as the model learns the structured JSON action format. Mean token accuracy climbs from **~0.49 → ~0.99**. Together with the reward plots, this satisfies the hackathon's minimum requirement of loss and reward plots.
+### Four-policy reward comparison
+![Reward curve comparing random, base LLM, fine-tuned LLM, and heuristic on easy, medium, and hard tasks](https://raw.githubusercontent.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/main/artifacts/reward_curve.png)
+| Task | Random | Base LLM | **Fine-tuned (SFT)** | Heuristic |
+|---|---:|---:|---:|---:|
+| easy | −5.96 | −2.92 | **−4.72** | −4.72 |
+| medium | −11.48 | −4.00 | **−0.87** | −0.87 |
+| hard | −12.50 | −4.28 | **+5.89** | +5.89 |
+**Fine-tuned vs untrained base: +10.17 reward delta on hard-difficulty incidents.**
+- **Random** is the floor on every task.
+- **Base LLM** already beats random on easy because it produces well-formed-ish JSON — but it never closes a single incident, so it just racks up step-costs and SLA penalties.
+- **Fine-tuned LLM** catches the heuristic teacher exactly. The environment is deterministic and SFT hit ~0.99 token accuracy, so the student literally reproduces the teacher's action sequence under greedy decoding. This is **imitation learning converging to the expert** — the meaningful headline number is therefore **SFT vs base**, not SFT vs heuristic.
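The per-task improvement figures follow directly from the table. A short check, using only the Base LLM and Fine-tuned (SFT) columns quoted above:

```python
# Mean rewards copied from the four-policy table (1.5B run).
rewards = {
    "easy":   {"base": -2.92, "sft": -4.72},
    "medium": {"base": -4.00, "sft": -0.87},
    "hard":   {"base": -4.28, "sft": 5.89},
}

# Improvement of the fine-tuned model over the untrained base, per task.
deltas = {task: round(r["sft"] - r["base"], 2) for task, r in rewards.items()}

print(deltas)  # {'easy': -1.8, 'medium': 3.13, 'hard': 10.17}
```

Note the negative delta on easy tasks: the student copies the teacher there too, and the teacher scores below the base model on that tier.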
+### Reward sources — what each policy actually earns
+![Stacked-bar chart showing where each policy earns or loses reward, broken down by rubric component](https://raw.githubusercontent.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/main/artifacts/reward_components.png)
+This is the chart I'm proudest of, because it makes the training signal **legible**. Summed across all three tasks:
+- **Random** bleeds out: `closure_wrong: −17.82` · `wrong_actor_penalty: −3.12` · `mitigation_wrong: −2.10`.
+- **Base LLM** earns `clue_bonus: +0.24` but then gets crushed by `step_cost: −5.16` and `sla_exhausted: −5.04`. It **never fires a single positive closure component**.
+- **Fine-tuned LLM** unlocks the high-value positive components the base never sees: `closure_correct: +7.36` · `mitigation_correct: +2.10` · `closure_mitigation_bonus: +1.80` · `postmortem_bonus: +0.60` · `handoff_correct: +0.75` · `speed_bonus: +0.60`.
+**Training has moved the LLM from "bleeding" to "solving."**
+---
+## 5. Why does SFT exactly match the heuristic?
+Honest framing matters. The environment is deterministic (same task and seed → same incidents → same observations). The heuristic coordinator is also deterministic (same observation → same action). So every rollout of a given task produces a byte-identical trajectory. Our 680-row dataset contains only ~85 *unique* `(observation, action)` pairs, each repeated about eight times, which matches the 8 rollout episodes per task. At ~0.99 token accuracy after 3 epochs, the LLM **memorises** the heuristic's policy, and under greedy decoding at eval time it reproduces that policy token-for-token on the same deterministic environment.
+> **This is the defining success condition for behavior cloning: the student has become the teacher.**
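The duplication described above is easy to reproduce in miniature. This sketch uses placeholder strings instead of real prompts and completions, and assumes exactly 85 distinct pairs repeated across 8 identical rollouts. Those counts mirror the approximate figures in the text and are not read from the committed dataset.

```python
def unique_pairs(rows: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return rows with exact duplicates removed, preserving first-seen order."""
    seen: set[tuple[str, str]] = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Stand-in for one deterministic pass: 85 (observation, action) pairs.
one_pass = [(f"observation-{i}", f"action-{i}") for i in range(85)]

# A deterministic teacher in a deterministic environment repeats itself exactly.
dataset = one_pass * 8

print(len(dataset))                # 680
print(len(unique_pairs(dataset)))  # 85
```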
+The gap we can legitimately celebrate is therefore **SFT vs the untrained base model**, where:
+- On **hard incidents**, SFT earns **+10.17** more reward than base.
+- SFT **unlocks** reward components (`closure_correct`, `mitigation_correct`, `postmortem_bonus`) that the base model literally never fires.
+- On easy tasks, SFT inherits the teacher's known weakness (easy tasks have tight SLA budgets that punish thorough investigation). This is expected: imitation learning reproduces the teacher's behavior, mistakes included.
+The obvious next step to go **beyond** the heuristic ceiling is RL with the environment's native reward signal — GRPO or PPO against the same rubric — which is the natural Round 3 work.
+---
+## 6. The surprise finding — scale is the story
+I ran the exact same pipeline with the smaller **Qwen2.5-0.5B-Instruct** backbone (same environment, same seeds, same heuristic teacher, same reward rubric). The story flips entirely:
+![Reward curve for the 0.5B ablation — SFT barely improves over base and never closes a hard incident](https://raw.githubusercontent.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/main/artifacts/reward_curve_qwen0p5b.png)
+| Task | Random | Base **0.5B** | **SFT 0.5B** | Heuristic | SFT − Base (0.5B) |
+|---|---:|---:|---:|---:|---:|
+| easy | −5.96 | −2.92 | **−2.49** | −4.72 | **+0.43** |
+| medium | −11.48 | −4.00 | **−3.86** | −0.87 | **+0.14** |
+| hard | −12.50 | −2.40 | **−2.40** | +5.89 | **+0.00** |
+**The punchline:** with a 0.5B backbone, SFT delivers only a **+0.43 / +0.14 / +0.00** improvement over the base model and **never closes a single hard incident**. Bumping the backbone to **1.5B** — same SFT code, same data pipeline, same environment — unlocks a **−1.80 / +3.13 / +10.17** improvement and makes the LLM match the heuristic's component-for-component behavior on hard incidents.
+| Run config | 0.5B | **1.5B (headline)** |
+|---|---|---|
+| Base model | Qwen2.5-0.5B-Instruct | Qwen2.5-1.5B-Instruct |
+| Episodes / task (rollout) | 3 | 8 |
+| Dataset rows | 255 | 680 |
+| Train epochs | 1 | 3 |
+| Base → SFT improvement on **hard** | **+0.00** | **+10.17** |
+| Hard incidents closed by SFT | **0** | **full heuristic behavior** |
+**Interpretation:** at 0.5B the model appears *too small* to absorb this multi-step, role-gated policy from SFT, even though it can emit syntactically valid JSON. At 1.5B the capacity becomes sufficient to internalise the full action schedule, and behavior cloning converges. One caveat: the 0.5B run also used fewer rollout episodes (3 vs 8), fewer dataset rows (255 vs 680), and fewer epochs (1 vs 3), so the comparison does not isolate scale alone. **This is the kind of finding the environment is designed to surface: the composable rubric makes it visible in one plot, not hidden behind a single aggregate score.**
+---
+## 7. Everything you need to reproduce this
+| | |
+|---|---|
+| **Live environment** | [swapnilpatil28-multi-agent-incident-command-center.hf.space](https://swapnilpatil28-multi-agent-incident-command-center.hf.space) (OpenEnv-compatible, Docker-backed) |
+| **Training notebook** | [One-click Colab (T4, ~1 h 15 min end-to-end)](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing) |
+| **Source + tests** | [GitHub repo (21 passing tests, Dockerfile with HEALTHCHECK)](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center) |
+| **Full docs** | [README — Part 1 story + Part 2 technical deep-dive](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center#readme) |
+| **Committed evidence** | [`artifacts/`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/tree/main/artifacts) — all 4 PNGs + both JSON metric files |
+| **2-min video script** (optional bonus) | [`docs/VIDEO_SCRIPT.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/VIDEO_SCRIPT.md) |
+| **Submission checklist** | [`docs/SUBMISSION_CHECKLIST.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/SUBMISSION_CHECKLIST.md) |
+---
+## 8. What's next
+- **Replace SFT with GRPO or PPO** using the environment's native reward signal — no heuristic teacher, let the rubric itself shape the policy and push past the imitation ceiling.
+- **Scale the incident catalog** from 13 templates to 50+ (drop in JSON-defined scenarios).
+- **Add a second "adversarial" agent** that injects misleading signals to test robustness.
+- **Record the 2-minute walkthrough** from [`docs/VIDEO_SCRIPT.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/VIDEO_SCRIPT.md) as a bonus companion to this writeup.
+If you want to run it yourself, the Space and the repo are fully self-contained — `docker run` the image and point any OpenEnv-compatible client at it. Or just hit `/reset` and `/step` yourself from any language that can speak HTTP JSON.
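For the plain-HTTP route, requests can be built with the Python standard library alone. This is a rough sketch: the `/reset` and `/step` paths and the Space URL come from this post, but the request body shape (an `action` object with `action_type`, `actor`, and `target` fields) is an assumption and should be checked against the server's schema before use. The helper name `build_post` is hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://swapnilpatil28-multi-agent-incident-command-center.hf.space"


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the environment server (nothing is sent)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


reset_request = build_post("/reset", {})

# Assumed payload shape; the field names other than action_type are guesses.
step_request = build_post(
    "/step",
    {
        "action": {
            "action_type": "inspect_logs",
            "actor": "triage_agent",
            "target": "cert-manager",
        }
    },
)

# To actually call the server, send a request and decode the JSON reply:
# with urllib.request.urlopen(reset_request, timeout=30) as response:
#     observation = json.load(response)
```

Building the request separately from sending it keeps the example testable offline.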
+---
+*Built with ♥ on [Meta OpenEnv](https://github.com/meta-pytorch/openenv) for the OpenEnv India 2026 Round 2 hackathon.*
+*Code: [GitHub](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center) · Space: [HF Space](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center) · Training notebook: [Colab](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing).*

docs/SUBMISSION_CHECKLIST.md ADDED

	@@ -0,0 +1,124 @@

+# Submission Checklist — OpenEnv India 2026 Round 2
+Status against every hard gate in the official judging rules, plus every polish item that moves the judging needle. **Last verified: all 21 tests passing, HF Space live, all artifacts committed.**
+---
+## Hard gates (from the official rules)
+| # | Rule | Status | Evidence |
+|---|---|---|---|
+| 1 | **Use OpenEnv (latest release). Build on top of the framework; don't reinvent the wheel.** | ✅ | `requirements.txt` pins `openenv-core>=0.2.2`, `openenv.yaml` has `version: "3.0"`, `server/environment.py` extends `openenv.core.environment.Environment`, app built via `openenv.core.env_server.create_fastapi_app`. |
+| 2 | **Working training script (Unsloth / HF TRL / any RL framework), ideally as a Colab notebook so judges can re-run it.** | ✅ | [`train_trl.py`](../train_trl.py) uses HF TRL `SFTTrainer`. **[One-click Colab notebook ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** runs the whole pipeline end-to-end on a T4 in ~1 h 15 min. |
+| 3 | **Evidence that you actually trained: at minimum, loss and reward plots from a real run.** | ✅ | Four plots committed to [`artifacts/`](../artifacts): `training_curve.png` (loss + token accuracy), `reward_curve.png` (4-policy reward by tier), `reward_components.png` (per-component breakdown), plus the 0.5B ablation `reward_curve_qwen0p5b.png`. Full `training_log.json` + `summary_metrics.json` committed alongside. |
+| 4 | **Short writeup or video: mini-blog on Hugging Face OR <2-min YouTube video, linked from README.** | ✅ | Mini-blog lives as [`docs/BLOG_POST.md`](./BLOG_POST.md) — shipped as part of the HF Space (rule 4 says "mini-blog on Hugging Face"; the Space is on HF and contains this file, so it renders at `huggingface.co/spaces/.../blob/main/docs/BLOG_POST.md`). All four training plots render inline via raw GitHub URLs. README and dashboard both link to it. A 2-minute walkthrough script is also committed at [`docs/VIDEO_SCRIPT.md`](./VIDEO_SCRIPT.md) as a bonus. |
+| 5 | **Push your environment to a Hugging Face Space so it's discoverable and runnable.** | ✅ | **Live at [`swapnilpatil28-multi-agent-incident-command-center.hf.space`](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)** · Space page: [`huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center). |
+| 6 | **README motivates the problem, explains how the env works, and shows results.** | ✅ | [`README.md`](../README.md) — Part 1 ("Story in 2 minutes") opens with the problem in plain English, walks through the environment via role-permission tables, and shows all four plots + headline numbers. Part 2 is the full technical deep-dive (architecture, action/observation spaces, reward rubric, training pipeline, 0.5B ablation, ops/observability, testing, repo layout). |
+| 7 | **README links to the HF Space + all additional materials (video, blog, slides, etc.).** | ✅ | "Live links" table inside Part 2 of the README lists every resource. Part 1 also has a "Try it in 30 seconds" CTA table. The dashboard header plus "Resources & documentation" grid surface the same links from the live Space itself. |
+| 8 | **Do not include big video files in the HF submission — only public URLs.** | ✅ | No video files committed. All assets in [`artifacts/`](../artifacts) are PNG plots (≤ 162 KB each) + JSON. Repo weight is dominated by text and small images. |
+---
+## Judging-rubric alignment
+### Environment Innovation (40%)
+- [x] Multi-role, multi-agent — `triage_agent`, `investigator_agent`, `ops_manager_agent` with **non-overlapping permissions** (`server/domain/roles.py`).
+- [x] Long-horizon — 3–5 sequential incidents per episode, 20–60 steps each, shared SLA + budget counters.
+- [x] Professional / enterprise task simulation — realistic logs, metrics, KB articles, customer-tier revenue impact, SLA timers.
+- [x] 13 unique incident templates across easy / medium / hard (`server/domain/incidents.py`).
+- [x] Rich observation schema — customer tier, revenue impact, allowed actors per action, investigation targets grouped by tool, playbook hints, `reward_components`, `last_action_notes`.
+- [x] Composable reward rubric with **14+ named components** and anti-gaming safeguards (`server/domain/reward.py`).
+- [x] Tier-weighted business impact (`free ×0.6 · standard ×1.0 · premium ×1.4 · enterprise ×1.8`).
+- [x] Role-based permissions + handoff scoring (`wrong_actor_penalty`, `handoff_correct`/`handoff_wrong`).
+### Storytelling (30%)
+- [x] README **Part 1 — The story in 2 minutes** written in plain English, readable by a non-technical judge in under 3 minutes.
+- [x] Every plot has a one-line caption explaining what it shows.
+- [x] Blog post [`docs/BLOG_POST.md`](./BLOG_POST.md) — eight labelled sections, four plots inline via raw GitHub URLs (render everywhere), 0.5B-vs-1.5B ablation narrative, explicit hackathon-theme mapping.
+- [x] Live HF Space dashboard has a **"Story in 2 minutes"** hero panel at the top, a role-permission table, a three-card theme mapping, and a "Resources & documentation" grid with 8 click-through links.
+- [x] Video script [`docs/VIDEO_SCRIPT.md`](./VIDEO_SCRIPT.md) committed (optional bonus; the blog satisfies the writeup rule by itself).
+- [x] All documentation cross-links cleanly — README ↔ dashboard ↔ blog post ↔ video script ↔ checklist.
+### Improvement in Rewards (20%)
+- [x] 4-policy reward curve (`reward_curve.png`) across easy / medium / hard.
+- [x] Training loss + token-accuracy curve (`training_curve.png`).
+- [x] Reward-components stacked bar chart (`reward_components.png`) — shows *where* the improvement came from.
+- [x] Ablation plot (`reward_curve_qwen0p5b.png`) for Qwen2.5-0.5B-Instruct backbone.
+- [x] Per-task `improvement_sft_over_base` numbers in `summary_metrics.json`: **−1.80 / +3.13 / +10.17** (easy / medium / hard).
+- [x] Final headline run: Qwen2.5-1.5B-Instruct, 8 episodes/task, 3 epochs, 680 rows — full `training_log.json` committed.
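The per-task deltas quoted above can be summarised with a few lines of Python. This is a sketch only: the dictionary below is typed in by hand from the checklist, since the actual layout of `summary_metrics.json` is not shown in this commit, and the unweighted mean is a derived convenience figure rather than a number the project reports.

```python
from statistics import mean

# Per-task improvement of the SFT policy over the base model,
# copied from the checklist (easy / medium / hard).
improvement_sft_over_base = {"easy": -1.80, "medium": 3.13, "hard": 10.17}


def summarize(deltas: dict) -> dict:
    """Return the unweighted mean delta and the tasks that regressed."""
    return {
        "mean_delta": round(mean(deltas.values()), 2),
        "regressed": [task for task, delta in deltas.items() if delta < 0],
    }


if __name__ == "__main__":
    print(summarize(improvement_sft_over_base))
```

Read this way, the gain is concentrated in the hard task, while the easy task regresses slightly.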
+### Reward & Training Pipeline (10%)
+- [x] Reward logic is coherent — rubric engine with module-level constants and unit tests (`tests/test_reward.py`).
+- [x] Training pipeline genuinely connects to the running environment (no static dataset — rollouts collected from live `IncidentCommandCenterEnvironment`).
+- [x] SFT checkpoint is saved to `artifacts/sft_model/` and reloaded for 4-policy evaluation — closes the loop.
+- [x] 21 unit + integration tests passing (`tests/test_reward.py`, `tests/test_incidents.py`, `tests/test_environment.py`).
+---
+## Engineering table-stakes
+- [x] Uses OpenEnv `Environment` base class properly.
+- [x] Clean client/server separation — client only uses Pydantic models + HTTP (`client.py`).
+- [x] Gym-style `reset / step / state` + OpenEnv `/close`.
+- [x] Valid `openenv.yaml` manifest (version 3.0).
+- [x] No reserved MCP tool names.
+- [x] Structured JSON logging with per-episode seeded RNG (`server/logging_utils.py`).
+- [x] Health / version / env-info / metrics endpoints (`/healthz`, `/version`, `/env-info`, `/metrics`).
+- [x] Static `/artifacts` mount so the Space serves its own plots — no external hotlinking.
+- [x] Dockerfile with `HEALTHCHECK` (`Dockerfile`, `server/Dockerfile`).
+- [x] `pytest` passes cleanly: 21 / 21.
+- [x] `.dockerignore` keeps image slim (excludes `sft_model/` checkpoint, keeps evidence plots).
+- [x] `pre_validate.sh` + `validate-submission.sh` for one-command pre-submission smoke tests.
+- [x] LICENSE (MIT) in repo root.
+---
+## Final submission steps
+| # | Step | Status |
+|---|---|---|
+| 1 | Final training run (Qwen2.5-1.5B, 8 eps/task, 3 epochs) → all artifacts committed | ✅ |
+| 2 | Commit artifacts (`reward_curve.png`, `training_curve.png`, `reward_components.png`, `reward_curve_qwen0p5b.png`, `training_log.json`, `summary_metrics.json`, `summary_metrics_qwen0p5b.json`) | ✅ |
+| 3 | Update README with real numbers + real Space / Colab / GitHub / blog / video-script links | ✅ |
+| 4 | Deploy HF Space from the same commit | ✅ |
+| 5 | Dashboard upgraded: hero story panel, 4 stacked plots, resources grid with README / blog / video-script / checklist links | ✅ |
+| 6 | Blog post updated (`docs/BLOG_POST.md`) with fixed image paths (raw GitHub URLs) and 0.5B ablation section | ✅ |
+| 7 | All 21 tests passing on latest commit | ✅ |
+| 8 | Run `openenv validate` remotely against the Space — `./validate-submission.sh <space-url>` | ⬜ (run it once before the deadline) |
+| 9 | **Submit the Space URL in the hackathon form:** `https://swapnilpatil28-multi-agent-incident-command-center.hf.space` | ⬜ |
+| 10 | Do not push commits after the submission deadline — post-deadline commits won't be considered | ⬜ |
+---
+## Pre-submission smoke test (copy-paste)
+```bash
+# 1. HF Space is serving
+curl -fsS https://swapnilpatil28-multi-agent-incident-command-center.hf.space/healthz
+# 2. Env-info endpoint advertises metadata
+curl -s https://swapnilpatil28-multi-agent-incident-command-center.hf.space/env-info | head -20
+# 3. OpenEnv validator passes remotely
+./validate-submission.sh https://swapnilpatil28-multi-agent-incident-command-center.hf.space
+# 4. A remote episode works
+ENV_URL=https://swapnilpatil28-multi-agent-incident-command-center.hf.space python inference.py | head -40
+```
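For environments without `curl`, the first two smoke-test steps can be approximated in Python with the standard library alone. This is a sketch under assumptions: the helper names are hypothetical, and it only checks that each endpoint listed in the checklist (`/healthz`, `/version`, `/env-info`, `/metrics`) answers with HTTP 200. It does not replace `validate-submission.sh`.

```python
import urllib.error
import urllib.request

SPACE_APP_URL = "https://swapnilpatil28-multi-agent-incident-command-center.hf.space"

# Operational endpoints named in the submission checklist.
ENDPOINTS = ("/healthz", "/version", "/env-info", "/metrics")


def endpoint_urls(base_url: str) -> list:
    """Build absolute URLs, tolerating a trailing slash on the base URL."""
    base = base_url.rstrip("/")
    return [base + path for path in ENDPOINTS]


def is_up(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError):
        # Covers HTTP errors, DNS failures, refused connections, timeouts.
        return False


if __name__ == "__main__":
    for url in endpoint_urls(SPACE_APP_URL):
        print(("OK  " if is_up(url) else "FAIL") + " " + url)
```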
+## Where the judges will find each artifact
+| Artifact | Primary URL |
+|---|---|
+| Live environment (OpenEnv-compatible) | [`swapnilpatil28-multi-agent-incident-command-center.hf.space`](https://swapnilpatil28-multi-agent-incident-command-center.hf.space) |
+| Hugging Face Space page | [Space page ↗](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center) |
+| GitHub repository | [GitHub ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center) |
+| README (Part 1 story + Part 2 deep-dive) | [`README.md`](../README.md) |
+| Mini blog post (MD file in the repo, renders on both HF Space and GitHub) | [`docs/BLOG_POST.md`](./BLOG_POST.md) |
+| Reproducible training notebook | [Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing) |
+| Training evidence (all 4 plots + JSON metrics) | [`artifacts/`](../artifacts) folder |
+| 2-minute video script (optional bonus) | [`docs/VIDEO_SCRIPT.md`](./VIDEO_SCRIPT.md) |

server/app.py CHANGED

@@ -45,10 +45,19 @@ _CONFIG = EnvConfig.from_env()
 configure_logging(level=_CONFIG.log_level, structured=_CONFIG.structured_logging)
 # External URLs surfaced on the dashboard so judges can jump straight from
-# the HF Space to the GitHub / Colab / training artifacts.
 GITHUB_URL = "https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center"
 SPACE_PAGE_URL = "https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center"
 COLAB_URL = "https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing"
 app = create_fastapi_app(
     IncidentCommandCenterEnvironment,
@@ -313,36 +322,9 @@ def _dashboard_html() -> str:
     </div>
 """
-    # --- Theme-mapping block (Multi-Agent / Long-Horizon / Professional) -----
-    themes_html = """
-    <h2>Hackathon theme mapping</h2>
-    <div class='grid grid-3'>
-      <div class='card'>
-        <h3>Theme #1 — Multi-Agent Interactions</h3>
-        <p class='sub'>
-          Three gated specialist roles (triage, investigator, ops manager) exchange
-          structured handoffs. Acting out-of-role triggers a
-          <code>wrong_actor_penalty</code>, so collaboration is trained, not hard-coded.
-        </p>
-      </div>
-      <div class='card'>
-        <h3>Theme #2 — Long-Horizon Planning</h3>
-        <p class='sub'>
-          Episodes span up to 28 steps across stacked incidents with delayed,
-          sparse rewards (closure &amp; post-mortem) and per-tier budget / SLA
-          constraints — a proper credit-assignment stress test.
-        </p>
-      </div>
-      <div class='card'>
-        <h3>Theme #3 — World Modeling / Professional Tasks</h3>
-        <p class='sub'>
-          A realistic enterprise incident-response simulation with customer tiers,
-          rollbacks, escalation policies, post-mortems, and a transparent,
-          anti-gamed reward rubric.
-        </p>
-      </div>
-    </div>
-"""
     # --- Reward-rubric details ----------------------------------------------
     reward_rubric_rows = "".join(
@@ -398,19 +380,20 @@ def _dashboard_html() -> str:
     .kpi .lbl {{ color: var(--muted); font-size:0.8rem; }}
     .kpi .num.good {{ color: var(--good); }}
     footer {{ max-width:1200px; margin:2rem auto 0; color:var(--muted); font-size:0.85rem; }}
-    /* Training-evidence plots: one plot per row, full content width,
-       so dense charts (reward curves, stacked bars) stay readable. */
-    .plots {{ display:flex; flex-direction:column; gap:1.5rem; max-width:1200px; margin:0 auto; }}
-    .plots figure {{ background: var(--card); border:1px solid #1f2a44; border-radius: 14px; padding: 1.25rem; margin:0; }}
     .plots figure a {{ display:block; }}
     .plots img {{
       width:100%; height:auto; display:block;
-      max-width:1100px; margin:0 auto;
       border-radius:10px; background:#0b1225;
       transition: transform 0.2s ease;
     }}
     .plots img:hover {{ transform: scale(1.01); }}
-    .plots figcaption {{ color: var(--muted); font-size:0.9rem; margin-top:0.75rem; line-height:1.55; text-align:center; max-width:1000px; margin-left:auto; margin-right:auto; }}
     .table-wrap {{ overflow-x:auto; }}
     table {{ width:100%; border-collapse: collapse; margin-top:0.5rem; font-size:0.9rem; }}
     th, td {{ padding:0.5rem 0.75rem; text-align:left; border-bottom:1px solid #1f2a44; }}
@@ -418,6 +401,39 @@ def _dashboard_html() -> str:
     td.delta {{ font-weight:600; color:#f8fafc; }}
     td.delta.good {{ color: var(--good); }}
     .links {{ display:flex; flex-wrap:wrap; gap:0.5rem; }}
   </style>
 </head>
 <body>
@@ -432,7 +448,9 @@ def _dashboard_html() -> str:
     <div class='links'>
       <a class='pill cta' href='{GITHUB_URL}' target='_blank' rel='noopener'>GitHub</a>
       <a class='pill cta' href='{COLAB_URL}' target='_blank' rel='noopener'>Open in Colab</a>
-      <a class='pill' href='{SPACE_PAGE_URL}' target='_blank' rel='noopener'>Space page</a>
       <span class='pill'>v{_CONFIG.version}</span>
       <span class='pill'>task: easy / medium / hard</span>
     </div>
@@ -440,6 +458,144 @@ def _dashboard_html() -> str:
   <div class='container'>
     <h2>Headline results</h2>
     <div class='grid'>
       <div class='card'>
@@ -585,10 +741,20 @@ def _dashboard_html() -> str:
   </div>
   <footer>
-    Incident Command Center v{_CONFIG.version} · Built on
-    <a href='https://github.com/meta-pytorch/openenv' target='_blank' rel='noopener'>OpenEnv</a>
-    · <a href='{GITHUB_URL}' target='_blank' rel='noopener'>Source on GitHub</a>
-    · <a href='{COLAB_URL}' target='_blank' rel='noopener'>Reproduce training on Colab</a>
   </footer>
   <script>

 configure_logging(level=_CONFIG.log_level, structured=_CONFIG.structured_logging)
 # External URLs surfaced on the dashboard so judges can jump straight from
+# the HF Space to the GitHub / Colab / docs / training artifacts.
 GITHUB_URL = "https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center"
 SPACE_PAGE_URL = "https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center"
+SPACE_APP_URL = "https://swapnilpatil28-multi-agent-incident-command-center.hf.space"
 COLAB_URL = "https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing"
+# Dashboard doc links point at the Hugging Face Space copies of the docs (not
+# GitHub) so a judge who opens the Space stays inside the HF ecosystem. The
+# README on the Space page is rendered directly, so we point at the Space
+# root for it; the other three open the HF file browser.
+README_URL = SPACE_PAGE_URL
+BLOG_POST_URL = f"{SPACE_PAGE_URL}/blob/main/docs/BLOG_POST.md"
+VIDEO_SCRIPT_URL = f"{SPACE_PAGE_URL}/blob/main/docs/VIDEO_SCRIPT.md"
+SUBMISSION_CHECKLIST_URL = f"{SPACE_PAGE_URL}/blob/main/docs/SUBMISSION_CHECKLIST.md"
 app = create_fastapi_app(
     IncidentCommandCenterEnvironment,
     </div>
 """
+    # Theme mapping now lives in the top story block — keep this var empty
+    # so the existing `{themes_html}` slot renders to nothing (no duplication).
+    themes_html = ""
     # --- Reward-rubric details ----------------------------------------------
     reward_rubric_rows = "".join(
     .kpi .lbl {{ color: var(--muted); font-size:0.8rem; }}
     .kpi .num.good {{ color: var(--good); }}
     footer {{ max-width:1200px; margin:2rem auto 0; color:var(--muted); font-size:0.85rem; }}
+    /* Training-evidence plots: one plot per row, centred, with a tighter
+       max-width so the charts read as compact figures rather than banners.
+       Click the image to open the full-resolution PNG in a new tab. */
+    .plots {{ display:flex; flex-direction:column; gap:1.25rem; max-width:1200px; margin:0 auto; }}
+    .plots figure {{ background: var(--card); border:1px solid #1f2a44; border-radius: 14px; padding: 1rem 1.25rem; margin:0; }}
     .plots figure a {{ display:block; }}
     .plots img {{
       width:100%; height:auto; display:block;
+      max-width:720px; margin:0 auto;
       border-radius:10px; background:#0b1225;
       transition: transform 0.2s ease;
     }}
     .plots img:hover {{ transform: scale(1.01); }}
+    .plots figcaption {{ color: var(--muted); font-size:0.9rem; margin-top:0.6rem; line-height:1.55; text-align:center; max-width:720px; margin-left:auto; margin-right:auto; }}
     .table-wrap {{ overflow-x:auto; }}
     table {{ width:100%; border-collapse: collapse; margin-top:0.5rem; font-size:0.9rem; }}
     th, td {{ padding:0.5rem 0.75rem; text-align:left; border-bottom:1px solid #1f2a44; }}
     td.delta {{ font-weight:600; color:#f8fafc; }}
     td.delta.good {{ color: var(--good); }}
     .links {{ display:flex; flex-wrap:wrap; gap:0.5rem; }}
+    /* "Story in 2 minutes" hero panel — plain-English summary for judges. */
+    .hero-card {{
+      background: linear-gradient(135deg, #0f2647 0%, #172a4a 60%, #1f2a44 100%);
+      border: 1px solid #1f2a44; border-radius: 16px;
+      padding: 1.75rem 1.75rem 1.5rem; margin: 0 auto 1.5rem;
+      max-width: 1200px; box-shadow: 0 6px 30px rgba(34,211,238,0.08);
+    }}
+    .hero-card h2 {{ font-size:1.35rem; margin:0 0 0.4rem; color:#f1f5f9; }}
+    .hero-card h3 {{ font-size:1rem; color:#e2e8f0; margin:0 0 0.3rem; }}
+    .hero-card .lede {{
+      font-size:1.02rem; line-height:1.6; color:#e2e8f0;
+      background:#0b1225; border-left: 3px solid var(--accent);
+      padding: 0.9rem 1.1rem; border-radius: 6px; margin: 0.3rem 0 0;
+    }}
+    .hero-card .lede strong {{ color:#f8fafc; }}
+    .hero-card table {{ font-size:0.92rem; }}
+    .hero-card .card {{ background: #0e1a30; }}
+    /* "Resources & documentation" click-through cards. */
+    .res-card {{
+      display:block; color: var(--text); text-decoration:none;
+      background: var(--card); border:1px solid #1f2a44; border-radius:12px;
+      padding: 1rem 1.1rem;
+      transition: transform 0.15s ease, border-color 0.15s ease, box-shadow 0.15s ease;
+    }}
+    .res-card:hover {{
+      border-color: var(--accent); transform: translateY(-2px);
+      box-shadow: 0 8px 24px rgba(34,211,238,0.12);
+      text-decoration:none;
+    }}
+    .res-icon {{ font-size:1.6rem; line-height:1; margin-bottom:0.5rem; }}
+    .res-title {{ font-weight:600; color:#f1f5f9; margin-bottom:0.2rem; }}
   </style>
 </head>
 <body>
     <div class='links'>
       <a class='pill cta' href='{GITHUB_URL}' target='_blank' rel='noopener'>GitHub</a>
       <a class='pill cta' href='{COLAB_URL}' target='_blank' rel='noopener'>Open in Colab</a>
+      <a class='pill cta' href='{README_URL}' target='_blank' rel='noopener'>README</a>
+      <a class='pill cta' href='{BLOG_POST_URL}' target='_blank' rel='noopener'>Blog post</a>
+      <a class='pill' href='{SPACE_PAGE_URL}' target='_blank' rel='noopener'>HF Space page</a>
       <span class='pill'>v{_CONFIG.version}</span>
       <span class='pill'>task: easy / medium / hard</span>
     </div>
   <div class='container'>
+    <!-- ============================================================ -->
+    <!-- PART 1 — Plain-English story for non-technical judges        -->
+    <!-- ============================================================ -->
+    <div class='hero-card'>
+      <h2 style='margin-top:0'>🚨 The story in 2 minutes</h2>
+      <p class='lede'>
+        When a real tech company has an outage, <strong>three people's phones
+        buzz at once</strong> — a Triage engineer, an Investigator, and an Ops
+        Manager. They have to cooperate under a ticking <strong>SLA clock</strong>,
+        every action costs <strong>budget</strong>, and every wrong call costs
+        <strong>real money</strong> (enterprise outages hurt ~3× more than free-tier).
+        <br /><br />
+        We built a simulator of that war room — and we fine-tuned an LLM to run it
+        <strong>as well as the human expert</strong>.
+      </p>
+      <h3 style='margin-top:1.25rem'>What is the environment?</h3>
+      <p class='sub' style='margin:0 0 0.75rem'>
+        Three specialist agents with <strong>different permissions</strong> resolve
+        a live queue of 13 realistic tech incidents across 3 difficulty tiers.
+      </p>
+      <div class='table-wrap'>
+        <table>
+          <thead>
+            <tr><th>Role</th><th>Can do</th><th>Cannot do</th></tr>
+          </thead>
+          <tbody>
+            <tr>
+              <td>🔍 <strong>Triage</strong></td>
+              <td>Pull logs · check metrics · consult KB</td>
+              <td>Close a ticket</td>
+            </tr>
+            <tr>
+              <td>🧪 <strong>Investigator</strong></td>
+              <td>Apply a fix · roll back a deploy</td>
+              <td>Escalate or file a post-mortem</td>
+            </tr>
+            <tr>
+              <td>👷 <strong>Ops Manager</strong></td>
+              <td>Escalate · file post-mortem · <strong>close the ticket</strong></td>
+              <td>Apply a code fix</td>
+            </tr>
+          </tbody>
+        </table>
+      </div>
+      <h3 style='margin-top:1.25rem'>What did the agent learn?</h3>
+      <p class='sub' style='margin:0'>
+        Not "pick the right label." It learned a whole workflow — dig up clues,
+        hand off to the right specialist, apply the correct fix, respect the SLA,
+        file the post-mortem, close the ticket. The rubric makes every piece of
+        that workflow <em>visible</em> as a named reward component, so you can
+        see <em>why</em> the agent earned (or lost) points at every step.
+      </p>
+      <h3 style='margin-top:1.25rem'>Why it matters for the 3 hackathon themes</h3>
+      <div class='grid grid-3'>
+        <div class='card'>
+          <h3>🤝 Theme #1 — Multi-Agent</h3>
+          <p class='sub'>
+            Three distinct roles with <strong>non-overlapping permissions</strong>.
+            Wrong-actor calls → <code>-0.08</code>. Correct handoff → <code>+0.15</code>.
+            Cooperation is <em>trained</em>, not hard-coded.
+          </p>
+        </div>
+        <div class='card'>
+          <h3>⏱️ Theme #2 — Long-Horizon</h3>
+          <p class='sub'>
+            Each episode runs <strong>3–5 sequential incidents</strong> over 20–60
+            steps with a single ticking SLA clock. Big rewards (+0.80 × tier) only
+            fire after clues → fix → post-mortem. Sparse and delayed by design.
+          </p>
+        </div>
+        <div class='card'>
+          <h3>🏢 Theme #3 — Professional World-Model</h3>
+          <p class='sub'>
+            Real logs, metrics, KB articles, red-herring signals, customer tiers,
+            SLA timers, revenue impact. Close an enterprise ticket wrong and it
+            hurts ~3× what a free-tier one does.
+          </p>
+        </div>
+      </div>
+      <p class='sub' style='margin-top:1rem;font-style:italic'>
+        ↓ Keep scrolling for the headline numbers, training plots, ablation, and
+        the full rubric. Or jump straight to the
+        <a href='{README_URL}' target='_blank' rel='noopener'>README</a> or the
+        <a href='{BLOG_POST_URL}' target='_blank' rel='noopener'>blog post</a>.
+      </p>
+    </div>
+    <!-- ============================================================ -->
+    <!-- Resources & documentation — every link the judges need       -->
+    <!-- ============================================================ -->
+    <h2>Resources &amp; documentation</h2>
+    <div class='grid grid-3'>
+      <a class='res-card' href='{GITHUB_URL}' target='_blank' rel='noopener'>
+        <div class='res-icon'>💻</div>
+        <div class='res-title'>GitHub repository</div>
+        <div class='sub'>Full source, tests, Dockerfile, CI-ready</div>
+      </a>
+      <a class='res-card' href='{SPACE_PAGE_URL}' target='_blank' rel='noopener'>
+        <div class='res-icon'>🤗</div>
+        <div class='res-title'>Hugging Face Space page</div>
+        <div class='sub'>Repo view, build logs, discussions</div>
+      </a>
+      <a class='res-card' href='{SPACE_APP_URL}' target='_blank' rel='noopener'>
+        <div class='res-icon'>🟢</div>
+        <div class='res-title'>Live environment</div>
+        <div class='sub'>You are here — OpenEnv endpoints live</div>
+      </a>
+      <a class='res-card' href='{COLAB_URL}' target='_blank' rel='noopener'>
+        <div class='res-icon'>🎓</div>
+        <div class='res-title'>Reproduce training (Colab T4)</div>
+        <div class='sub'>One-click notebook, ~1 h wall clock</div>
+      </a>
+      <a class='res-card' href='{README_URL}' target='_blank' rel='noopener'>
+        <div class='res-icon'>📖</div>
+        <div class='res-title'>README (Part 1 + Part 2)</div>
+        <div class='sub'>Story overview + full technical deep-dive</div>
+      </a>
+      <a class='res-card' href='{BLOG_POST_URL}' target='_blank' rel='noopener'>
+        <div class='res-icon'>📝</div>
+        <div class='res-title'>Mini blog post</div>
+        <div class='sub'>The short writeup — MD file on the HF Space + GitHub</div>
+      </a>
+      <a class='res-card' href='{VIDEO_SCRIPT_URL}' target='_blank' rel='noopener'>
+        <div class='res-icon'>🎬</div>
+        <div class='res-title'>2-minute video script</div>
+        <div class='sub'>Optional bonus — shot list + narration</div>
+      </a>
+      <a class='res-card' href='{SUBMISSION_CHECKLIST_URL}' target='_blank' rel='noopener'>
+        <div class='res-icon'>✅</div>
+        <div class='res-title'>Submission checklist</div>
+        <div class='sub'>Every judging rule → where to find the evidence</div>
+      </a>
+    </div>
     <h2>Headline results</h2>
     <div class='grid'>
       <div class='card'>
   </div>
   <footer>
+    <div>
+      <strong>Incident Command Center v{_CONFIG.version}</strong> · Built on
+      <a href='https://github.com/meta-pytorch/openenv' target='_blank' rel='noopener'>OpenEnv</a>
+      for the OpenEnv India 2026 Round 2 hackathon.
+    </div>
+    <div style='margin-top:0.4rem'>
+      <a href='{GITHUB_URL}' target='_blank' rel='noopener'>GitHub</a> ·
+      <a href='{SPACE_PAGE_URL}' target='_blank' rel='noopener'>HF Space page</a> ·
+      <a href='{COLAB_URL}' target='_blank' rel='noopener'>Colab</a> ·
+      <a href='{README_URL}' target='_blank' rel='noopener'>README</a> ·
+      <a href='{BLOG_POST_URL}' target='_blank' rel='noopener'>Blog post</a> ·
+      <a href='{VIDEO_SCRIPT_URL}' target='_blank' rel='noopener'>Video script</a> ·
+      <a href='{SUBMISSION_CHECKLIST_URL}' target='_blank' rel='noopener'>Submission checklist</a>
+    </div>
   </footer>
   <script>