File size: 18,559 Bytes
02f3541 8cbdbde 02f3541 8062d98 02f3541 8062d98 02f3541 8062d98 02f3541 8cbdbde 02f3541 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 | # Teaching LLMs to Run an Incident Command Center
> *India's Biggest Mega AI Hackathon β Built on Meta OpenEnv Β· Round 2*
**TL;DR** β I built an [OpenEnv](https://github.com/meta-pytorch/openenv) environment where three specialist agents β **Triage**, **Investigator**, and **Ops Manager** β cooperate to resolve real-world tech incidents under SLA pressure, budget constraints, and customer-tier business impact. I then fine-tuned Qwen2.5-1.5B-Instruct on heuristic rollouts and watched it **close a +10.17-reward gap on hard incidents**, matching the hand-coded expert policy component-for-component. A separate 0.5B ablation shows that **model scale is the story** β same pipeline, same data schema, but the smaller backbone never closes a single hard incident.
### π Everything in one place
| What | Where |
|---|---|
| π’ **Live environment (OpenEnv-compatible)** | **[swapnilpatil28-multi-agent-incident-command-center.hf.space β](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)** |
| π€ **Hugging Face Space page** | **[huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center β](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
| π» **GitHub source code** | **[github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center β](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
| π **Reproducible training (Colab T4)** | **[Open in Colab β](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
| π **Full README** (story + technical deep-dive) | **[github.com/.../README.md β](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center#readme)** |
| β
**Submission checklist** | [`docs/SUBMISSION_CHECKLIST.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/SUBMISSION_CHECKLIST.md) |
---
## 1. The story in 2 minutes (for anyone)
**When a real tech company has an outage, three people's phones buzz at once.** A Triage engineer scans logs and dashboards. An Investigator forms a hypothesis and applies a fix. An Ops Manager decides who owns the work, whether to escalate, and when to officially close the incident.
Each role has **different permissions**, **different information needs**, and a **different clock to beat**. Get it wrong and you bleed budget, bust the SLA, and β if the customer is on an enterprise contract β lose serious money (~3Γ what a free-tier outage costs).
I built a simulator of that war room β an **OpenEnv-compatible** environment with 13 realistic incidents, 3 specialist roles, and 14+ named reward signals β and fine-tuned an LLM to run it.
| Role | Can do | Cannot do |
|---|---|---|
| π **Triage agent** | Pull logs Β· check metrics Β· consult KB articles | Close a ticket |
| π§ͺ **Investigator** | Apply a fix Β· roll back a deploy | Escalate or file a post-mortem |
| π· **Ops Manager** | Escalate Β· file post-mortem Β· **close the ticket** | Apply a code fix |
> **The headline number:** the fine-tuned LLM earns **+10.17 more reward on hard incidents** than the untrained base β and matches the human-written expert policy component-for-component.

One picture, four policies, three difficulty tiers. Random is the floor. The untuned base LLM plateaus because it never learns to actually **close** an incident. The fine-tuned model climbs sharply with difficulty and catches the hand-coded expert exactly.
---
## 2. Why this is a real RL problem (three themes in one environment)
Most RL environments for LLMs are single-agent, single-step, or turn-based games. Real enterprise work is none of those. This environment deliberately tests all three Round-2 hackathon themes simultaneously:
| Hackathon theme | How this environment satisfies it |
|---|---|
| **π€ #1 β Multi-Agent Interactions** | Three distinct specialist roles with **non-overlapping permissions**. Acting out-of-role triggers a `wrong_actor_penalty` (β0.08). Correct handoffs earn `+0.15`. Collaboration is **trained**, not hard-coded. |
| **β±οΈ #2 β Long-Horizon Planning** | Each episode carries **3β5 sequential incidents**, 20β60 steps apiece, under a single ticking SLA clock. The big reward (+0.80 Γ tier) only fires after clues β fix β post-mortem. **Sparse and delayed by design** β the 20-step credit-assignment problem is the whole point. |
| **π’ #3 β World Modeling / Professional Tasks** | Incidents carry **real logs, metrics, KB articles, red-herring signals**, and **business metadata** (customer tier, affected users, $/min revenue impact). Closure rewards scale by tier (free Γ0.6 Β· standard Γ1.0 Β· premium Γ1.4 Β· **enterprise Γ1.8**), and wrong closures are punished the same way. Close an enterprise ticket incorrectly and it hurts **~3Γ what a free-tier one does**. |
---
## 3. What the environment looks like under the hood
The environment runs as a standard OpenEnv FastAPI server β same Gym-style `reset / step` contract, same Pydantic observation/action schemas, same Docker image format for Hugging Face Spaces.
### Observation (partial)
```json
{
"incident_id": "inc-cert-expiry",
"incident_title": "mTLS cert expired β all microservices throwing 500s",
"incident_description": "Alerting fired at 03:12 UTC ...",
"customer_tier": "enterprise",
"affected_users_estimate": 140000,
"revenue_impact_usd_per_min": 4800,
"postmortem_required": true,
"visible_signals": ["mtls handshake errors", "5xx spike in checkout"],
"investigation_targets": {
"logs": ["cert-manager", "auth-service"],
"metrics": ["dash-mesh", "dash-auth"],
"kb": ["kb-mtls-chain", "kb-cert-rotation"]
},
"allowed_actors_by_action": {
"apply_fix": ["investigator_agent"],
"close_incident": ["ops_manager_agent"]
},
"budget_remaining": 18,
"sla_minutes_remaining": 40,
"clues_found": 2,
"mitigation_applied": false,
"reward_components": {"step_cost": -0.04, "clue_bonus": +0.12}
}
```
### Action space
| action_type | Typical actor | Purpose |
|---|---|---|
| `inspect_logs` / `inspect_metrics` / `consult_kb` | triage / investigator | Gather clues (reward shapes here) |
| `negotiate_handoff` | ops_manager | Route to correct owner |
| `apply_fix` | investigator | Apply mitigation (scored vs ground truth) |
| `rollback` | investigator | Revert last change |
| `escalate` | ops_manager | Engage senior staff |
| `submit_postmortem` | ops_manager | Required on tier-1 / high-revenue incidents |
| `close_incident` | ops_manager | Terminal action β final score depends on clues found + mitigation quality + post-mortem + speed |
### Reward rubric (composable, not monolithic)
The reward engine emits **named components** at every step so training curves β and judges β can see *exactly where reward came from*:
| Component | When it fires | Sign |
|---|---|---|
| `step_cost` | Every action (β0.01 to β0.08 by action type) | β |
| `clue_bonus` | Unique log/metric/KB lookup that surfaces a real fact | **+** |
| `handoff_correct` / `handoff_wrong` | Ops manager routes to allowed / disallowed owner | Β± |
| `mitigation_correct` / `mitigation_wrong` / `mitigation_empty` | Fix matches / contradicts / omits ground-truth keywords | Β± |
| `rollback_effective` / `rollback_ineffective` | Rollback summary matches the incident's accepted playbook | Β± |
| `escalation_needed` / `escalation_not_needed` | Escalation raised for an incident that actually warrants it | Β± |
| `closure_correct` / `closure_wrong` | Final close decision matches incident state | Β± (scaled by customer tier) |
| `closure_mitigation_bonus` | Close after a correct mitigation | **+** |
| `closure_under_investigated` | Close without enough clues found | β |
| `speed_bonus` | Close in β€ 7 / β€ 4 steps | **+** |
| `postmortem_bonus` / `postmortem_missing` | Post-mortem filed / skipped on a high-impact incident | Β± |
| `repeated_lookup_penalty` | Re-querying the same log/metric/KB | β |
| `wrong_actor_penalty` | Action invoked by a role that's not authorised | β |
| `invalid_action` | Unrecognised `action_type` | β |
| `sla_exhausted` / `budget_exhausted` | Terminal penalty when SLA / action budget hits zero | β |
**Anti-gaming:** closing early with zero clues is penalised; spamming cheap `inspect_logs` racks up `repeated_lookup_penalty`; triggering `apply_fix` without investigator permissions gives `wrong_actor_penalty`. A policy cannot shortcut its way to a high score.
---
## 4. Training: HF TRL SFT on heuristic rollouts
I first wrote a deterministic `HeuristicCoordinator` that uses the observation's `investigation_targets` and role constraints to play through the environment. On hard tasks it earns **+5.89** reward where random scores **β12.50** β so that gives us ~680 `(prompt, completion)` pairs of "good" behavior to imitate.
Training script: [`train_trl.py`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/train_trl.py). One command on Colab T4 (or **[open the reproducible notebook β](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)**) runs the entire pipeline:
```python
os.environ["BASE_MODEL"] = "Qwen/Qwen2.5-1.5B-Instruct"
os.environ["EPISODES_PER_TASK"] = "8"
os.environ["TRAIN_EPOCHS"] = "3"
os.environ["EVAL_LLM_MODELS"] = "true"
os.environ["MAX_LLM_EVAL_STEPS"] = "120"
!python train_trl.py
```
The script:
1. Rolls out the heuristic against the live environment and collects prompts/completions.
2. Runs TRL `SFTTrainer` with a single `text` column (chat-template applied).
3. Saves the fine-tuned checkpoint to `artifacts/sft_model/`.
4. Rolls out **four** policies under identical seeds β random, heuristic, base LLM, fine-tuned LLM.
5. Writes `reward_curve.png`, `training_curve.png`, `reward_components.png`, `summary_metrics.json`, and `training_log.json`.
### Training loss + token accuracy

Loss drops from **~2.84 β ~0.02** over three epochs as the model learns the structured JSON action format. Mean token accuracy climbs from **~0.49 β ~0.99**. Satisfies the hackathon "loss AND reward plots" minimum requirement.
### Four-policy reward comparison

| Task | Random | Base LLM | **Fine-tuned (SFT)** | Heuristic |
|---|---:|---:|---:|---:|
| easy | β5.96 | β2.92 | **β4.72** | β4.72 |
| medium | β11.48 | β4.00 | **β0.87** | β0.87 |
| hard | β12.50 | β4.28 | **+5.89** | +5.89 |
**Fine-tuned vs untrained base: +10.17 reward delta on hard-difficulty incidents.**
- **Random** is the floor on every task.
- **Base LLM** already beats random on easy because it produces well-formed-ish JSON β but it never closes a single incident, so it just racks up step-costs and SLA penalties.
- **Fine-tuned LLM** catches the heuristic teacher exactly. The environment is deterministic and SFT hit ~0.99 token accuracy, so the student literally reproduces the teacher's action sequence under greedy decoding. This is **imitation learning converging to the expert** β the meaningful headline number is therefore **SFT vs base**, not SFT vs heuristic.
### Reward sources β what each policy actually earns

This is the chart I'm proudest of, because it makes the training signal **legible**. Summed across all three tasks:
- **Random** bleeds out: `closure_wrong: β17.82` Β· `wrong_actor_penalty: β3.12` Β· `mitigation_wrong: β2.10`.
- **Base LLM** earns `clue_bonus: +0.24` but then gets crushed by `step_cost: β5.16` and `sla_exhausted: β5.04`. It **never fires a single positive closure component**.
- **Fine-tuned LLM** unlocks the high-value positive components the base never sees: `closure_correct: +7.36` Β· `mitigation_correct: +2.10` Β· `closure_mitigation_bonus: +1.80` Β· `postmortem_bonus: +0.60` Β· `handoff_correct: +0.75` Β· `speed_bonus: +0.60`.
**Training has moved the LLM from "bleeding" to "solving."**
---
## 5. Why does SFT exactly match the heuristic?
Honest framing matters. The environment is deterministic (same task β same incidents β same observations β same seeds). The heuristic coordinator is also deterministic (same observation β same action). So every rollout of a given task produces a byte-identical trajectory. Our 680-row dataset contains only ~85 *unique* `(observation, action)` pairs, each duplicated for redundancy. At ~0.99 token accuracy after 3 epochs, the LLM **memorises** the heuristic's policy, and under greedy decoding at eval time it reproduces that policy token-for-token on the same deterministic environment.
> **This is the defining success condition for behavior cloning: the student has become the teacher.**
The gap we can legitimately celebrate is therefore **SFT vs the untrained base model**, where:
- On **hard incidents**, SFT earns **+10.17** more reward than base.
- SFT **unlocks** reward components (`closure_correct`, `mitigation_correct`, `postmortem_bonus`) that the base model literally never fires.
- On easy tasks, SFT inherits the teacher's known weakness (easy tasks have tight SLA budgets that punish thorough investigation). This is exactly what imitation learning should do β including the teacher's mistakes.
The obvious next step to go **beyond** the heuristic ceiling is RL with the environment's native reward signal β GRPO or PPO against the same rubric β which is the natural Round 3 work.
---
## 6. The surprise finding β scale is the story
I ran the exact same pipeline with the smaller **Qwen2.5-0.5B-Instruct** backbone (same environment, same seeds, same heuristic teacher, same reward rubric). The story flips entirely:

| Task | Random | Base **0.5B** | **SFT 0.5B** | Heuristic | SFT β Base (0.5B) |
|---|---:|---:|---:|---:|---:|
| easy | β5.96 | β2.92 | **β2.49** | β4.72 | **+0.43** |
| medium | β11.48 | β4.00 | **β3.86** | β0.87 | **+0.14** |
| hard | β12.50 | β2.40 | **β2.40** | +5.89 | **+0.00** |
**The punchline:** with a 0.5B backbone, SFT delivers only a **+0.43 / +0.14 / +0.00** improvement over the base model and **never closes a single hard incident**. Bumping the backbone to **1.5B** β same SFT code, same data pipeline, same environment β unlocks a **β1.80 / +3.13 / +10.17** improvement and makes the LLM match the heuristic's component-for-component behavior on hard incidents.
| Run config | 0.5B | **1.5B (headline)** |
|---|---|---|
| Base model | Qwen2.5-0.5B-Instruct | Qwen2.5-1.5B-Instruct |
| Episodes / task (rollout) | 3 | 8 |
| Dataset rows | 255 | 680 |
| Train epochs | 1 | 3 |
| Base β SFT improvement on **hard** | **+0.00** | **+10.17** |
| Hard incidents closed by SFT | **0** | **full heuristic behavior** |
**Interpretation:** at 0.5B the model is *too small* to absorb this multi-step, role-gated policy from SFT, even though it can emit syntactically valid JSON. At 1.5B the capacity suddenly becomes sufficient to internalise the full action schedule, and behavior cloning converges. **This is the kind of finding the environment is designed to surface β the composable rubric makes it visible in one plot, not hidden behind a single aggregate score.**
---
## 7. Everything you need to reproduce this
| | |
|---|---|
| **Live environment** | [swapnilpatil28-multi-agent-incident-command-center.hf.space](https://swapnilpatil28-multi-agent-incident-command-center.hf.space) (OpenEnv-compatible, Docker-backed) |
| **Training notebook** | [One-click Colab (T4, ~1 h 15 min end-to-end)](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing) |
| **Source + tests** | [GitHub repo (21 passing tests, Dockerfile with HEALTHCHECK)](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center) |
| **Full docs** | [README β Part 1 story + Part 2 technical deep-dive](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/README.md) |
| **Committed evidence** | [`artifacts/`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center/tree/main/artifacts) β all 4 PNGs + both JSON metric files |
| **Submission checklist** | [`docs/SUBMISSION_CHECKLIST.md`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/SUBMISSION_CHECKLIST.md) |
---
## 8. What's next (Planned)
- **Replace SFT with GRPO or PPO** using the environment's native reward signal β no heuristic teacher, let the rubric itself shape the policy and push past the imitation ceiling.
- **Scale the incident catalog** from 13 templates to 50+ (drop in JSON-defined scenarios).
- **Add a second "adversarial" agent** that injects misleading signals to test robustness.
If you want to run it yourself, the Space and the repo are fully self-contained β `docker run` the image and point any OpenEnv-compatible client at it. Or just hit `/reset` and `/step` yourself from any language that can speak HTTP JSON.
---
*Built with β₯ on [Meta OpenEnv](https://github.com/meta-pytorch/openenv) for the OpenEnv India 2026 Round 2 hackathon.*
*Code: [GitHub](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center) Β· Space: [HF Space](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center) Β· Training notebook: [Colab](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing).*
|