File size: 18,559 Bytes
02f3541
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8cbdbde
02f3541
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8062d98
02f3541
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8062d98
 
 
02f3541
 
 
8062d98
02f3541
 
8cbdbde
02f3541
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
# Teaching LLMs to Run an Incident Command Center

> *India's Biggest Mega AI Hackathon β€” Built on Meta OpenEnv Β· Round 2*

**TL;DR** β€” I built an [OpenEnv](https://github.com/meta-pytorch/openenv) environment where three specialist agents β€” **Triage**, **Investigator**, and **Ops Manager** β€” cooperate to resolve real-world tech incidents under SLA pressure, budget constraints, and customer-tier business impact. I then fine-tuned Qwen2.5-1.5B-Instruct on heuristic rollouts and watched it **close a +10.17-reward gap on hard incidents**, matching the hand-coded expert policy component-for-component. A separate 0.5B ablation shows that **model scale is the story** β€” same pipeline, same data schema, but the smaller backbone never closes a single hard incident.

### πŸ”— Everything in one place

| What | Where |
|---|---|
| 🟒 **Live environment (OpenEnv-compatible)** | **[swapnilpatil28-multi-agent-incident-command-center.hf.space β†—](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)** |
| πŸ€— **Hugging Face Space page** | **[huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center β†—](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
| πŸ’» **GitHub source code** | **[github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center β†—](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
| πŸŽ“ **Reproducible training (Colab T4)** | **[Open in Colab β†—](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
| πŸ“– **Full README** (story + technical deep-dive) | **[github.com/.../README.md β†—](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center#readme)** |
| βœ… **Submission checklist** | [`docs/SUBMISSION_CHECKLIST.md`](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/SUBMISSION_CHECKLIST.md) |

---

## 1. The story in 2 minutes (for anyone)

**When a real tech company has an outage, three people's phones buzz at once.** A Triage engineer scans logs and dashboards. An Investigator forms a hypothesis and applies a fix. An Ops Manager decides who owns the work, whether to escalate, and when to officially close the incident.

Each role has **different permissions**, **different information needs**, and a **different clock to beat**. Get it wrong and you bleed budget, bust the SLA, and β€” if the customer is on an enterprise contract β€” lose serious money (~3Γ— what a free-tier outage costs).

I built a simulator of that war room β€” an **OpenEnv-compatible** environment with 13 realistic incidents, 3 specialist roles, and 14+ named reward signals β€” and fine-tuned an LLM to run it.

| Role | Can do | Cannot do |
|---|---|---|
| πŸ” **Triage agent** | Pull logs Β· check metrics Β· consult KB articles | Close a ticket |
| πŸ§ͺ **Investigator** | Apply a fix Β· roll back a deploy | Escalate or file a post-mortem |
| πŸ‘· **Ops Manager** | Escalate Β· file post-mortem Β· **close the ticket** | Apply a code fix |

> **The headline number:** the fine-tuned LLM earns **+10.17 more reward on hard incidents** than the untrained base β€” and matches the human-written expert policy component-for-component.

![Reward curve comparing random, base LLM, fine-tuned LLM, and heuristic on easy, medium, and hard tasks](https://raw.githubusercontent.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/main/artifacts/reward_curve.png)

One picture, four policies, three difficulty tiers. Random is the floor. The untuned base LLM plateaus because it never learns to actually **close** an incident. The fine-tuned model climbs sharply with difficulty and catches the hand-coded expert exactly.

---

## 2. Why this is a real RL problem (three themes in one environment)

Most RL environments for LLMs are single-agent, single-step, or turn-based games. Real enterprise work is none of those. This environment deliberately tests all three Round-2 hackathon themes simultaneously:

| Hackathon theme | How this environment satisfies it |
|---|---|
| **🀝 #1 β€” Multi-Agent Interactions** | Three distinct specialist roles with **non-overlapping permissions**. Acting out-of-role triggers a `wrong_actor_penalty` (βˆ’0.08). Correct handoffs earn `+0.15`. Collaboration is **trained**, not hard-coded. |
| **⏱️ #2 β€” Long-Horizon Planning** | Each episode carries **3–5 sequential incidents**, 20–60 steps apiece, under a single ticking SLA clock. The big reward (+0.80 Γ— tier) only fires after clues β†’ fix β†’ post-mortem. **Sparse and delayed by design** β€” the 20-step credit-assignment problem is the whole point. |
| **🏒 #3 β€” World Modeling / Professional Tasks** | Incidents carry **real logs, metrics, KB articles, red-herring signals**, and **business metadata** (customer tier, affected users, $/min revenue impact). Closure rewards scale by tier (free Γ—0.6 Β· standard Γ—1.0 Β· premium Γ—1.4 Β· **enterprise Γ—1.8**), and wrong closures are punished the same way. Close an enterprise ticket incorrectly and it hurts **~3Γ— what a free-tier one does**. |

---

## 3. What the environment looks like under the hood

The environment runs as a standard OpenEnv FastAPI server β€” same Gym-style `reset / step` contract, same Pydantic observation/action schemas, same Docker image format for Hugging Face Spaces.

### Observation (partial)

```json
{
  "incident_id": "inc-cert-expiry",
  "incident_title": "mTLS cert expired β€” all microservices throwing 500s",
  "incident_description": "Alerting fired at 03:12 UTC ...",
  "customer_tier": "enterprise",
  "affected_users_estimate": 140000,
  "revenue_impact_usd_per_min": 4800,
  "postmortem_required": true,
  "visible_signals": ["mtls handshake errors", "5xx spike in checkout"],
  "investigation_targets": {
    "logs": ["cert-manager", "auth-service"],
    "metrics": ["dash-mesh", "dash-auth"],
    "kb": ["kb-mtls-chain", "kb-cert-rotation"]
  },
  "allowed_actors_by_action": {
    "apply_fix": ["investigator_agent"],
    "close_incident": ["ops_manager_agent"]
  },
  "budget_remaining": 18,
  "sla_minutes_remaining": 40,
  "clues_found": 2,
  "mitigation_applied": false,
  "reward_components": {"step_cost": -0.04, "clue_bonus": +0.12}
}
```

### Action space

| action_type | Typical actor | Purpose |
|---|---|---|
| `inspect_logs` / `inspect_metrics` / `consult_kb` | triage / investigator | Gather clues (reward shapes here) |
| `negotiate_handoff` | ops_manager | Route to correct owner |
| `apply_fix` | investigator | Apply mitigation (scored vs ground truth) |
| `rollback` | investigator | Revert last change |
| `escalate` | ops_manager | Engage senior staff |
| `submit_postmortem` | ops_manager | Required on tier-1 / high-revenue incidents |
| `close_incident` | ops_manager | Terminal action β€” final score depends on clues found + mitigation quality + post-mortem + speed |

### Reward rubric (composable, not monolithic)

The reward engine emits **named components** at every step so training curves β€” and judges β€” can see *exactly where reward came from*:

| Component | When it fires | Sign |
|---|---|---|
| `step_cost` | Every action (βˆ’0.01 to βˆ’0.08 by action type) | βˆ’ |
| `clue_bonus` | Unique log/metric/KB lookup that surfaces a real fact | **+** |
| `handoff_correct` / `handoff_wrong` | Ops manager routes to allowed / disallowed owner | Β± |
| `mitigation_correct` / `mitigation_wrong` / `mitigation_empty` | Fix matches / contradicts / omits ground-truth keywords | Β± |
| `rollback_effective` / `rollback_ineffective` | Rollback summary matches the incident's accepted playbook | Β± |
| `escalation_needed` / `escalation_not_needed` | Escalation raised for an incident that actually warrants it | Β± |
| `closure_correct` / `closure_wrong` | Final close decision matches incident state | Β± (scaled by customer tier) |
| `closure_mitigation_bonus` | Close after a correct mitigation | **+** |
| `closure_under_investigated` | Close without enough clues found | βˆ’ |
| `speed_bonus` | Close in ≀ 7 / ≀ 4 steps | **+** |
| `postmortem_bonus` / `postmortem_missing` | Post-mortem filed / skipped on a high-impact incident | Β± |
| `repeated_lookup_penalty` | Re-querying the same log/metric/KB | βˆ’ |
| `wrong_actor_penalty` | Action invoked by a role that's not authorised | βˆ’ |
| `invalid_action` | Unrecognised `action_type` | βˆ’ |
| `sla_exhausted` / `budget_exhausted` | Terminal penalty when SLA / action budget hits zero | βˆ’ |

**Anti-gaming:** closing early with zero clues is penalised; spamming cheap `inspect_logs` racks up `repeated_lookup_penalty`; triggering `apply_fix` without investigator permissions gives `wrong_actor_penalty`. A policy cannot shortcut its way to a high score.

---

## 4. Training: HF TRL SFT on heuristic rollouts

I first wrote a deterministic `HeuristicCoordinator` that uses the observation's `investigation_targets` and role constraints to play through the environment. On hard tasks it earns **+5.89** reward where random scores **βˆ’12.50** β€” so that gives us ~680 `(prompt, completion)` pairs of "good" behavior to imitate.

Training script: [`train_trl.py`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/train_trl.py). One command on Colab T4 (or **[open the reproducible notebook β†—](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)**) runs the entire pipeline:

```python
os.environ["BASE_MODEL"]         = "Qwen/Qwen2.5-1.5B-Instruct"
os.environ["EPISODES_PER_TASK"]  = "8"
os.environ["TRAIN_EPOCHS"]       = "3"
os.environ["EVAL_LLM_MODELS"]    = "true"
os.environ["MAX_LLM_EVAL_STEPS"] = "120"
!python train_trl.py
```

The script:

1. Rolls out the heuristic against the live environment and collects prompts/completions.
2. Runs TRL `SFTTrainer` with a single `text` column (chat-template applied).
3. Saves the fine-tuned checkpoint to `artifacts/sft_model/`.
4. Rolls out **four** policies under identical seeds β€” random, heuristic, base LLM, fine-tuned LLM.
5. Writes `reward_curve.png`, `training_curve.png`, `reward_components.png`, `summary_metrics.json`, and `training_log.json`.

### Training loss + token accuracy

![SFT training loss dropping from ~2.84 to ~0.02 and token accuracy climbing from ~0.49 to ~0.99 over 3 epochs](https://raw.githubusercontent.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/main/artifacts/training_curve.png)

Loss drops from **~2.84 β†’ ~0.02** over three epochs as the model learns the structured JSON action format. Mean token accuracy climbs from **~0.49 β†’ ~0.99**. Satisfies the hackathon "loss AND reward plots" minimum requirement.

### Four-policy reward comparison

![Reward curve comparing random, base LLM, fine-tuned LLM, and heuristic on easy, medium, and hard tasks](https://raw.githubusercontent.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/main/artifacts/reward_curve.png)

| Task | Random | Base LLM | **Fine-tuned (SFT)** | Heuristic |
|---|---:|---:|---:|---:|
| easy | βˆ’5.96 | βˆ’2.92 | **βˆ’4.72** | βˆ’4.72 |
| medium | βˆ’11.48 | βˆ’4.00 | **βˆ’0.87** | βˆ’0.87 |
| hard | βˆ’12.50 | βˆ’4.28 | **+5.89** | +5.89 |

**Fine-tuned vs untrained base: +10.17 reward delta on hard-difficulty incidents.**

- **Random** is the floor on every task.
- **Base LLM** already beats random on easy because it produces well-formed-ish JSON β€” but it never closes a single incident, so it just racks up step-costs and SLA penalties.
- **Fine-tuned LLM** catches the heuristic teacher exactly. The environment is deterministic and SFT hit ~0.99 token accuracy, so the student literally reproduces the teacher's action sequence under greedy decoding. This is **imitation learning converging to the expert** β€” the meaningful headline number is therefore **SFT vs base**, not SFT vs heuristic.

### Reward sources β€” what each policy actually earns

![Stacked-bar chart showing where each policy earns or loses reward, broken down by rubric component](https://raw.githubusercontent.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/main/artifacts/reward_components.png)

This is the chart I'm proudest of, because it makes the training signal **legible**. Summed across all three tasks:

- **Random** bleeds out: `closure_wrong: βˆ’17.82` Β· `wrong_actor_penalty: βˆ’3.12` Β· `mitigation_wrong: βˆ’2.10`.
- **Base LLM** earns `clue_bonus: +0.24` but then gets crushed by `step_cost: βˆ’5.16` and `sla_exhausted: βˆ’5.04`. It **never fires a single positive closure component**.
- **Fine-tuned LLM** unlocks the high-value positive components the base never sees: `closure_correct: +7.36` Β· `mitigation_correct: +2.10` Β· `closure_mitigation_bonus: +1.80` Β· `postmortem_bonus: +0.60` Β· `handoff_correct: +0.75` Β· `speed_bonus: +0.60`.

**Training has moved the LLM from "bleeding" to "solving."**

---

## 5. Why does SFT exactly match the heuristic?

Honest framing matters. The environment is deterministic (same task β†’ same incidents β†’ same observations β†’ same seeds). The heuristic coordinator is also deterministic (same observation β†’ same action). So every rollout of a given task produces a byte-identical trajectory. Our 680-row dataset contains only ~85 *unique* `(observation, action)` pairs, each duplicated for redundancy. At ~0.99 token accuracy after 3 epochs, the LLM **memorises** the heuristic's policy, and under greedy decoding at eval time it reproduces that policy token-for-token on the same deterministic environment.

> **This is the defining success condition for behavior cloning: the student has become the teacher.**

The gap we can legitimately celebrate is therefore **SFT vs the untrained base model**, where:

- On **hard incidents**, SFT earns **+10.17** more reward than base.
- SFT **unlocks** reward components (`closure_correct`, `mitigation_correct`, `postmortem_bonus`) that the base model literally never fires.
- On easy tasks, SFT inherits the teacher's known weakness (easy tasks have tight SLA budgets that punish thorough investigation). This is exactly what imitation learning should do β€” including the teacher's mistakes.

The obvious next step to go **beyond** the heuristic ceiling is RL with the environment's native reward signal β€” GRPO or PPO against the same rubric β€” which is the natural Round 3 work.

---

## 6. The surprise finding β€” scale is the story

I ran the exact same pipeline with the smaller **Qwen2.5-0.5B-Instruct** backbone (same environment, same seeds, same heuristic teacher, same reward rubric). The story flips entirely:

![Reward curve for the 0.5B ablation β€” SFT barely improves over base and never closes a hard incident](https://raw.githubusercontent.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center/main/artifacts/reward_curve_qwen0p5b.png)

| Task | Random | Base **0.5B** | **SFT 0.5B** | Heuristic | SFT βˆ’ Base (0.5B) |
|---|---:|---:|---:|---:|---:|
| easy | βˆ’5.96 | βˆ’2.92 | **βˆ’2.49** | βˆ’4.72 | **+0.43** |
| medium | βˆ’11.48 | βˆ’4.00 | **βˆ’3.86** | βˆ’0.87 | **+0.14** |
| hard | βˆ’12.50 | βˆ’2.40 | **βˆ’2.40** | +5.89 | **+0.00** |

**The punchline:** with a 0.5B backbone, SFT delivers only a **+0.43 / +0.14 / +0.00** improvement over the base model and **never closes a single hard incident**. Bumping the backbone to **1.5B** β€” same SFT code, same data pipeline, same environment β€” unlocks a **βˆ’1.80 / +3.13 / +10.17** improvement and makes the LLM match the heuristic's component-for-component behavior on hard incidents.

| Run config | 0.5B | **1.5B (headline)** |
|---|---|---|
| Base model | Qwen2.5-0.5B-Instruct | Qwen2.5-1.5B-Instruct |
| Episodes / task (rollout) | 3 | 8 |
| Dataset rows | 255 | 680 |
| Train epochs | 1 | 3 |
| Base β†’ SFT improvement on **hard** | **+0.00** | **+10.17** |
| Hard incidents closed by SFT | **0** | **full heuristic behavior** |

**Interpretation:** at 0.5B the model is *too small* to absorb this multi-step, role-gated policy from SFT, even though it can emit syntactically valid JSON. At 1.5B the capacity suddenly becomes sufficient to internalise the full action schedule, and behavior cloning converges. **This is the kind of finding the environment is designed to surface β€” the composable rubric makes it visible in one plot, not hidden behind a single aggregate score.**

---

## 7. Everything you need to reproduce this

| | |
|---|---|
| **Live environment** | [swapnilpatil28-multi-agent-incident-command-center.hf.space](https://swapnilpatil28-multi-agent-incident-command-center.hf.space) (OpenEnv-compatible, Docker-backed) |
| **Training notebook** | [One-click Colab (T4, ~1 h 15 min end-to-end)](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing) |
| **Source + tests** | [GitHub repo (21 passing tests, Dockerfile with HEALTHCHECK)](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center) |
| **Full docs** | [README β€” Part 1 story + Part 2 technical deep-dive](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/README.md) |
| **Committed evidence** | [`artifacts/`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center/tree/main/artifacts) β€” all 4 PNGs + both JSON metric files |
| **Submission checklist** | [`docs/SUBMISSION_CHECKLIST.md`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center/blob/main/docs/SUBMISSION_CHECKLIST.md) |

---

## 8. What's next (Planned)

- **Replace SFT with GRPO or PPO** using the environment's native reward signal β€” no heuristic teacher, let the rubric itself shape the policy and push past the imitation ceiling.
- **Scale the incident catalog** from 13 templates to 50+ (drop in JSON-defined scenarios).
- **Add a second "adversarial" agent** that injects misleading signals to test robustness.

If you want to run it yourself, the Space and the repo are fully self-contained β€” `docker run` the image and point any OpenEnv-compatible client at it. Or just hit `/reset` and `/step` yourself from any language that can speak HTTP JSON.

---

*Built with β™₯ on [Meta OpenEnv](https://github.com/meta-pytorch/openenv) for the OpenEnv India 2026 Round 2 hackathon.*
*Code: [GitHub](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center) Β· Space: [HF Space](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center) Β· Training notebook: [Colab](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing).*