Final Update README.md
README.md
@@ -23,6 +23,104 @@ tags:
[](./tests) [](https://github.com/meta-pytorch/openenv) [](./LICENSE)

---

## Part 1: The story in 2 minutes

> **When a real tech company has an outage, three people's phones buzz at once.** A Triage engineer, an Investigator, and an Ops Manager have to cooperate under a ticking SLA clock while every extra action costs budget. **We built a simulator that teaches LLMs to do that job, and fine-tuned one that matches the human-written expert policy on reward.**

### The problem, in one line

Real incident response isn't "pick the right label." It's **multi-agent, long-horizon, partially observable** teamwork, and it's exactly where general-purpose LLMs fall over. We built an OpenEnv simulator of a live tech-company war room so agents can *practice* the job, end-to-end.

### The environment, as a picture

A **virtual war room** where three specialist agents resolve a live queue of real-world tech incidents:

| Role | Can do | Cannot do |
|---|---|---|
| **Triage agent** | Pull logs · check metrics · consult KB | Close a ticket |
| **Investigator** | Apply a fix · roll back a deploy | Escalate or file a post-mortem |
| **Ops Manager** | Escalate · file post-mortem · **close the ticket** | Apply a code fix |

**13 real incidents** · **3 difficulty tiers** (easy / medium / hard) · **14+ named reward signals** · **customer-tier weighting** (enterprise outages cost ~3× a free-tier outage)

> Wrong actor → **−0.08**. Wrong root-cause on an enterprise ticket → **−1.98**. Correct closure on an enterprise ticket → **+1.44**. The rules matter, and every step tells you *why* it was scored.
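
The role gating in the table above can be sketched as a simple permission lookup. This is an illustration only, not the environment's actual code: the action identifiers (`pull_logs`, `apply_fix`, and so on) and the `wrong_actor_penalty` helper are hypothetical names. Only the role split and the −0.08 wrong-actor penalty come from this README.

```python
# Hypothetical action names; the real environment's identifiers may differ.
ALLOWED_ACTIONS = {
    "triage": {"pull_logs", "check_metrics", "consult_kb"},
    "investigator": {"apply_fix", "rollback_deploy"},
    "ops_manager": {"escalate", "file_postmortem", "close_ticket"},
}

WRONG_ACTOR_PENALTY = -0.08  # value quoted in this README


def wrong_actor_penalty(role: str, action: str) -> float:
    """Return the wrong-actor penalty if `role` may not take `action`, else 0.0."""
    if action in ALLOWED_ACTIONS.get(role, set()):
        return 0.0
    return WRONG_ACTOR_PENALTY


if __name__ == "__main__":
    # Triage may pull logs, but only the Ops Manager may close a ticket.
    print(wrong_actor_penalty("triage", "pull_logs"))
    print(wrong_actor_penalty("triage", "close_ticket"))
```

Keeping permissions in one table makes the "Cannot do" column a direct consequence of the "Can do" column, so the two cannot drift apart.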

### The headline result

One picture, four policies, three difficulty tiers:

| Policy | What is it? | Hard-tier reward |
|---|---|---:|
| **Random** | Picks an action uniformly at random | **−12.50** |
| **Base Qwen2.5-1.5B** | Off-the-shelf LLM, **no fine-tuning** | **−4.28** |
| **Our fine-tuned LLM** | Same model, SFT on 680 rollout examples | **+5.89** |
| **Heuristic (oracle)** | Human-written "ideal" policy | **+5.89** |

> **The fine-tuned model went from −4.28 → +5.89 on hard incidents, a +10.17 reward swing, and matched the human-written expert policy component-for-component.**
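
The swing quoted above is plain subtraction over the hard-tier column. A small sketch that recomputes it from the table values; the dictionary keys and the `swing` helper are illustrative names, not identifiers from the repository.

```python
# Hard-tier rewards copied from the table above; keys are illustrative labels.
HARD_TIER_REWARD = {
    "random": -12.50,
    "base": -4.28,
    "sft": 5.89,
    "heuristic": 5.89,
}


def swing(before: str, after: str) -> float:
    """Reward difference between two policies, rounded to two decimals."""
    return round(HARD_TIER_REWARD[after] - HARD_TIER_REWARD[before], 2)


if __name__ == "__main__":
    print("base -> sft:", swing("base", "sft"))
    print("sft vs heuristic gap:", swing("heuristic", "sft"))
```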

### What did the agent actually learn?

Not "which label to pick." It learned **a whole workflow**, and the reward rubric makes that visible:

| Before fine-tuning | After fine-tuning |
|---|---|
| Only earns `clue_bonus` (+0.24) | Unlocks **`closure_correct +7.36`** · **`mitigation_correct +2.10`** · **`postmortem_bonus +0.60`** |
| Bleeds `step_cost` (−5.16) and `sla_exhausted` (−5.04) | Respects the SLA: **zero** `sla_exhausted` |
| Closes **0** incidents correctly | Closes incidents **like the expert does** |
| "Looks busy" but times out | Actually solves the problem |

### How training went (the short version)

| Step | What happened |
|---|---|
| 1. **Collect** | Run the expert heuristic over every incident → **680 rollout examples** (prompt = observation, completion = structured action) |
| 2. **Supervise** | TRL `SFTTrainer`, 3 epochs: loss **2.84 → 0.02**, token accuracy **0.49 → 0.99** |
| 3. **Evaluate** | Re-run random / heuristic / base-LLM / SFT-LLM under identical seeds |
| 4. **Plot** | Reward curve, training curve, reward-component breakdown, all committed to [`artifacts/`](./artifacts) |
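
The Collect step pairs each observation with the action the expert takes in it. A minimal sketch of that pairing, under stated assumptions: `expert_policy` here is a toy stand-in rather than the repository's heuristic, the observation fields (`clues_found`, `fix_applied`) are hypothetical, and JSON is only one plausible way to serialize the structured action.

```python
import json


def expert_policy(observation: dict) -> dict:
    """Toy stand-in for the expert heuristic (hypothetical logic and fields)."""
    if not observation["clues_found"]:
        return {"actor": "triage", "action": "pull_logs"}
    if not observation["fix_applied"]:
        return {"actor": "investigator", "action": "apply_fix"}
    return {"actor": "ops_manager", "action": "close_ticket"}


def to_sft_example(observation: dict) -> dict:
    """Pair one observation with the expert's action as prompt/completion text."""
    action = expert_policy(observation)
    return {
        "prompt": json.dumps(observation, sort_keys=True),
        "completion": json.dumps(action, sort_keys=True),
    }


# Three made-up observations tracing one incident from first look to closure.
observations = [
    {"incident_id": "INC-1", "clues_found": False, "fix_applied": False},
    {"incident_id": "INC-1", "clues_found": True, "fix_applied": False},
    {"incident_id": "INC-1", "clues_found": True, "fix_applied": True},
]
dataset = [to_sft_example(obs) for obs in observations]

if __name__ == "__main__":
    for row in dataset:
        print(row["prompt"], "=>", row["completion"])
```

Rows shaped like this (a `prompt` string and a `completion` string) could then be handed to a supervised fine-tuning trainer; how the repository actually formats its 680 examples is not shown here.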

### The surprise finding: size matters

Same pipeline, same data recipe, smaller backbone:

| Backbone | Dataset rows | Reward change, base → SFT, on **hard** | Hard incidents closed |
|---|---:|---:|---|
| Qwen2.5-**0.5B**-Instruct | 255 | **+0.00** | **0** |
| Qwen2.5-**1.5B**-Instruct | 680 | **+10.17** | full expert behavior |

> At **0.5B** the model is *too small* to absorb this multi-step, role-gated policy even with perfect supervision. At **1.5B** capacity is suddenly sufficient and behavior cloning converges. The rubric surfaces this; it's not hidden inside a single aggregate score.

### Why this environment hits all three hackathon themes

| Theme | How we satisfy it |
|---|---|
| **#1 Multi-agent** | Three roles with **different permissions** who have to cooperate. Wrong-actor calls are punished (−0.08). Correct handoff is rewarded (+0.15). |
| **#2 Long-horizon** | Each episode runs **3–5 sequential incidents**, 20–60 steps each, under one ticking SLA clock. The big reward (+0.80 × tier) only fires after clues → fix → post-mortem. Sparse and delayed by design. |
| **#3 Professional world-model** | Real tech incidents with **logs, metrics, KB articles, red-herring signals, customer-tier revenue impact, SLA clocks**. Close an enterprise ticket wrong and it hurts ~3× what a free-tier one does. |
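
The tier weighting can be checked against the numbers quoted in Part 1. The sketch below back-solves the enterprise multiplier from "+0.80 × tier" and the +1.44 enterprise closure reward. Everything derived from that is an assumption rather than documented behavior: that the same multiplier scales the wrong-root-cause penalty, and that "~3×" means the free tier gets one third of the enterprise multiplier.

```python
CLOSURE_BASE = 0.80                  # "+0.80 x tier" from this README
ENTERPRISE_CLOSURE = 1.44            # quoted enterprise closure reward
WRONG_ROOT_CAUSE_ENTERPRISE = -1.98  # quoted enterprise wrong-root-cause penalty

# Implied, not documented: the multiplier that maps 0.80 to 1.44.
enterprise_multiplier = ENTERPRISE_CLOSURE / CLOSURE_BASE

# Assumption: the same multiplier scales the wrong-root-cause penalty.
implied_wrong_root_cause_base = WRONG_ROOT_CAUSE_ENTERPRISE / enterprise_multiplier

# Assumption: "~3x a free-tier outage" means free tier = enterprise / 3.
free_tier_multiplier = enterprise_multiplier / 3


def closure_reward(tier_multiplier: float) -> float:
    """Correct-closure reward for a given tier multiplier, rounded to cents."""
    return round(CLOSURE_BASE * tier_multiplier, 2)


if __name__ == "__main__":
    print("enterprise multiplier:", round(enterprise_multiplier, 2))
    print("implied base penalty:", round(implied_wrong_root_cause_base, 2))
    print("free-tier closure:", closure_reward(free_tier_multiplier))
```

If the real multipliers differ, only the constants at the top need to change; Part 2 remains the authority on the rubric math.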

### Try it in 30 seconds

| | |
|---|---|
| **Live environment** | **[Open the dashboard →](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)** |
| **Source code** | **[GitHub repo →](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
| **Reproduce the training** | **[One-click Colab notebook →](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
| **2-minute video walkthrough** | *Coming soon; shot list in [`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md)* |
| **Mini blog post** | *Coming soon; full draft in [`docs/BLOG_POST.md`](./docs/BLOG_POST.md)* |

> Want the rubric math, architecture, full numbers, configuration, and the hackathon checklist? Keep scrolling: **Part 2** is the full technical README.

---

## Part 2: Technical deep dive

### Live links

| What | Where |