SwapnilPatil28 committed on
Commit
0a34397
·
verified ·
1 Parent(s): c3648b5

Final Update README.md

Files changed (1)
  1. README.md +98 -0
README.md CHANGED
@@ -23,6 +23,104 @@ tags:
23
 
24
  [![Tests](https://img.shields.io/badge/tests-21%20passing-brightgreen)](./tests) [![OpenEnv](https://img.shields.io/badge/OpenEnv-v0.2%2B-blue)](https://github.com/meta-pytorch/openenv) [![License](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE) ![Python](https://img.shields.io/badge/python-3.10%2B-blue)
25
26
  ### Live links
27
 
28
  | What | Where |
 
23
 
24
  [![Tests](https://img.shields.io/badge/tests-21%20passing-brightgreen)](./tests) [![OpenEnv](https://img.shields.io/badge/OpenEnv-v0.2%2B-blue)](https://github.com/meta-pytorch/openenv) [![License](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE) ![Python](https://img.shields.io/badge/python-3.10%2B-blue)
25
 
26
+ ---
27
+
28
+ ## Part 1: The story in 2 minutes
29
+
30
+ > **When a real tech company has an outage, three people's phones buzz at once.** A Triage engineer, an Investigator, and an Ops Manager have to cooperate under a ticking SLA clock while every extra action costs budget. **We built a simulator that teaches LLMs to do that job, and fine-tuned one that matches our hand-written expert policy on hard incidents.**
31
+
32
+ ### The problem, in one line
33
+
34
+ Real incident response isn't "pick the right label." It's **multi-agent, long-horizon, partially observable** teamwork, and it's exactly where general-purpose LLMs fall over. We built an OpenEnv simulator of a live tech-company war room so agents can *practice* the job end-to-end.
35
+
36
+ ### The environment, as a picture
37
+
38
+ A **virtual war room** where three specialist agents resolve a live queue of real-world tech incidents:
39
+
40
+ | Role | Can do | Cannot do |
41
+ |---|---|---|
42
+ | 🔍 **Triage agent** | Pull logs · check metrics · consult KB | Close a ticket |
43
+ | 🧪 **Investigator** | Apply a fix · roll back a deploy | Escalate or file a post-mortem |
44
+ | 👷 **Ops Manager** | Escalate · file post-mortem · **close the ticket** | Apply a code fix |
45
+
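The permission table above can be read as a lookup from role to allowed actions. The sketch below is a minimal illustration, not the repository's implementation: the action names (`pull_logs`, `apply_fix`, and so on) and the `check_actor` helper are hypothetical. Only the role split and the −0.08 wrong-actor penalty come from this README.

```python
from enum import Enum


class Role(Enum):
    TRIAGE = "triage"
    INVESTIGATOR = "investigator"
    OPS_MANAGER = "ops_manager"


# Hypothetical action names; the split between roles mirrors the table above.
ALLOWED_ACTIONS = {
    Role.TRIAGE: {"pull_logs", "check_metrics", "consult_kb"},
    Role.INVESTIGATOR: {"apply_fix", "rollback_deploy"},
    Role.OPS_MANAGER: {"escalate", "file_postmortem", "close_ticket"},
}

WRONG_ACTOR_PENALTY = -0.08  # value quoted in this README


def check_actor(role: Role, action: str) -> float:
    """Return 0.0 if the role may take the action, else the wrong-actor penalty."""
    return 0.0 if action in ALLOWED_ACTIONS[role] else WRONG_ACTOR_PENALTY


# A Triage agent trying to close a ticket is a wrong-actor call.
penalty = check_actor(Role.TRIAGE, "close_ticket")
```

Keeping the permissions in one table-like mapping makes it easy to audit that no role can both fix and close an incident on its own.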
46
+ **13 real incidents** · **3 difficulty tiers** (easy / medium / hard) · **14+ named reward signals** · **customer-tier weighting** (enterprise outages cost ~3× a free-tier outage)
47
+
48
+ > Wrong actor → **−0.08**. Wrong root-cause on an enterprise ticket → **−1.98**. Correct closure on an enterprise ticket → **+1.44**. The rules matter, and every step tells you *why* it was scored.
49
+
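The enterprise figures quoted above are consistent with a single tier multiplier applied to a base reward. The check below is a back-of-the-envelope sketch: +1.44, −1.98, the "+0.80 × tier" reward and the ~3× ratio all come from this README, while the multipliers (about 1.8 for enterprise, about 0.6 for free tier) and the unweighted −1.10 penalty are inferred by division. Treat them as assumptions, not documented constants, and note that it also assumes the "+0.80 × tier" reward is the closure reward.

```python
BASE_CLOSURE_REWARD = 0.80           # "+0.80 x tier" from this README
ENTERPRISE_CLOSURE = 1.44            # quoted enterprise closure reward
ENTERPRISE_WRONG_ROOT_CAUSE = -1.98  # quoted enterprise wrong-root-cause penalty

# Inferred (assumption): enterprise multiplier = 1.44 / 0.80
enterprise_multiplier = ENTERPRISE_CLOSURE / BASE_CLOSURE_REWARD

# Inferred (assumption): wrong-root-cause penalty before tier weighting
base_wrong_root_cause = ENTERPRISE_WRONG_ROOT_CAUSE / enterprise_multiplier

# Inferred (assumption): "~3x" means free tier is roughly a third of enterprise
free_tier_multiplier = enterprise_multiplier / 3

print(f"enterprise multiplier ~ {enterprise_multiplier:.2f}")
print(f"base wrong-root-cause penalty ~ {base_wrong_root_cause:.2f}")
print(f"free-tier multiplier ~ {free_tier_multiplier:.2f}")
```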
50
+ ### The headline result
51
+
52
+ One picture, four policies, three difficulty tiers:
53
+
54
+ ![Reward curve comparing random, base LLM, fine-tuned LLM, and heuristic across easy, medium, and hard tasks](./artifacts/reward_curve.png)
55
+
56
+ | Policy | What is it? | Hard-tier reward |
57
+ |---|---|---:|
58
+ | 🔴 **Random** | Picks an action uniformly at random | **−12.50** |
59
+ | 🟠 **Base Qwen2.5-1.5B** | Off-the-shelf LLM, **no fine-tuning** | **−4.28** |
60
+ | 🟢 **Our fine-tuned LLM** | Same model, SFT on 680 rollout examples | **+5.89** |
61
+ | 🔵 **Heuristic (oracle)** | Human-written "ideal" policy | **+5.89** |
62
+
63
+ > **The AI went from −4.28 → +5.89 on hard incidents, a +10.17 reward swing, and matched the human-written expert policy component-for-component.**
64
+
65
+ ### What did the agent actually learn?
66
+
67
+ Not "which label to pick." It learned **a whole workflow**, and the reward rubric makes that visible:
68
+
69
+ ![Stacked-bar chart showing where each policy earns or loses reward, broken down by rubric component](./artifacts/reward_components.png)
70
+
71
+ | Before fine-tuning 🟠 | After fine-tuning 🟢 |
72
+ |---|---|
73
+ | Only earns `clue_bonus` (+0.24) | Unlocks **`closure_correct +7.36`** · **`mitigation_correct +2.10`** · **`postmortem_bonus +0.60`** |
74
+ | Bleeds `step_cost` (−5.16) and `sla_exhausted` (−5.04) | Respects the SLA → **zero** `sla_exhausted` |
75
+ | Closes **0** incidents correctly | Closes incidents **like the expert does** |
76
+ | "Looks busy" but times out | Actually solves the problem |
77
+
78
+ ### How training went (the short version)
79
+
80
+ ![SFT loss dropping from ~2.84 to ~0.02 and token accuracy climbing from ~0.49 to ~0.99 over 3 epochs](./artifacts/training_curve.png)
81
+
82
+ | Step | What happened |
83
+ |---|---|
84
+ | 1. **Collect** | Run the expert heuristic over every incident → **680 rollout examples** (prompt = observation, completion = structured action) |
85
+ | 2. **Supervise** | TRL `SFTTrainer`, 3 epochs → loss **2.84 → 0.02**, token accuracy **0.49 → 0.99** |
86
+ | 3. **Evaluate** | Re-run random / heuristic / base-LLM / SFT-LLM under identical seeds |
87
+ | 4. **Plot** | Reward curve, training curve, reward-component breakdown, all committed to [`artifacts/`](./artifacts) |
88
+
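Step 1 pairs each observation with the expert's action. A minimal sketch of that record format follows, assuming a JSON-serialised action and hypothetical field names (`actor`, `action_type`, `target`). The observation strings are invented for illustration; the real observation text and action schema live in the repository and may differ.

```python
import json


def to_sft_example(observation: str, action: dict) -> dict:
    """Build one record: prompt = observation, completion = structured action."""
    return {
        "prompt": observation,
        # sort_keys keeps the serialised action stable across runs
        "completion": json.dumps(action, sort_keys=True),
    }


def collect(rollout) -> list:
    """Convert (observation, expert_action) pairs into SFT records."""
    return [to_sft_example(obs, act) for obs, act in rollout]


# Hypothetical two-step rollout from an expert heuristic.
demo_rollout = [
    (
        "Incident open. No clues gathered yet.",
        {"actor": "triage", "action_type": "pull_logs", "target": "service-a"},
    ),
    (
        "Logs point to a bad deploy on service-a.",
        {"actor": "investigator", "action_type": "rollback_deploy", "target": "service-a"},
    ),
]

examples = collect(demo_rollout)
```

Serialising the action as JSON gives the model a single, parseable output format to imitate, which is what makes the completion "structured".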
89
+ ### The surprise finding: size matters
90
+
91
+ Same pipeline, same data recipe, smaller backbone:
92
+
93
+ | Backbone | Dataset rows | Reward change, base → SFT (**hard**) | Hard incidents closed |
94
+ |---|---:|---:|---|
95
+ | Qwen2.5-**0.5B**-Instruct | 255 | **+0.00** | **0** |
96
+ | Qwen2.5-**1.5B**-Instruct | 680 | **+10.17** | full expert behavior |
97
+
98
+ > At **0.5B** the model is *too small* to absorb this multi-step, role-gated policy even with perfect supervision. At **1.5B** capacity is sufficient and behavior cloning converges. The rubric surfaces this; it's not hidden inside a single aggregate score.
99
+
100
+ ### Why this environment hits all three hackathon themes
101
+
102
+ | Theme | How we satisfy it |
103
+ |---|---|
104
+ | **#1 Multi-agent** | Three roles with **different permissions** who have to cooperate. Wrong-actor calls are punished (−0.08). Correct handoff is rewarded (+0.15). |
105
+ | **#2 Long-horizon** | Each episode runs **3–5 sequential incidents**, 20–60 steps each, under one ticking SLA clock. The big reward (+0.80 × tier) only fires after clues → fix → post-mortem. Sparse and delayed by design. |
106
+ | **#3 Professional world-model** | Real tech incidents with **logs, metrics, KB articles, red-herring signals, customer-tier revenue impact, SLA clocks**. Close an enterprise ticket wrong and it hurts ~3× what a free-tier one does. |
107
+
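The long-horizon row describes a reward that stays at zero until the full clues → fix → post-mortem sequence is done. One way to picture that gating is sketched below. The names `IncidentProgress` and `closure_reward` are hypothetical, and the tier multiplier is passed in rather than read from the environment. As a simplification, a premature close returns 0.0 here, whereas the real environment may penalise it.

```python
from dataclasses import dataclass

BASE_CLOSURE_REWARD = 0.80  # "+0.80 x tier" from this README


@dataclass
class IncidentProgress:
    """Hypothetical per-incident flags for the three prerequisite stages."""
    clues_gathered: bool = False
    fix_applied: bool = False
    postmortem_filed: bool = False


def closure_reward(progress: IncidentProgress, tier_multiplier: float) -> float:
    """Pay the tier-weighted closure reward only after all stages are complete."""
    ready = (
        progress.clues_gathered
        and progress.fix_applied
        and progress.postmortem_filed
    )
    return BASE_CLOSURE_REWARD * tier_multiplier if ready else 0.0


# Closing after only gathering clues and applying a fix earns nothing yet.
partial = closure_reward(IncidentProgress(True, True, False), tier_multiplier=1.0)
# Closing after the full sequence pays out.
full = closure_reward(IncidentProgress(True, True, True), tier_multiplier=1.0)
```

Because the payout depends on three earlier steps, an agent that closes early sees no signal, which is what makes the reward sparse and delayed.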
108
+ ### Try it in 30 seconds
109
+
110
+ | | |
111
+ |---|---|
112
+ | 🟒 **Live environment** | **[Open the dashboard β†—](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)** |
113
+ | πŸ’» **Source code** | **[GitHub repo β†—](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center)** |
114
+ | πŸŽ“ **Reproduce the training** | **[One-click Colab notebook β†—](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** |
115
+ | πŸ“Ί **2-minute video walkthrough** | *Coming soon β€” shot list in [`docs/VIDEO_SCRIPT.md`](./docs/VIDEO_SCRIPT.md)* |
116
+ | πŸ“ **Mini blog post** | *Coming soon β€” full draft in [`docs/BLOG_POST.md`](./docs/BLOG_POST.md)* |
117
+
118
+ > Want the rubric math, architecture, full numbers, configuration, and the hackathon checklist? Keep scrolling: **Part 2** is the full technical README.
119
+
120
+ ---
121
+
122
+ ## Part 2: Technical deep dive
123
+
124
  ### Live links
125
 
126
  | What | Where |