srinjoyd committed on
Commit 5e99fd1 · verified · 1 Parent(s): 59d92e0

Update BLOG.md

Files changed (1)
  1. BLOG.md +10 -40
BLOG.md CHANGED
@@ -6,7 +6,8 @@
 
  > **TL;DR.** We built `incident_env`, an OpenEnv POMDP where an LLM agent has to diagnose a live, evolving production incident and then attribute it to a specific commit in a small repo. Then we trained **Qwen2.5-7B-Instruct** through a four-stage curriculum (baseline rollouts → LoRA SFT → online GRPO with `r_cross` → merge). The post-trained model reaches a **mean cumulative reward of ≈1.59 vs ≈0.49** for the base, **at less than half the steps**, with tighter variance and dominant CDF across the operating range.
 
- ![SRE Triage Bot training pipeline](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/blob/main/assets/pipeline.svg)
+
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/bNlv5ywRBRCj3Al1BKi8R.png)
 
  > 🧭 **One-line pitch.** *Most agent benchmarks freeze a repo and ask the model to fix it. Our environment refuses to sit still: memory climbs, alerts cascade, and the obvious symptom is almost never the cause.*
 
@@ -49,7 +50,8 @@ Each service has live metric history (CPU, memory, p50/p95/p99 latency, error ra
 
  ### The agent loop
 
- ![Agent loop POMDP](./assets/agent_loop.svg)
+
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/sLaYeQmysnBDQw-VcmYsp.png)
 
  Per-step execution is `validate → mutate → tick → observe → reward`. Two facts make the loop interesting:
 
@@ -196,42 +198,7 @@ In English: *how much did Phase 1's investigation actually help the code agent v
 
  ---
 
- ## 5 · Tasks and pools
-
- ```mermaid
- flowchart TD
- Tasks["10 scenarios"] --> Easy["memory_leak"]
- Tasks --> Med["cascading_failure"]
- Tasks --> Hard["distributed_deadlock"]
- Tasks --> A1["aliased_fault"]
- Tasks --> A2["severity_inversion"]
- Tasks --> A3["confidence_inversion"]
- Tasks --> A4["info_ordering"]
- Tasks --> A5["circuit_breaker_noop · no-change"]
- Tasks --> H1["heldout_aliased_severity · compound"]
- Tasks --> H2["heldout_confidence_ordering · compound"]
-
- H1 --> PoolD
- H2 --> PoolD
-
- Easy --> PoolA & PoolB & PoolC
- Med --> PoolA & PoolB & PoolC
- Hard --> PoolA & PoolB & PoolC
- A1 --> PoolA & PoolB & PoolC
- A2 --> PoolA & PoolB & PoolC
- A3 --> PoolA & PoolB & PoolC
- A4 --> PoolA & PoolB & PoolC
- A5 --> PoolA & PoolC
-
- PoolA["Pool A · p1_only · ops bootstrap"]
- PoolB["Pool B · p2_only · code bootstrap with oracle handoff"]
- PoolC["Pool C · joint · full P1→P2 with r_cross"]
- PoolD["Pool D · joint · held-out compounds"]
- ```
-
- > ✦ **Pool D is the integrity check.** Each component fault family appears during training, but the *combinations* never do. This is what answers "did the agent learn a strategy or memorise scenario fingerprints?"
-
- ### Scenario flavours
+ ## 5 · Scenario flavours
 
  | Task | Hidden lesson |
  | --- | --- |
@@ -380,7 +347,9 @@ The output is a vanilla causal LM that vLLM, TGI, or plain `transformers` can lo
 
  ### Figure 1: Reward distribution (CDF)
 
- ![Reward CDF per source: Baseline, SFT, Posttrained RL](./assets/cdf.png)
+
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/N79LUO_eo8nExgK5xhArc.png)
+
 
  > *Empirical CDF of cumulative reward: lower curve = better (more probability mass at high reward).*
 
@@ -390,7 +359,8 @@ The output is a vanilla causal LM that vLLM, TGI, or plain `transformers` can lo
 
  ### Figure 2: Efficiency curve (reward vs. steps)
 
- ![Efficiency curve: mean cumulative reward vs total steps (P1 + P2)](./assets/efficiency.png)
+
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/zTQyShUBp5jZ76Z_6rA_-.png)
 
  | Model | Mean reward by ~30 steps | Steps to plateau | σ at plateau |
  | --- | --- | --- | --- |