Update BLOG.md
BLOG.md
CHANGED
@@ -6,7 +6,8 @@
 
 > **TL;DR.** We built `incident_env`, an OpenEnv POMDP where an LLM agent has to diagnose a live, evolving production incident and then attribute it to a specific commit in a small repo. Then we trained **Qwen2.5-7B-Instruct** through a four-stage curriculum (baseline rollouts → LoRA SFT → online GRPO with `r_cross` → merge). The post-trained model reaches a **mean cumulative reward of ≈1.59 vs ≈0.49** for the base, **in less than half the steps**, with tighter variance and a dominant CDF across the operating range.
 
-
+
+
 
 > **One-line pitch.** *Most agent benchmarks freeze a repo and ask the model to fix it. Our environment refuses to sit still: memory climbs, alerts cascade, and the obvious symptom is almost never the cause.*
 
@@ -49,7 +50,8 @@ Each service has live metric history (CPU, memory, p50/p95/p99 latency, error ra
 
 ### The agent loop
 
-
+
+
 
 Per-step execution is `validate → mutate → tick → observe → reward`. Two facts make the loop interesting:
 
@@ -196,42 +198,7 @@ In English: *how much did Phase 1's investigation actually help the code agent v
 
 ---
 
-## 5 ·
-
-```mermaid
-flowchart TD
-Tasks["10 scenarios"] --> Easy["memory_leak"]
-Tasks --> Med["cascading_failure"]
-Tasks --> Hard["distributed_deadlock"]
-Tasks --> A1["aliased_fault"]
-Tasks --> A2["severity_inversion"]
-Tasks --> A3["confidence_inversion"]
-Tasks --> A4["info_ordering"]
-Tasks --> A5["circuit_breaker_noop · no-change"]
-Tasks --> H1["heldout_aliased_severity · compound"]
-Tasks --> H2["heldout_confidence_ordering · compound"]
-
-H1 --> PoolD
-H2 --> PoolD
-
-Easy --> PoolA & PoolB & PoolC
-Med --> PoolA & PoolB & PoolC
-Hard --> PoolA & PoolB & PoolC
-A1 --> PoolA & PoolB & PoolC
-A2 --> PoolA & PoolB & PoolC
-A3 --> PoolA & PoolB & PoolC
-A4 --> PoolA & PoolB & PoolC
-A5 --> PoolA & PoolC
-
-PoolA["Pool A · p1_only · ops bootstrap"]
-PoolB["Pool B · p2_only · code bootstrap with oracle handoff"]
-PoolC["Pool C · joint · full P1→P2 with r_cross"]
-PoolD["Pool D · joint · held-out compounds"]
-```
-
-> **Pool D is the integrity check.** Each component fault family appears during training, but the *combinations* never do. This is what answers "did the agent learn a strategy or memorise scenario fingerprints?"
-
-### Scenario flavours
+## 5 · Scenario flavours
 
 | Task | Hidden lesson |
 | --- | --- |
@@ -380,7 +347,9 @@ The output is a vanilla causal LM that vLLM, TGI, or plain `transformers` can lo
 
 ### Figure 1: Reward distribution (CDF)
 
-
+
+
+
 
 > *Empirical CDF of cumulative reward: lower curve = better (more probability mass at high reward).*
 
@@ -390,7 +359,8 @@ The output is a vanilla causal LM that vLLM, TGI, or plain `transformers` can lo
 
 ### Figure 2: Efficiency curve (reward vs. steps)
 
-
+
+
 
 | Model | Mean reward by ~30 steps | Steps to plateau | σ at plateau |
 | --- | --- | --- | --- |
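
The Figure 1 caption in the diff above reads "lower curve = better", which can look backwards at first glance. A minimal sketch of why it holds for an empirical CDF follows. The reward lists are invented for illustration and are not results from the post, and `empirical_cdf` is a hypothetical helper, not part of `incident_env`.

```python
from bisect import bisect_right


def empirical_cdf(samples, x):
    """Return the fraction of samples whose value is <= x."""
    ordered = sorted(samples)
    return bisect_right(ordered, x) / len(ordered)


# Hypothetical cumulative rewards, made up purely for illustration.
base_rewards = [0.1, 0.3, 0.5, 0.6, 0.9]
tuned_rewards = [1.0, 1.4, 1.6, 1.8, 2.1]

# Evaluate both CDFs on a shared grid of reward thresholds.
grid = [0.5, 1.0, 1.5, 2.0]
base_curve = [empirical_cdf(base_rewards, x) for x in grid]
tuned_curve = [empirical_cdf(tuned_rewards, x) for x in grid]

# A curve that is lower at a threshold has less mass at or below it,
# so more of its episodes scored above that threshold.
tuned_dominates = all(t <= b for t, b in zip(tuned_curve, base_curve))
```

With these made-up samples the tuned curve sits at or below the base curve at every grid point, which is what a "dominant CDF" means on such a plot.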