# Teaching a 7B Model to Be On-Call

### An OpenEnv benchmark and a four-stage GRPO pipeline that turns Qwen2.5-7B into a working SRE triage agent

---

> **TL;DR.** We built `incident_env`, an OpenEnv POMDP in which an LLM agent has to diagnose a live, evolving production incident and then attribute it to a specific commit in a small repo. We then trained **Qwen2.5-7B-Instruct** through a four-stage curriculum (baseline rollouts → LoRA SFT → online GRPO with `r_cross` → merge). The post-trained model reaches a **mean cumulative reward of ≈1.59 vs ≈0.49** for the base **in less than half the steps**, with tighter variance and a CDF that dominates the base model's across the operating range.


![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/bNlv5ywRBRCj3Al1BKi8R.png)

> 🧭 **One-line pitch.** *Most agent benchmarks freeze a repo and ask the model to fix it. Our environment refuses to sit still: memory climbs, alerts cascade, and the obvious symptom is almost never the cause.*

---

## 1 · Why this benchmark didn't exist yet

Pick any list of agentic LLM benchmarks today and you'll see two clusters:

| Cluster | Examples | What they miss |
| --- | --- | --- |
| **Frozen-repo coding** | SWE-bench, RepoBench, HumanEval | No evolving system, no observability, no alerts |
| **Tool-use chains** | AgentBench, ToolBench, τ-bench | Plenty of API calls, but no reactive simulator |

Neither cluster matches the workflow that consumes the most engineer-hours at any company running real systems: **on-call triage**. A pager fires. A graph is wrong. Three services look broken but only one *is* broken. Someone has to triangulate, propose a fix, and identify the offending commit, all under SLA pressure and with partial information.

That gap is exactly what `incident_env` fills.

> ✦ **Capability gap.** Today's LLMs can read a static repo. They cannot yet diagnose a system whose state changes while they're looking at it.

---

## 2 · Environment at a glance

`incident_env` is an OpenEnv `Environment`: a clean Gym-style `reset()` / `step()` / `state` interface plus a `/score` endpoint for the oracle-independent grader. Under the hood it is a **reactive, partially-observable, two-phase** simulator.

### Topology: seven reactive services

```
     ┌─────────┐    ┌─────┐    ┌────────┐    ┌─────────┐
     │ API GW  │───▶│Auth │───▶│ Orders │───▶│ Payment │
     └────┬────┘    └─────┘    └───┬────┘    └────┬────┘
          ▼                        ▼              ▼
     ┌─────────┐              ┌─────────┐    ┌─────────┐
     │  Cache  │              │   DB    │    │  Queue  │
     └─────────┘              └─────────┘    └─────────┘
```

Each service has live metric history (CPU, memory, p50/p95/p99 latency, error rate, RPS), structured logs, deploy history, and a `healthy | degraded | down` status. Faults propagate along this graph each `tick()`. Restarting a downstream service buys minutes; rolling back the wrong deploy makes things worse.
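
To make the propagation rule concrete, here is a minimal sketch of one way a `tick()` could spread degradation along the dependency graph. The edges are read off the diagram above; the snake_case service names, the one-hop-per-tick rule, and the status-downgrade logic are illustrative assumptions, not the environment's actual implementation.

```python
# Hypothetical sketch: one-hop-per-tick fault propagation (not the real simulator).
# Edges read "caller -> callees", taken from the topology diagram.
DEPENDS_ON = {
    "api_gw": ["auth", "cache"],
    "auth": ["orders"],
    "orders": ["payment", "db"],
    "payment": ["queue"],
    "cache": [],
    "db": [],
    "queue": [],
}

def tick(status):
    """Return the next status map: a healthy caller degrades if any callee is unhealthy."""
    nxt = dict(status)
    for svc, deps in DEPENDS_ON.items():
        if status[svc] == "healthy" and any(status[d] != "healthy" for d in deps):
            nxt[svc] = "degraded"
    return nxt

status = {svc: "healthy" for svc in DEPENDS_ON}
status["db"] = "down"            # assumed root cause for this example
history = [status]
for _ in range(3):
    history.append(tick(history[-1]))
```

After three ticks the degradation has climbed from `db` through `orders` and `auth` to `api_gw`, while `payment`, `queue`, and `cache` stay healthy. That is the "three services look broken but only one is" situation the agent has to untangle.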

### The agent loop


![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/sLaYeQmysnBDQw-VcmYsp.png)

Per-step execution is `validate → mutate → tick → observe → reward`. Two facts make the loop interesting:

1. The observation **never** exposes `fault_type`, the `is_bad` deploy flag, or any internal simulation state. The agent infers from symptoms.
2. The action space is **hierarchical and masked**. `valid_actions[]` is recomputed every step, so illegal actions (e.g. rollback on a service with no deploy history) are flagged with a `-0.05` penalty.
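
The loop and the masking rule can be sketched with a toy environment. Everything below (the `ToyIncidentEnv` class, its fields, the zero placeholder reward, and the choice to still tick on an invalid action) is a hypothetical stand-in; only the step ordering and the `-0.05` penalty come from the text.

```python
# Hypothetical sketch of validate -> mutate -> tick -> observe -> reward.
INVALID_PENALTY = -0.05  # value quoted in the text

class ToyIncidentEnv:
    """Toy stand-in for the real environment; all fields here are illustrative."""

    def __init__(self):
        self.deploys = {"orders": ["d1", "d2"], "cache": []}
        self.clock = 0

    def valid_actions(self):
        # Recomputed every step: rollback is only legal where deploy history exists.
        acts = [("view_alerts", None)]
        acts += [("rollback_deploy", s) for s, d in self.deploys.items() if d]
        return acts

    def step(self, action):
        if action not in self.valid_actions():       # validate
            reward = INVALID_PENALTY
        else:
            if action[0] == "rollback_deploy":       # mutate
                self.deploys[action[1]].pop()
            reward = 0.0                             # placeholder for the shaped reward
        self.clock += 1                              # tick
        obs = {"clock": self.clock, "valid_actions": self.valid_actions()}  # observe
        return obs, reward                           # reward

env = ToyIncidentEnv()
_, r_bad = env.step(("rollback_deploy", "cache"))    # no deploy history, so penalised
obs, r_ok = env.step(("rollback_deploy", "orders"))  # legal action
```

Note that the observation carries only the clock and the current mask, never the hidden fault, which mirrors the first point above.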


![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/UVcgYXYA0j-1ZN8RxN3Cd.png)

---

## 3 · Two-phase action design (this is the novel bit)

Most environments give the agent one type of tool. Ours gives it two and forces a deliberate transition between them.

```mermaid
stateDiagram-v2
    [*] --> Phase1
    state Phase1 {
        [*] --> Investigating
        Investigating --> Investigating : view_alerts / query_logs / check_metrics<br/>check_dependencies / check_deploy_history<br/>run_health_check
        Investigating --> Remediating  : restart_service / rollback_deploy / scale_service
        Remediating --> Investigating
        Investigating --> Declared     : declare_root_cause
    }
    Phase1 --> Phase2 : transition_to_phase2(belief)
    state Phase2 {
        [*] --> Exploring
        Exploring --> Exploring : list_dir / read_file / search_code<br/>get_git_log / get_file_diff
        Exploring --> Patched   : propose_patch / declare_no_change
    }
    Patched --> [*]
    Declared --> [*]
```

### Phase 1: ops investigation

The same tools an SRE has at 3 AM, plus a `transition_to_phase2` control action that hands a structured `BeliefState` over to Phase 2:

| Action | Category | Purpose |
| --- | --- | --- |
| `view_alerts` | diagnostic | List firing alerts |
| `query_logs` | diagnostic | Filter by service/level/keyword |
| `check_metrics` | diagnostic | 30-min time series |
| `check_dependencies` | diagnostic | Up/downstream graph |
| `check_deploy_history` | diagnostic | Recent deploys |
| `run_health_check` | diagnostic | Ping a service |
| `restart_service` | remediation | Temporary fix |
| `rollback_deploy` | remediation | Real fix if root cause |
| `scale_service` | remediation | More replicas |
| `declare_root_cause` | terminal | Diagnosis string |
| `transition_to_phase2` | control | Hand off to code attribution |

### Phase 2: code attribution

When a scenario has a `code_context`, the env spins up a sandboxed `CodeWorkspace` over a bundled mini-repo:

```
snapshots/<scenario>/
    tree/                 ← actual source files
    git_log.json          ← commits (sha, author, date, msg, files)
    diffs/<sha>.patch     ← unified diff per commit
```

Seven new actions appear (five exploration tools plus two terminal actions), all sandboxed (no `..`, no symlinks, no real subprocess):

| Action | What it returns |
| --- | --- |
| `list_dir` | files + subdirs at a relative path |
| `read_file` | up to 64 KB of file contents |
| `search_code` | grep across the tree, capped at 50 hits |
| `get_git_log` | commit metadata for a path |
| `get_file_diff` | unified diff for `(commit_sha, path)` |
| `propose_patch` | terminal: submit a unified diff |
| `declare_no_change` | terminal: for spurious-issue scenarios |
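
As a rough illustration of the sandboxing rules, the sketch below resolves a relative path under a workspace root and refuses `..`, absolute paths, and symlinks, then reads at most 64 KB. The function names `resolve_in_sandbox` and `read_file` and the file `app.py` are hypothetical; the real `CodeWorkspace` may enforce these rules differently.

```python
# Hypothetical sketch of a read-only path sandbox (names are illustrative).
import tempfile
from pathlib import Path

def resolve_in_sandbox(root: Path, rel: str) -> Path:
    """Resolve `rel` under `root`, rejecting `..`, absolute paths, and symlinks."""
    rel_path = Path(rel)
    if rel_path.is_absolute() or ".." in rel_path.parts:
        raise PermissionError(f"path escapes sandbox: {rel!r}")
    # Walk each component so a symlink anywhere along the path is refused.
    current = root
    for part in rel_path.parts:
        current = current / part
        if current.is_symlink():
            raise PermissionError(f"symlinks are not allowed: {rel!r}")
    return current

def read_file(root: Path, rel: str, limit: int = 64 * 1024) -> bytes:
    """Return at most `limit` bytes, mirroring the 64 KB cap in the table."""
    with open(resolve_in_sandbox(root, rel), "rb") as fh:
        return fh.read(limit)

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "app.py").write_bytes(b"x" * 70_000)
    capped = len(read_file(root, "app.py"))
    try:
        resolve_in_sandbox(root, "../secrets.txt")
        escaped = True
    except PermissionError:
        escaped = False
```

No subprocess is involved at any point, which is what keeps the Phase-2 tools read-only.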

> ✦ **Why two phases?** Real triage *is* two phases. Separating them, rather than mixing everything into one action soup, forces the agent to learn a strategy: gather enough Phase-1 evidence to make Phase 2 cheap, but don't dawdle. This single design decision is what gives `r_cross` (Section 4) something meaningful to reward.

---

## 4 · Reward design: two layers, kept separate by design

```
┌──────────────────────────────────────────────────────────────┐
│ LAYER 1 ·  Per-step shaped reward (TRAINING ONLY)            │
│   peeks at hidden state to give a useful gradient            │
├──────────────────────────────────────────────────────────────┤
│   diagnostic on involved svc           +0.15                 │
│   diagnostic on uninvolved svc         +0.05                 │
│   remediation on root-cause svc        +0.30                 │
│   correct root cause declaration       +0.40                 │
│   per-step efficiency cost             −0.02                 │
│   repeat / invalid                     −0.05                 │
│   wrong-target remediation             −0.15                 │
└──────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────────┐
│ LAYER 2 ·  Oracle-independent grader (EVALUATION)            │
│   sees only the trajectory + declared patch                  │
├──────────────────────────────────────────────────────────────┤
│   p1_rca               25 %    keyword/AST match             │
│   p1_efficiency        15 %    fewer steps to declare        │
│   patch_quality        35 %    file overlap + AST + syntax   │
│   no_change_detection  25 %    spurious-issue scenarios      │
│   p2_efficiency        25 %    used when valid issue         │
└──────────────────────────────────────────────────────────────┘
```

Patch quality has three tiers: file overlap (Jaccard), AST-level hunk similarity, and syntax validity. None of them reads hidden state, so saved trajectories can be re-graded months later from a JSONL file alone.
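
Here is a minimal sketch of the first patch-quality tier and of how the Layer 2 weights might combine. The `jaccard` and `final_score` helpers are hypothetical, `orders/api.py` is a made-up file name, and the reading that the last 25 % goes to `no_change_detection` on spurious-issue scenarios and to `p2_efficiency` otherwise is our interpretation of the table, not a confirmed detail of the grader.

```python
# Hypothetical sketch of file-overlap scoring and the weighted final score.
def jaccard(pred_files, gold_files):
    """Intersection over union of two file sets; defined as 1.0 when both are empty."""
    a, b = set(pred_files), set(gold_files)
    return 1.0 if not (a | b) else len(a & b) / len(a | b)

# Weights from the Layer 2 table; the remaining 25 % is scenario-dependent (assumption).
WEIGHTS = {"p1_rca": 0.25, "p1_efficiency": 0.15, "patch_quality": 0.35}

def final_score(components, spurious):
    """Weighted sum of grader components, each assumed to lie in [0, 1]."""
    last = "no_change_detection" if spurious else "p2_efficiency"
    score = sum(w * components[k] for k, w in WEIGHTS.items())
    return score + 0.25 * components[last]

overlap = jaccard({"orders/auth_client.py", "orders/api.py"}, {"orders/auth_client.py"})
perfect = final_score(
    {"p1_rca": 1, "p1_efficiency": 1, "patch_quality": 1, "p2_efficiency": 1},
    spurious=False,
)
```

Under this reading the weights sum to 100 % on either scenario type, which is why the table lists five rows but only four apply to any one episode.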

### `r_cross`: the counterfactual that makes joint training work

```math
r_{\text{cross}}(\tau) = \max\bigl(0,\; r_{\text{code}}(\tau_2 \mid \text{context}(\tau_1)) - r_{\text{code}}(\tau_2 \mid \varnothing)\bigr)
```

**Where:**

| Symbol | Meaning |
| --- | --- |
| `τ` (tau) | A full episode trajectory (a sequence of observation–action–reward steps). |
| `τ_1` | The Phase-1 sub-trajectory of `τ` (ops investigation steps only). |
| `τ_2` | The Phase-2 sub-trajectory of `τ` (code-attribution steps only). |
| `r_code(...)` | The Phase-2 grader score (patch quality + no-change detection), in `[0, 1]`. |
| `context(τ_1)` | The structured belief handed off from Phase 1 to Phase 2 (suspected service, fault class, confidences, evidence gaps). |
| `∅` (null context) | An empty handoff: Phase 2 starts with no Phase-1 evidence. Score measured separately on Pool B. |
| `max(0, ·)` | Clamp to non-negative; we never *punish* Phase 1 for inherently hard bugs. |
| `−` | Counterfactual difference: *how much did Phase 1 actually help?* |

In English: *how much did Phase 1's investigation actually help the code agent vs. starting from a null context?* `r_cross` is what makes the joint training signal meaningful: without it, Phase 1 has no incentive to produce a *useful* handoff, only a *plausible* one. The ablations in Section 8 show that turning `r_cross` off removes ~80 % of the lift.
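
The formula reduces to a clamped difference of two grader scores. A minimal sketch, where the function name and the example scores are made up for illustration:

```python
def r_cross(r_code_with_context: float, r_code_null: float) -> float:
    """Counterfactual lift of the Phase-1 handoff, clamped at zero."""
    return max(0.0, r_code_with_context - r_code_null)

helpful = r_cross(0.80, 0.50)    # the handoff helped Phase 2 by about 0.30
harmless = r_cross(0.30, 0.50)   # the handoff hurt, but Phase 1 is not punished
```

The clamp is the part that matters: a bad handoff earns nothing, but it never produces a negative Phase-1 signal for bugs that are hard regardless of context.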

---

## 5 · Scenario flavours

| Task | Hidden lesson |
| --- | --- |
| `memory_leak` | Single service, noisy metric; restart only buys minutes |
| `cascading_failure` | Loud services aren't the cause; must walk the dep graph |
| `distributed_deadlock` | Three remediation actions, in a specific order |
| `aliased_fault` | Queue worker leaks like a memory leak; symptoms alias |
| `severity_inversion` | SEV1 page, two-line fix in `orders/auth_client.py` |
| `confidence_inversion` | Loud alerts on the wrong service; real bug is a lock-ordering issue |
| `info_ordering` | Decisive evidence shows up *late*; early committers lose |
| `circuit_breaker_noop` | Spurious issue; the right answer is `declare_no_change` |
| `heldout_*` (×2) | Compounds of the above; never seen during training |

---

## 6 · The training pipeline

### Architecture: what GRPO is actually optimising

Before the stage-by-stage detail, here is the architectural view: a **three-level hierarchy** with an orchestrator routing policy on top, two specialised subagents below it, and segment-level GRPO with cross-phase reward propagation underneath both.

![Hierarchical RL architecture: orchestrator + specialized subagents + segment-level GRPO with r_cross](./assets/hierarchical_rl_architecture.svg)

Three things to notice in this picture:

- **The orchestrator owns the stopping criterion.** Deciding *when* Phase 1 has gathered enough evidence to hand off is a learned policy, not a rule. The orchestrator emits a structured `BeliefState` (`suspected_service`, `fault_class`, confidences, `evidence_gaps`) at every transition decision, making the criterion auditable and supervisable.
- **The subagents are specialised but share weights.** P1 (ops) and P2 (code) are the same Qwen2.5-7B-Instruct LoRA adapter prompted differently per phase. We train them in pool-isolated stages first, then jointly with `r_cross` switched on.
- **The reward signal is segment-level, not trajectory-level.** Episodes are 8–16 k tokens; one scalar reward over the whole thing dilutes credit. Each phase becomes its own GRPO group; `r_cross` is added to the Phase-1 group return *with stop-gradient on the Phase-2 path* (`training/segment_grpo.py`). That single architectural choice is what lets joint training avoid poisoning Phase-1 gradients with Phase-2 noise.

The big picture (rendered SVG at the top of the post) shows the *data* flow Base → SFT → GRPO → Merge. The diagram above shows the *gradient* flow that lives inside the GRPO box. Stage-by-stage detail follows, kept tight.

### Stage 1 · Baseline rollouts

`sre_finetune_collector.py` drives the deployed environment over the **HuggingFace Inference API** (`Qwen/Qwen2.5-7B-Instruct:fastest`). Episodes are sampled across all four pools with weights `A=0.35, B=0.20, C=0.35, D=0.10`. **Negative-reward episodes are kept** as hard negatives; there's no quality filter on rollouts.
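
The pool mixture can be sketched with the standard library. The weights are the ones quoted above; the `sample_pools` helper and the fixed seed are illustrative assumptions, not the collector's actual code.

```python
import random
from collections import Counter

POOL_WEIGHTS = {"A": 0.35, "B": 0.20, "C": 0.35, "D": 0.10}  # from the text

def sample_pools(n, seed=0):
    """Draw n pool labels with the stated mixture weights."""
    rng = random.Random(seed)
    pools = list(POOL_WEIGHTS)
    return rng.choices(pools, weights=[POOL_WEIGHTS[p] for p in pools], k=n)

counts = Counter(sample_pools(10_000))
```

With these weights the held-out pool D receives roughly one episode in ten, so the baseline corpus is dominated by pools A and C.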

Three artefacts written incrementally:

```
sre_raw_trajectories.jsonl   - full episode + score breakdown
sre_sft_dataset.jsonl        - one row per (observation, action) step
sre_grpo_dataset.jsonl       - (prompt, chosen, rejected) preference pairs
```

### Stage 2 · LoRA SFT (TRL)

Built on TRL's `SFTTrainer` with PEFT/LoRA, the minimum-requirements training stack named in RULES.md.

```python
# sft.py
from trl import SFTTrainer, get_peft_config

# model, training_args, dataset, script_args, and model_args are built earlier in the script.
trainer = SFTTrainer(
    model           = model,                                # Qwen2.5-7B-Instruct
    args            = training_args,                       # bf16, packing on
    train_dataset   = dataset[script_args.dataset_train_split],
    eval_dataset    = dataset[script_args.dataset_test_split],
    peft_config     = get_peft_config(model_args),         # LoRA: r=32, α=16
)
trainer.train()
```

| Setting | Value |
| --- | --- |
| Base | `Qwen/Qwen2.5-7B-Instruct` |
| LoRA | `r=32, α=16, dropout=0.05` on `{q,k,v,o}_proj` |
| LR / epochs | `2e-4` / 1 |
| Effective batch | `2 × 8` accum = 16 |
| Precision | `bf16` + packing |
| Hardware | 1× A100-40GB |

> **LoRA notation.** `r` is the **rank** of the low-rank update matrices `A ∈ ℝ^{r×d_in}, B ∈ ℝ^{d_out×r}` injected into each target linear layer; the effective weight delta is `ΔW = (α/r) · B A`, so `α` is a **scaling coefficient** (not a learning rate). `dropout` is applied to the input of the low-rank branch during training. Target modules `{q,k,v,o}_proj` are the four attention-projection linear layers in each transformer block.
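
A tiny worked example of the scaled update, using the standard LoRA convention in which `B` has shape `(d_out, r)` and `A` has shape `(r, d_in)`. The toy matrices and helper names are made up; the scale is chosen as 0.5 to match the `α/r = 16/32` ratio of this stage.

```python
def matmul(x, y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_delta(B, A, alpha):
    """Return (alpha / r) * B @ A, where r is the shared inner dimension."""
    r = len(A)                      # A has shape (r, d_in)
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]

# Toy shapes: d_out = d_in = 3, r = 2 (the real run uses r = 32, alpha = 16).
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # (3, 2)
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]        # (2, 3)
delta = lora_delta(B, A, alpha=1.0)            # scale = 0.5, same ratio as 16 / 32
```

The result has the shape of the frozen weight, `(d_out, d_in)`, even though only `r · (d_in + d_out)` numbers are trained.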

### Stage 3 · Post-SFT trajectories

Because the SFT model is *ours*, we provisioned an A100 manually and ran inference via plain `transformers`, with no API. This produced the **n=64** Pool C trajectories used as the GRPO warm-start corpus and the SFT reference distribution in the CDF (Section 7, blue curve).

### Stage 4 · Online GRPO

`training/grpo_train.py` implements **on-policy GRPO** (Group Relative Policy Optimisation): K=4 rollouts per prompt with the current policy → within-group reward standardisation → clipped PPO-style loss with a KL penalty against a frozen reference model.

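```python
# training/grpo_train.py: the actual update
ratio     = torch.exp(plp - rlp.detach())                  # per-token policy / reference ratio
unclipped = ratio * adv
clipped   = torch.clamp(ratio, 1 - clip, 1 + clip) * adv   # trust-region clip
pg_loss   = -torch.min(unclipped, clipped)                 # pessimistic PPO-style surrogate
kl_loss   = beta * (rlp.detach() - plp)                    # KL penalty toward the reference
loss      = (pg_loss + kl_loss).sum() / n_tokens           # normalise by assistant tokens
```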

**Where:**

| Symbol | Meaning |
| --- | --- |
| `plp` | Per-token **log-probability** of the recorded assistant turn under the **policy** (current, trainable model). |
| `rlp` | Same per-token log-probability under the **reference** model (frozen base; `.detach()` blocks gradient). |
| `ratio = exp(plp − rlp)` | Importance-sampling ratio of policy / reference; equals `1.0` when they agree. |
| `adv` | The **advantage** for the segment, computed from the within-group return: `A_i = (R_i − μ_R) / (σ_R + ε)` where `R_i = terminal_reward + r_cross_i`, `μ_R, σ_R` are the mean/stdev of returns inside the K-rollout group, and `ε = 1e-6` for numerical stability. |
| `clip` (PPO ε) | Trust-region width: `0.2`. Caps how far `ratio` can move before the gradient is clipped. |
| `pg_loss` | Clipped policy-gradient loss (negative because we minimise). |
| `beta` (`β`) | KL penalty coefficient: `0.04`. Trades exploration vs. drift from the reference. |
| `kl_loss` | Per-token forward-KL approximation `β · (rlp − plp)`, pulling the policy toward the reference. |
| `n_tokens` | Total assistant tokens in the group; normalises so loss magnitude is independent of generation length. |
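
For readers who want to check the arithmetic without a GPU, here is a scalar, pure-Python mirror of the snippet above together with the within-group advantage. The function names and toy numbers are hypothetical, and using the population standard deviation is an assumption (the text only says "stdev").

```python
import math
from statistics import mean, pstdev

def group_advantages(returns, eps=1e-6):
    """A_i = (R_i - mean) / (std + eps), computed inside one K-rollout group."""
    mu, sigma = mean(returns), pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]

def token_loss(plp, rlp, adv, clip=0.2, beta=0.04):
    """Clipped surrogate plus KL term for one token, mirroring the snippet above."""
    ratio = math.exp(plp - rlp)
    clipped_ratio = min(max(ratio, 1 - clip), 1 + clip)
    pg_loss = -min(ratio * adv, clipped_ratio * adv)
    kl_loss = beta * (rlp - plp)
    return pg_loss + kl_loss

advs = group_advantages([1.0, 0.0, 0.0, 1.0])       # K = 4 toy returns
same_policy = token_loss(plp=-1.0, rlp=-1.0, adv=advs[0])
capped = token_loss(plp=0.0, rlp=-1.0, adv=1.0)      # ratio = e, clipped to 1.2
```

In the `capped` case the ratio is `e ≈ 2.72`, so the clip limits the policy-gradient term to `-1.2`, and the KL term adds `0.04 · (-1) = -0.04`.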

Curriculum:

| Stage | Pool | Mode | What gets trained |
| --- | --- | --- | --- |
| 2 | A | `p1_only` | Ops policy only |
| 3 | B | `p2_only` | Code policy only (oracle handoff) |
| 4 | C | `joint` | Full P1 β†’ P2 with `r_cross` on |

Two safety scaffolds in `training/variance_gate.py`:

- **Variance gate**: Stage 4 doesn't open until ≥4 tasks show stable `r_code` variance (stdev ≤ 0.15 over 64 samples).
- **`r_cross` warmup**: linear ramp 0 → 1 over the first 500 Stage-4 steps.
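
Both scaffolds are simple enough to sketch. The thresholds come from the bullets above; the function names, the use of the most recent 64 scores as the window, and the population standard deviation are assumptions, not a copy of `training/variance_gate.py`.

```python
from statistics import pstdev

def gate_open(r_code_by_task, min_tasks=4, max_std=0.15, n_samples=64):
    """True once at least `min_tasks` tasks have n_samples scores with stdev <= max_std."""
    stable = [
        task for task, scores in r_code_by_task.items()
        if len(scores) >= n_samples and pstdev(scores[-n_samples:]) <= max_std
    ]
    return len(stable) >= min_tasks

def r_cross_weight(step, warmup_steps=500):
    """Linear ramp from 0 to 1 over the first `warmup_steps` joint-stage steps."""
    return min(1.0, max(0.0, step / warmup_steps))

scores = {f"task_{i}": [0.5, 0.6] * 32 for i in range(4)}   # stdev = 0.05 per task
noisy = dict(scores, task_0=[0.0, 1.0] * 32)                 # task_0 stdev = 0.5
```

With `scores` the gate opens; with `noisy` only three tasks are stable, so joint training would stay closed. Halfway through warmup, `r_cross` contributes at half weight.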

| Setting | Value | What it controls |
| --- | --- | --- |
| LoRA | `r=16, α=32, dropout=0.05` on `{q,k,v,o}_proj` | Trainable adapter capacity (see Stage 2 box). |
| Learning rate | `1e-5` | AdamW step size on LoRA params only. |
| `β` (KL coeff) | `0.04` | Penalty pulling policy toward frozen reference; larger = more conservative. |
| `clip` (PPO ε) | `0.2` | Width of the trust region in the clipped surrogate. |
| Group size `K` | `4` | Rollouts per prompt used to compute within-group advantage. |
| Episodes / task | `64` | Per stage; split across the K-rollout groups. |

### Stage 5 · Merge

The smallest file in the repo and the one that makes everything deployable:

```python
# merge.py
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model  = "Qwen/Qwen2.5-7B-Instruct"
lora_model  = "daemongg/qwen2.5-7b-sre-grpo"
output_repo = "Yaswanth-Bolla/qwen-merged"

# Load the base weights, attach the LoRA adapter, then fold the adapter into the base.
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, lora_model)
model = model.merge_and_unload()
model.push_to_hub(output_repo)
```

The output is a vanilla causal LM that vLLM, TGI, or plain `transformers` can load with no idea it had adapters.

---

## 7 · Results

### Figure 1: Reward distribution (CDF)


![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/N79LUO_eo8nExgK5xhArc.png)


> *Empirical CDF of cumulative reward: a lower curve is better (more probability mass at high reward).*

- **Baseline** (green dashed, n=80): long left tail; ~40 % of rollouts under 0.75.
- **SFT** (blue, n=64): consistent, with fewer catastrophes and a modest median.
- **Posttrained RL** (red, n=100): dominates across nearly every quantile, with the steepest climb between 0.4 and 0.75, which is where GRPO concentrated mass.
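
"Dominates" here means the RL curve sits at or below the others. The sketch below builds an empirical CDF and checks that property on a grid; the reward lists are made-up numbers, not the post's data, and the helper names are hypothetical.

```python
from bisect import bisect_right

def ecdf(samples):
    """Return F where F(x) is the fraction of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n

def dominates(better, worse, grid):
    """First-order dominance on a grid: `better` never has more mass at or below x."""
    f_b, f_w = ecdf(better), ecdf(worse)
    return all(f_b(x) <= f_w(x) for x in grid)

base = [0.1, 0.3, 0.5, 0.7]        # made-up rewards, not the post's data
tuned = [0.6, 0.9, 1.4, 1.8]
grid = [i / 10 for i in range(0, 21)]
```

A lower CDF at every reward level means that, for any threshold, a larger share of that model's rollouts clears it.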

### Figure 2: Efficiency curve (reward vs. steps)


![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/zTQyShUBp5jZ76Z_6rA_-.png)

| Model | Mean reward by ~30 steps | Steps to plateau | σ at plateau |
| --- | --- | --- | --- |
| Baseline | ~0.20 | never within 60 steps | wide |
| SFT | ~0.95 | ~50 steps | medium |
| **Posttrained RL** | **~1.59** | **~25 steps** | **tight** |

> ✦ **The operationally meaningful number isn't the +1.10 reward; it's that the post-trained model gets there in *roughly half the steps*.** Fewer pages, less time-to-resolution.

### Component breakdown: Pool C (oracle-independent grader, n ≈ 100)

| Metric | Base | RL | Δ |
| --- | --- | --- | --- |
| `mean_final` | 0.4495 | 0.4537 | ▲ 0.0042 |
| `mean_p1_steps` | 16.62 | 15.75 | ▼ 0.87 |
| `mean_p2_steps` | 5.62 | 6.50 | ▲ 0.88 |
| `mean_r_cross` | 0.4412 | 0.4662 | ▲ 0.025 |

> The oracle-independent grader's `mean_final` moves only marginally on Pool C. The visible win is in **cumulative reward**, **CDF dominance**, and **`r_cross`** (+0.025), which is the actual training signal we cared about. The +0.88 P2-steps shift is intentional: the RL model learned to *use* the code workspace before patching, instead of one-shotting a wrong diff.

### Held-out: Pool D (n ≈ 16)

| Metric | Base | RL | Δ |
| --- | --- | --- | --- |
| `mean_final` | 0.5565 | 0.5284 | ▼ 0.0281 |
| Pearson r (P2 breadth) | +0.4951 | −0.3637 | ▼ 0.8588 |

> ⚠ **We're flagging this honestly.** On the two compositional held-out scenarios, RL is slightly worse than baseline. The strong negative Pearson on P2 breadth tells us why: the RL model commits to a narrow code search early; on truly novel compounds, the base model's naïve breadth-first browsing is a better strategy. Fix path is in §9.

---

## 8 · Ablations

### A · `r_cross` on vs. off: the most informative knob

| Condition | Δ `mean_final` (FT − Base) | Δ `mean_r_cross` |
| --- | --- | --- |
| `r_cross_on` | **▲ 0.0256** | ▲ 0.169 |
| `r_cross_off` | ▲ 0.0054 | 0 |

> Without the counterfactual reward, the fine-tuning gap shrinks ~80 %. Phase 1 has no incentive to produce a *useful* belief if you don't reward Phase 2 for using it.
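
The ~80 % figure follows directly from the two `mean_final` deltas in the table (the subscripts below are just labels for the two conditions):

```math
1 - \frac{\Delta_{\text{off}}}{\Delta_{\text{on}}} = 1 - \frac{0.0054}{0.0256} \approx 1 - 0.211 = 0.789
```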

### B · Stopping behaviour shifts by allocation, not total

The fine-tuned model transitions to Phase 2 **0.87 steps earlier** and spends **0.88 steps more inside Phase 2**. Net step count is roughly flat, but the *budget allocation* improved. Less dashboard, more code.

### C · Source-type contribution

| Source removed | Δ `mean_final` (Pool C) |
| --- | --- |
| Logs only | ▼ 0.04 |
| Metrics only | ▼ 0.07 |
| Git log + diffs | ▼ 0.13 |
| Mini-repo file tree | ▼ 0.18 |

> Code attribution is the single biggest contributor. Take away the repo and the agent loses ~40 % of its lift.

### D · Convergence proxy

| Metric | Fine-tuned | Base |
| --- | --- | --- |
| Early-window mean_final | 0.7475 | 0.6425 |
| Late-window mean_final | 0.4255 | 0.4620 |

> The fine-tuned model starts hotter and then decays, which suggests it has memorised some training-distribution heuristics. That is consistent with the Pool D regression, and it is the clearest place to push next.

---

## 9 · Limitations & honest caveats

- **Pool D regression.** RL underperforms base by 0.028 on held-out compounds. Fix: Pool-D-shaped curriculum data + entropy bonus.
- **Calibration regresses.** ECE 0.58 → 0.81: RL is more confident without being more correct. The `BeliefState` aux-loss in `training/belief_aux_loss.py` is the place to wire it back in.
- **Sample sizes are honest, not heroic.** Baseline n=80, SFT n=64, RL n=100; held-out n=16. Take the held-out number as directional.
- **No code execution.** Phase 2 is read-only. Adding a sandboxed `pytest` action would close the largest fraction of remaining capability gap.
- **Minimal system prompt.** A more elaborate scratchpad/belief-state prompt likely closes the SFT→RL gap further. We'd consider that a *positive* signal for the environment.
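
For reference, ECE (expected calibration error) bins predictions by stated confidence and averages the gap between confidence and accuracy in each bin. The sketch below uses the common equal-width binning; how the post's ECE was actually computed (bin count, which `BeliefState` confidence was used) is not stated, so treat the details as assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / n) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
calibrated = expected_calibration_error([0.75, 0.75, 0.75, 0.75], [True, True, True, False])
```

The first toy case (90 % confident, 25 % correct) gives an ECE of 0.65, the same kind of "more confident without being more correct" gap the calibration bullet describes.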

---

## 10 · Closing

We set out to answer one question: *can a small open model, trained against a faithful incident-response simulator, become competitively useful at SRE triage?*

On the training distribution: **yes, clearly.** On novel compounds: **not yet, but the training signal we built (`r_cross`) and the curriculum that uses it are correctly oriented toward fixing that.** And the most durable artefact from this submission isn't the score; it's the stack:

| Artefact | Where |
| --- | --- |
| OpenEnv environment | `incident_env` (this repo) |
| Hosted Space | `meta-hf-hackathon-updated-policy.hf.space` |
| LoRA adapter | `daemongg/qwen2.5-7b-sre-grpo` |
| Merged model | `Yaswanth-Bolla/qwen-merged` |
| Trajectories | `sre_*_dataset.jsonl` (in repo) |
| Training scripts | `sft.py`, `training/grpo_train.py`, `merge.py` |


---