# Teaching a 7B Model to Be On-Call
### An OpenEnv benchmark and a four-stage GRPO pipeline that turns Qwen2.5-7B into a working SRE triage agent
---
> **TL;DR.** We built `incident_env`, an OpenEnv POMDP where an LLM agent has to diagnose a live, evolving production incident and then attribute it to a specific commit in a small repo. Then we trained **Qwen2.5-7B-Instruct** through a four-stage curriculum (baseline rollouts → LoRA SFT → online GRPO with `r_cross` → merge). The post-trained model reaches a **mean cumulative reward of ≈1.59 vs ≈0.49** for the base, **at less than half the steps**, with tighter variance and a dominant CDF across the operating range.

> **One-line pitch.** *Most agent benchmarks freeze a repo and ask the model to fix it. Our environment refuses to sit still: memory climbs, alerts cascade, and the obvious symptom is almost never the cause.*
---
## 1 · Why this benchmark didn't exist yet
Pick any list of agentic LLM benchmarks today and you'll see two clusters:
| Cluster | Examples | What they miss |
| --- | --- | --- |
| **Frozen-repo coding** | SWE-bench, RepoBench, HumanEval | No evolving system, no observability, no alerts |
| **Tool-use chains** | AgentBench, ToolBench, τ-bench | Plenty of API calls, but no reactive simulator |
Neither cluster matches the workflow that consumes the most engineer-hours at any company running real systems: **on-call triage**. A pager fires. A graph is wrong. Three services look broken but only one *is* broken. Someone has to triangulate, propose a fix, and identify the offending commit, all under SLA pressure and with partial information.
That gap is exactly what `incident_env` fills.
> ✦ **Capability gap.** Today's LLMs can read a static repo. They cannot yet diagnose a system whose state changes while they're looking at it.
---
## 2 · Environment at a glance
`incident_env` is an OpenEnv `Environment`: clean Gym-style `reset()` / `step()` / `state` plus a `/score` endpoint for the oracle-independent grader. Under the hood it is a **reactive, partially-observable, two-phase** simulator.
### Topology: seven reactive services
```
┌─────────┐    ┌─────┐    ┌────────┐    ┌─────────┐
│ API GW  │───▶│Auth │───▶│ Orders │───▶│ Payment │
└────┬────┘    └─────┘    └───┬────┘    └────┬────┘
     ▼                        ▼              ▼
┌─────────┐              ┌─────────┐    ┌─────────┐
│  Cache  │              │   DB    │    │  Queue  │
└─────────┘              └─────────┘    └─────────┘
```
Each service has live metric history (CPU, memory, p50/p95/p99 latency, error rate, RPS), structured logs, deploy history, and a `healthy | degraded | down` status. Faults propagate along this graph each `tick()`. Restarting a downstream service buys minutes; rolling back the wrong deploy makes things worse.
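
As a rough illustration of how a fault might spread along this graph on each `tick()`, here is a minimal sketch. The edge directions (read off the diagram as "caller → dependency"), the lowercase service keys, and the one-hop-per-tick rule are all assumptions made for illustration; the real simulator's propagation rules are not spelled out in this post.

```python
# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON = {
    "api_gw": ["auth", "cache"],
    "auth": ["orders"],
    "orders": ["payment", "db"],
    "payment": ["queue"],
    "cache": [],
    "db": [],
    "queue": [],
}


def tick(status: dict[str, str]) -> dict[str, str]:
    """One propagation step: a healthy service degrades if any dependency is unhealthy."""
    nxt = dict(status)
    for svc, deps in DEPENDS_ON.items():
        if status[svc] == "healthy" and any(status[d] != "healthy" for d in deps):
            nxt[svc] = "degraded"
    return nxt


status = {svc: "healthy" for svc in DEPENDS_ON}
status["db"] = "down"          # inject a fault at the database
after_one = tick(status)       # reaches the direct caller (orders)
after_two = tick(after_one)    # reaches the caller's caller (auth)
```

Under these assumptions the loudest service after a few ticks is not the one that failed first, which is the behaviour the scenarios are built around.
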
### The agent loop

Per-step execution is `validate → mutate → tick → observe → reward`. Two facts make the loop interesting:
1. The observation **never** exposes `fault_type`, the `is_bad` deploy flag, or any internal simulation state. The agent infers from symptoms.
2. The action space is **hierarchical and masked**. `valid_actions[]` is recomputed every step, so illegal actions (e.g. rollback on a service with no deploy history) are flagged with a `-0.05` penalty.
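
The five-stage step order and the action mask can be sketched as follows. `ToyEnv`, its two services, and its action tuples are hypothetical stand-ins and not the real `incident_env` API; only the `-0.05` penalty and the "no rollback without deploy history" rule come from the text above.

```python
from dataclasses import dataclass, field

INVALID_PENALTY = -0.05  # penalty quoted above for actions outside valid_actions


@dataclass
class ToyEnv:
    """Hypothetical two-service environment used only to show the step order."""
    deploys: dict = field(default_factory=lambda: {"orders": ["d1"], "cache": []})
    clock: int = 0

    def valid_actions(self) -> list[tuple[str, str]]:
        acts = [("check_metrics", svc) for svc in self.deploys]
        # Rollback is only legal where there is deploy history.
        acts += [("rollback_deploy", svc) for svc, hist in self.deploys.items() if hist]
        return acts

    def step(self, action: tuple[str, str]) -> tuple[dict, float]:
        valid = action in self.valid_actions()                 # validate
        if valid and action[0] == "rollback_deploy":
            self.deploys[action[1]].pop()                      # mutate
        self.clock += 1                                        # tick
        obs = {"t": self.clock, "valid_actions": self.valid_actions()}  # observe
        reward = 0.0 if valid else INVALID_PENALTY             # reward
        return obs, reward


env = ToyEnv()
obs_bad, r_bad = env.step(("rollback_deploy", "cache"))    # cache has no deploys
obs_ok, r_ok = env.step(("rollback_deploy", "orders"))     # legal exactly once
```

Note that the mask is recomputed after the mutation, so the second rollback on `orders` would no longer be offered.
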

---
## 3 · Two-phase action design (this is the novel bit)
Most environments give the agent one type of tool. Ours gives it two, and forces a deliberate transition between them.
```mermaid
stateDiagram-v2
[*] --> Phase1
state Phase1 {
[*] --> Investigating
Investigating --> Investigating : view_alerts / query_logs / check_metrics<br/>check_dependencies / check_deploy_history<br/>run_health_check
Investigating --> Remediating : restart_service / rollback_deploy / scale_service
Remediating --> Investigating
Investigating --> Declared : declare_root_cause
}
Phase1 --> Phase2 : transition_to_phase2(belief)
state Phase2 {
[*] --> Exploring
Exploring --> Exploring : list_dir / read_file / search_code<br/>get_git_log / get_file_diff
Exploring --> Patched : propose_patch / declare_no_change
}
Patched --> [*]
Declared --> [*]
```
### Phase 1: ops investigation
The same tools an SRE has at 3 AM, plus a `transition_to_phase2` control action that hands a structured `BeliefState` over to Phase 2:
| Action | Category | Purpose |
| --- | --- | --- |
| `view_alerts` | diagnostic | List firing alerts |
| `query_logs` | diagnostic | Filter by service/level/keyword |
| `check_metrics` | diagnostic | 30-min time series |
| `check_dependencies` | diagnostic | Up/downstream graph |
| `check_deploy_history` | diagnostic | Recent deploys |
| `run_health_check` | diagnostic | Ping a service |
| `restart_service` | remediation | Temporary fix |
| `rollback_deploy` | remediation | Real fix if root cause |
| `scale_service` | remediation | More replicas |
| `declare_root_cause` | terminal | Diagnosis string |
| `transition_to_phase2` | control | Hand off to code attribution |
### Phase 2: code attribution
When a scenario has a `code_context`, the env spins up a sandboxed `CodeWorkspace` over a bundled mini-repo:
```
snapshots/<scenario>/
  tree/              ← actual source files
  git_log.json       ← commits (sha, author, date, msg, files)
  diffs/<sha>.patch  ← unified diff per commit
```
Seven new actions appear (five read-only exploration tools and two terminal actions), all sandboxed (no `..`, no symlinks, no real subprocess):
| Action | What it returns |
| --- | --- |
| `list_dir` | files + subdirs at a relative path |
| `read_file` | up to 64 KB of file contents |
| `search_code` | grep across the tree, capped at 50 hits |
| `get_git_log` | commit metadata for a path |
| `get_file_diff` | unified diff for `(commit_sha, path)` |
| `propose_patch` | terminal: submit a unified diff |
| `declare_no_change` | terminal: for spurious-issue scenarios |
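
One way the path sandboxing and the 64 KB read cap could be enforced is sketched below. The function names are hypothetical and this is not the actual `CodeWorkspace` code; it simply rejects absolute paths, `..` components, and anything that resolves (through a symlink or otherwise) outside the workspace root.

```python
import tempfile
from pathlib import Path


def resolve_in_sandbox(root: Path, rel: str) -> Path:
    """Resolve rel under root; refuse absolute paths, '..' and symlink escapes."""
    if Path(rel).is_absolute() or ".." in Path(rel).parts:
        raise PermissionError(f"path escapes sandbox: {rel}")
    root = root.resolve()
    target = (root / rel).resolve()          # resolve() follows symlinks
    if not target.is_relative_to(root):      # Python 3.9+
        raise PermissionError(f"path escapes sandbox: {rel}")
    return target


def read_file(root: Path, rel: str, limit: int = 64 * 1024) -> str:
    """Return at most `limit` bytes of the file, decoded leniently."""
    with resolve_in_sandbox(root, rel).open("rb") as fh:
        return fh.read(limit).decode("utf-8", errors="replace")


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / "orders").mkdir()
    (workspace / "orders" / "auth_client.py").write_bytes(b"TIMEOUT = 5\n")
    contents = read_file(workspace, "orders/auth_client.py")
    try:
        read_file(workspace, "../etc/passwd")
        escaped = True
    except PermissionError:
        escaped = False
```
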
> ✦ **Why two phases?** Real triage *is* two phases. Mixing them in one action soup forces the agent to learn a strategy: gather enough Phase-1 evidence to make Phase-2 cheap, but don't dawdle. This single design decision is what gives `r_cross` (Section 4) something meaningful to reward.
---
## 4 · Reward design: two layers, kept separate by design
```
┌───────────────────────────────────────────────────────────────┐
│ LAYER 1 · Per-step shaped reward (TRAINING ONLY)              │
│ peeks at hidden state to give a useful gradient               │
├───────────────────────────────────────────────────────────────┤
│ diagnostic on involved svc            +0.15                   │
│ diagnostic on uninvolved svc          +0.05                   │
│ remediation on root-cause svc         +0.30                   │
│ correct root cause declaration        +0.40                   │
│ per-step efficiency cost              -0.02                   │
│ repeat / invalid                      -0.05                   │
│ wrong-target remediation              -0.15                   │
└───────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌───────────────────────────────────────────────────────────────┐
│ LAYER 2 · Oracle-independent grader (EVALUATION)              │
│ sees only the trajectory + declared patch                     │
├───────────────────────────────────────────────────────────────┤
│ p1_rca                25 %   keyword/AST match                │
│ p1_efficiency         15 %   fewer steps to declare           │
│ patch_quality         35 %   file overlap + AST + syntax      │
│ no_change_detection   25 %   spurious-issue scenarios         │
│ p2_efficiency         25 %   used when valid issue            │
└───────────────────────────────────────────────────────────────┘
```
Patch quality has three tiers: file overlap (Jaccard), AST-level hunk similarity, and syntax validity. None of them reads hidden state, so saved trajectories can be re-graded months later from a JSONL file alone.
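
The first tier, Jaccard overlap between the files a proposed patch touches and the files the reference fix touches, might look like the sketch below. The header-parsing regex and the helper names are assumptions, and `orders/api.py` is a made-up second file; the grader's real parsing is not shown in this post.

```python
import re


def files_in_diff(diff: str) -> set[str]:
    """Collect paths from the '+++ b/<path>' headers of a unified diff."""
    return set(re.findall(r"^\+\+\+ b/(\S+)", diff, flags=re.MULTILINE))


def jaccard(a: set[str], b: set[str]) -> float:
    """|a ∩ b| / |a ∪ b|, treating two empty sets as a perfect match."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


proposed = (
    "--- a/orders/auth_client.py\n+++ b/orders/auth_client.py\n@@ -1 +1 @@\n-x\n+y\n"
    "--- a/orders/api.py\n+++ b/orders/api.py\n@@ -1 +1 @@\n-a\n+b\n"
)
reference = "--- a/orders/auth_client.py\n+++ b/orders/auth_client.py\n@@ -1 +1 @@\n-x\n+z\n"

overlap = jaccard(files_in_diff(proposed), files_in_diff(reference))
```

Here the proposed patch hits the right file but also touches an extra one, so the overlap is one shared file out of two in the union.
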
### `r_cross`: the counterfactual that makes joint training work
```math
r_{\mathrm{cross}}(\tau) = \max\bigl(0,\; r_{\mathrm{code}}(\tau_2 \mid \mathrm{context}(\tau_1)) - r_{\mathrm{code}}(\tau_2 \mid \varnothing)\bigr)
```
**Where:**
| Symbol | Meaning |
| --- | --- |
| `τ` (tau) | A full episode trajectory (a sequence of observation–action–reward steps). |
| `τ_1` | The Phase-1 sub-trajectory of `τ` (ops investigation steps only). |
| `τ_2` | The Phase-2 sub-trajectory of `τ` (code-attribution steps only). |
| `r_code(...)` | The Phase-2 grader score (patch quality + no-change detection), in `[0, 1]`. |
| `context(τ_1)` | The structured belief handed off from Phase 1 to Phase 2 (suspected service, fault class, confidences, evidence gaps). |
| `∅` (null context) | An empty handoff: Phase 2 starts with no Phase-1 evidence. Score measured separately on Pool B. |
| `max(0, ·)` | Clamp to non-negative; we never *punish* Phase 1 for inherently hard bugs. |
| `−` | Counterfactual difference: *how much did Phase 1 actually help?* |
In English: *how much did Phase 1's investigation actually help the code agent vs. starting from a null context?* `r_cross` is what makes the joint training signal meaningful: without it, Phase 1 has no incentive to produce a *useful* handoff, only a *plausible* one. We will show in the ablations that turning `r_cross` off collapses ~80 % of the lift.
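
A tiny worked example of the clamp, assuming both `r_code` scores have already been computed by the grader (the function and argument names here are illustrative, and the numbers are made up):

```python
def r_cross(r_code_with_context: float, r_code_null: float) -> float:
    """Counterfactual lift from the Phase-1 handoff, clamped at zero."""
    return max(0.0, r_code_with_context - r_code_null)


helped = r_cross(0.75, 0.50)   # handoff improved the patch score
hurt = r_cross(0.25, 0.50)     # handoff made things worse: clamped, not punished
```

The second case is the reason for the clamp: a misleading handoff earns nothing, but Phase 1 is never pushed below zero for a bug that was hard regardless.
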
---
## 5 · Scenario flavours
| Task | Hidden lesson |
| --- | --- |
| `memory_leak` | Single service, noisy metric; restart only buys minutes |
| `cascading_failure` | Loud services aren't the cause; must walk the dep graph |
| `distributed_deadlock` | Three remediation actions, in a specific order |
| `aliased_fault` | Queue worker leaks like a memory leak; symptoms alias |
| `severity_inversion` | SEV1 page, two-line fix in `orders/auth_client.py` |
| `confidence_inversion` | Loud alerts on the wrong service; real bug is a lock-ordering issue |
| `info_ordering` | Decisive evidence shows up *late*; early committers lose |
| `circuit_breaker_noop` | Spurious issue; the right answer is `declare_no_change` |
| `heldout_*` (×2) | Compounds of the above; never seen during training |
---
## 6 · The training pipeline
### Architecture: what GRPO is actually optimising
Before the stage-by-stage detail, here is the architectural view: a **three-level hierarchy** with an orchestrator routing policy on top, two specialised subagents below it, and segment-level GRPO with cross-phase reward propagation underneath both.

Three things to notice in this picture:
- **The orchestrator owns the stopping criterion.** Deciding *when* Phase 1 has gathered enough evidence to hand off is a learned policy, not a rule. The orchestrator emits a structured `BeliefState` (`suspected_service`, `fault_class`, confidences, `evidence_gaps`) at every transition decision, making the criterion auditable and supervisable.
- **The subagents are specialised but share weights.** P1 (ops) and P2 (code) are the same Qwen2.5-7B-Instruct LoRA adapter prompted differently per phase. We train them in pool-isolated stages first, then jointly with `r_cross` switched on.
- **The reward signal is segment-level, not trajectory-level.** Episodes are 8–16 k tokens; one scalar reward over the whole thing dilutes credit. Each phase becomes its own GRPO group; `r_cross` is added to the Phase-1 group return *with stop-gradient on the Phase-2 path* (`training/segment_grpo.py`). That single architectural choice is what lets joint training avoid poisoning Phase-1 gradients with Phase-2 noise.
The big picture (rendered SVG at the top of the post) shows the *data* flow Base → SFT → GRPO → Merge. The diagram above shows the *gradient* flow that lives inside the GRPO box. Stage-by-stage detail follows, kept tight.
### Stage 1 · Baseline rollouts
`sre_finetune_collector.py` drives the deployed environment over the **HuggingFace Inference API** (`Qwen/Qwen2.5-7B-Instruct:fastest`). Episodes are sampled across all four pools with weights `A=0.35, B=0.20, C=0.35, D=0.10`. **Negative-reward episodes are kept** as hard negatives; there's no quality filter on rollouts.
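
The weighted pool sampling could be as simple as the following sketch. Only the four weights come from the text; the function name, the seeding, and the use of `random.choices` are assumptions rather than what `sre_finetune_collector.py` necessarily does.

```python
import random
from collections import Counter

POOL_WEIGHTS = {"A": 0.35, "B": 0.20, "C": 0.35, "D": 0.10}


def sample_pools(n: int, seed: int = 0) -> list[str]:
    """Draw n pool labels with replacement, proportional to POOL_WEIGHTS."""
    rng = random.Random(seed)  # seeded so a collection run is reproducible
    return rng.choices(list(POOL_WEIGHTS), weights=list(POOL_WEIGHTS.values()), k=n)


counts = Counter(sample_pools(10_000))
```
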
Three artefacts written incrementally:
```
sre_raw_trajectories.jsonl  ← full episode + score breakdown
sre_sft_dataset.jsonl       ← one row per (observation, action) step
sre_grpo_dataset.jsonl      ← (prompt, chosen, rejected) preference pairs
```
### Stage 2 · LoRA SFT (TRL)
Built on TRL's `SFTTrainer` with PEFT/LoRA, the minimum-requirements training stack named in RULES.md.
```python
# sft.py (excerpt): model, training_args, dataset, script_args and model_args
# are assumed to be defined earlier in the script.
from trl import SFTTrainer, get_peft_config

trainer = SFTTrainer(
    model=model,                               # Qwen2.5-7B-Instruct
    args=training_args,                        # bf16, packing on
    train_dataset=dataset[script_args.dataset_train_split],
    eval_dataset=dataset[script_args.dataset_test_split],
    peft_config=get_peft_config(model_args),   # LoRA: r=32, α=16
)
trainer.train()
```
| Setting | Value |
| --- | --- |
| Base | `Qwen/Qwen2.5-7B-Instruct` |
| LoRA | `r=32, α=16, dropout=0.05` on `{q,k,v,o}_proj` |
| LR / epochs | `2e-4` / 1 |
| Effective batch | `2 × 8` accum = 16 |
| Precision | `bf16` + packing |
| Hardware | 1× A100-40GB |
> **LoRA notation.** `r` is the **rank** of the low-rank update matrices `A ∈ ℝ^{r×d}, B ∈ ℝ^{d×r}` injected into each target linear; the effective weight delta is `ΔW = (α/r) · B A`, so `α` is a **scaling coefficient** (not a learning rate). `dropout` is applied to the input of the adapter branch during training. Target modules `{q,k,v,o}_proj` are the four attention-projection linears in each transformer block.
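
To make the scaling concrete, here is a dependency-free toy computation of `ΔW = (α/r) · B A`, using the standard shapes `B ∈ ℝ^{d×r}` and `A ∈ ℝ^{r×d}`. The matrices are made-up rank-1 examples; only the `(r, α)` pairs in the last two lines come from the settings tables in this post.

```python
def lora_scale(alpha: float, r: int) -> float:
    """The multiplier applied to B @ A."""
    return alpha / r


def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_delta(B, A, alpha: float):
    """Weight delta (alpha / r) * B @ A, where r is the number of rows of A."""
    scale = lora_scale(alpha, len(A))
    return [[scale * v for v in row] for row in matmul(B, A)]


# Toy shapes: d = 3, r = 1.
B = [[1.0], [2.0], [0.0]]      # d x r
A = [[0.5, 0.0, -1.0]]         # r x d
delta = lora_delta(B, A, alpha=2.0)

sft_scale = lora_scale(16, 32)    # Stage 2 settings: r=32, alpha=16
grpo_scale = lora_scale(32, 16)   # Stage 4 settings: r=16, alpha=32
```

With these settings the SFT adapter's update is scaled by 0.5 and the GRPO adapter's by 2.0, a fourfold difference that is separate from the learning rate.
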
### Stage 3 · Post-SFT trajectories
Because the SFT model is *ours*, we provisioned an A100 manually and ran inference via plain `transformers`, with no API in the loop. This produced the **n=64** Pool C trajectories used as the GRPO warm-start corpus and the SFT reference distribution in the CDF (Section 7, blue curve).
### Stage 4 · Online GRPO
`training/grpo_train.py` implements **on-policy GRPO** (Group Relative Policy Optimisation): K=4 rollouts per prompt with the current policy → within-group reward standardisation → clipped PPO-style loss with a KL penalty against a frozen reference model.
```python
# training/grpo_train.py: the actual update
ratio = torch.exp(plp - rlp.detach())
unclipped = ratio * adv
clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * adv
pg_loss = -torch.min(unclipped, clipped)
kl_loss = beta * (rlp.detach() - plp)
loss = (pg_loss + kl_loss).sum() / n_tokens
```
**Where:**
| Symbol | Meaning |
| --- | --- |
| `plp` | Per-token **log-probability** of the recorded assistant turn under the **policy** (current, trainable model). |
| `rlp` | Same per-token log-probability under the **reference** model (frozen base; `.detach()` blocks gradient). |
| `ratio = exp(plp − rlp)` | Importance-sampling ratio of policy / reference; equals `1.0` when they agree. |
| `adv` | The **advantage** for the segment, computed from the within-group return: `A_i = (R_i − μ_R) / (σ_R + ε)` where `R_i = terminal_reward + r_cross_i`, `μ_R, σ_R` are the mean/stdev of returns inside the K-rollout group, and `ε = 1e-6` for numerical stability. |
| `clip` (PPO ε) | Trust-region width: `0.2`. Caps how far `ratio` can move before the gradient is clipped. |
| `pg_loss` | Clipped policy-gradient loss (negative because we minimise). |
| `beta` (`β`) | KL penalty coefficient: `0.04`. Trades exploration vs. drift from the reference. |
| `kl_loss` | Per-token forward-KL approximation `β · (rlp − plp)`, pulling the policy toward the reference. |
| `n_tokens` | Total assistant tokens in the group; normalises so loss magnitude is independent of generation length. |
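
The within-group standardisation behind `adv` can be sketched in a few lines. This follows the formula in the table; the function name is hypothetical, the rewards are made up, and the use of the population standard deviation is an assumption (the training script may use the sample estimate instead).

```python
from statistics import fmean, pstdev


def group_advantages(terminal_rewards, r_cross, eps: float = 1e-6):
    """A_i = (R_i - mean) / (stdev + eps), with R_i = terminal_reward_i + r_cross_i."""
    returns = [t + c for t, c in zip(terminal_rewards, r_cross)]
    mu = fmean(returns)
    sigma = pstdev(returns)  # assumption: population stdev over the K rollouts
    return [(r - mu) / (sigma + eps) for r in returns]


# One K=4 group with made-up rewards.
advs = group_advantages([0.2, 0.4, 0.6, 0.8], [0.0, 0.1, 0.0, 0.3])

# A group where every rollout scores the same carries no learning signal.
flat = group_advantages([0.5, 0.5, 0.5, 0.5], [0.0, 0.0, 0.0, 0.0])
```

The `flat` case shows why `ε` matters: without it a zero-variance group would divide by zero, and with it the group simply contributes zero advantage.
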
Curriculum:
| Stage | Pool | Mode | What gets trained |
| --- | --- | --- | --- |
| 2 | A | `p1_only` | Ops policy only |
| 3 | B | `p2_only` | Code policy only (oracle handoff) |
| 4 | C | `joint` | Full P1 → P2 with `r_cross` on |
Two safety scaffolds in `training/variance_gate.py`:
- **Variance gate**: Stage 4 doesn't open until ≥4 tasks show stable `r_code` variance (stdev ≤ 0.15 over 64 samples).
- **`r_cross` warmup**: linear ramp 0 → 1 over the first 500 Stage-4 steps.
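
Both scaffolds are simple enough to sketch. The thresholds below are the ones quoted above; the function names, the dictionary layout, and the choice to look at the most recent 64 scores per task are assumptions, not the contents of `training/variance_gate.py`.

```python
from statistics import pstdev


def r_cross_weight(step: int, warmup_steps: int = 500) -> float:
    """Linear ramp from 0 to 1 over the first warmup_steps joint-training steps."""
    return min(1.0, max(0.0, step / warmup_steps))


def gate_open(r_code_by_task: dict[str, list[float]],
              min_tasks: int = 4, max_stdev: float = 0.15, window: int = 64) -> bool:
    """Open once enough tasks have a low-variance r_code over a full window."""
    stable = sum(
        1
        for scores in r_code_by_task.values()
        if len(scores) >= window and pstdev(scores[-window:]) <= max_stdev
    )
    return stable >= min_tasks


steady = {f"task_{i}": [0.5] * 64 for i in range(4)}
too_few = {f"task_{i}": [0.5] * 64 for i in range(3)}
noisy = {f"task_{i}": [0.0, 1.0] * 32 for i in range(4)}   # stdev 0.5 per task
```
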
| Setting | Value | What it controls |
| --- | --- | --- |
| LoRA | `r=16, α=32, dropout=0.05` on `{q,k,v,o}_proj` | Trainable adapter capacity (see Stage 2 box). |
| Learning rate | `1e-5` | AdamW step size on LoRA params only. |
| `β` (KL coeff) | `0.04` | Penalty pulling policy toward frozen reference; larger = more conservative. |
| `clip` (PPO ε) | `0.2` | Width of the trust region in the clipped surrogate. |
| Group size `K` | `4` | Rollouts per prompt used to compute within-group advantage. |
| Episodes / task | `64` | Per stage; split across the K-rollout groups. |
### Stage 5 · Merge
The smallest file in the repo and the one that makes everything deployable:
```python
# merge.py
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = "Qwen/Qwen2.5-7B-Instruct"
lora_model = "daemongg/qwen2.5-7b-sre-grpo"
output_repo = "Yaswanth-Bolla/qwen-merged"

# Load the base weights, attach the LoRA adapter, then fold it into the base weights.
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, lora_model)
model = model.merge_and_unload()
model.push_to_hub(output_repo)
```
The output is a vanilla causal LM that vLLM, TGI, or plain `transformers` can load with no idea it had adapters.
---
## 7 · Results
### Figure 1: Reward distribution (CDF)

> *Empirical CDF of cumulative reward; lower curve = better (more probability mass at high reward).*
- **Baseline** (green dashed, n=80): long left tail; ~40 % of rollouts under 0.75.
- **SFT** (blue, n=64): consistent, with fewer catastrophes and a modest median.
- **Posttrained RL** (red, n=100): dominates across nearly every quantile, with the steepest climb between 0.4 and 0.75; that's where GRPO concentrated mass.
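
For readers who want to reproduce this kind of plot from the saved trajectories, an empirical CDF and a dominance check can be sketched as below. The reward lists are made-up illustrations and not the post's data, and the helper names are hypothetical.

```python
from bisect import bisect_right


def ecdf(samples):
    """Return F where F(x) is the fraction of samples <= x."""
    xs = sorted(samples)
    n = len(xs)

    def cdf(x: float) -> float:
        return bisect_right(xs, x) / n

    return cdf


def dominates(better, worse, grid) -> bool:
    """True if `better` has a CDF at or below `worse` at every grid point."""
    f_better, f_worse = ecdf(better), ecdf(worse)
    return all(f_better(x) <= f_worse(x) for x in grid)


# Illustrative rewards only.
base_rewards = [0.1, 0.3, 0.5, 0.9]
rl_rewards = [0.4, 0.8, 1.2, 1.6]
grid = [i / 10 for i in range(21)]   # 0.0, 0.1, ..., 2.0

base_at_half = ecdf(base_rewards)(0.5)
rl_dominates = dominates(rl_rewards, base_rewards, grid)
```

"Lower curve is better" corresponds to the `<=` in `dominates`: at any reward threshold, a smaller fraction of the better model's episodes fall at or below it.
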
### Figure 2: Efficiency curve (reward vs. steps)

| Model | Mean reward by ~30 steps | Steps to plateau | σ at plateau |
| --- | --- | --- | --- |
| Baseline | ~0.20 | never within 60 steps | wide |
| SFT | ~0.95 | ~50 steps | medium |
| **Posttrained RL** | **~1.59** | **~25 steps** | **tight** |
> ✦ **The operationally meaningful number isn't the +1.10 reward; it's that the post-trained model gets there in *half the steps*.** Fewer pages, less time-to-resolution.
### Component breakdown: Pool C (oracle-independent grader, n ≈ 100)
| Metric | Base | RL | Δ |
| --- | --- | --- | --- |
| `mean_final` | 0.4495 | 0.4537 | ▲ 0.0042 |
| `mean_p1_steps` | 16.62 | 15.75 | ▼ 0.87 |
| `mean_p2_steps` | 5.62 | 6.50 | ▲ 0.88 |
| `mean_r_cross` | 0.4412 | 0.4662 | ▲ 0.025 |
> The oracle-independent grader's `mean_final` moves only marginally on Pool C. The visible win is in **cumulative reward**, **CDF dominance**, and **`r_cross`** (+0.025), which is the actual training signal we cared about. The +0.88 P2-steps shift is intentional: the RL model learned to *use* the code workspace before patching, instead of one-shotting a wrong diff.
### Held-out: Pool D (n ≈ 16)
| Metric | Base | RL | Δ |
| --- | --- | --- | --- |
| `mean_final` | 0.5565 | 0.5284 | ▼ 0.0281 |
| Pearson r (P2 breadth) | +0.4951 | −0.3637 | ▼ 0.8588 |
> ⚠ **We're flagging this honestly.** On the two compositional held-out scenarios, RL is slightly worse than baseline. The strong negative Pearson on P2 breadth tells us why: the RL model commits to a narrow code search early; on truly novel compounds, the base model's naïve breadth-first browsing is a better strategy. Fix path is in §9.
---
## 8 · Ablations
### A · `r_cross` on vs. off: the most informative knob
| Condition | Δ `mean_final` (FT − Base) | Δ `mean_r_cross` |
| --- | --- | --- |
| `r_cross_on` | **▲ 0.0256** | ▲ 0.169 |
| `r_cross_off` | ▲ 0.0054 | 0 |
> Without the counterfactual reward, the fine-tuning gap shrinks ~80 % (0.0256 → 0.0054). Phase 1 has no incentive to produce a *useful* belief if nothing rewards it for how much that belief helps Phase 2.
### B · Stopping behaviour shifts by allocation, not total
The fine-tuned model transitions to Phase 2 **0.87 steps earlier** and spends **0.88 steps more inside Phase 2**. Net step count is roughly flat, but the *budget allocation* improved. Less dashboard, more code.
### C · Source-type contribution
| Source removed | Δ `mean_final` (Pool C) |
| --- | --- |
| Logs only | ▼ 0.04 |
| Metrics only | ▼ 0.07 |
| Git log + diffs | ▼ 0.13 |
| Mini-repo file tree | ▼ 0.18 |
> Code attribution is the single biggest contributor. Take away the repo and the agent loses ~40 % of its lift.
### D · Convergence proxy
| Metric | Fine-tuned | Base |
| --- | --- | --- |
| Early-window mean_final | 0.7475 | 0.6425 |
| Late-window mean_final | 0.4255 | 0.4620 |
> Fine-tuned starts hotter and decays, which suggests it has memorised some training-distribution heuristics. Consistent with the Pool D regression. This is the clearest place to push next.
---
## 9 · Limitations & honest caveats
- **Pool D regression.** RL underperforms base by 0.028 on held-out compounds. Fix: Pool-D-shaped curriculum data + entropy bonus.
- **Calibration regresses.** ECE 0.58 → 0.81: RL is more confident without being more correct. The `BeliefState` aux-loss in `training/belief_aux_loss.py` is the place to wire it back in.
- **Sample sizes are honest, not heroic.** Baseline n=80, SFT n=64, RL n=100; held-out n=16. Take the held-out number as directional.
- **No code execution.** Phase 2 is read-only. Adding a sandboxed `pytest` action would close the largest fraction of remaining capability gap.
- **Minimal system prompt.** A more elaborate scratchpad/belief-state prompt likely closes the SFT-to-RL gap further. We'd consider that a *positive* signal for the environment.
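
For reference on the calibration bullet above, a common way to compute expected calibration error (ECE) is the equal-width binned estimate sketched here. The post does not say which ECE variant or bin count was used, so the ten bins, the function name, and the toy inputs are all assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Weighted mean of |avg confidence - accuracy| over equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Toy case: always 90 % confident, right three times out of four.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
calibrated = expected_calibration_error([1.0, 1.0], [1, 1])
```

In the toy case the gap between stated confidence (0.9) and accuracy (0.75) is the whole ECE; a model that grows more confident without growing more accurate moves this number up.
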
---
## 10 · Closing
We set out to answer one question: *can a small open model, trained against a faithful incident-response simulator, become competitively useful at SRE triage?*
On the training distribution: **yes, clearly.** On novel compounds: **not yet, but the training signal we built (`r_cross`) and the curriculum that uses it are correctly oriented toward fixing that.** And the most durable artefact from this submission isn't the score; it's the stack:
| Artefact | Where |
| --- | --- |
| OpenEnv environment | `incident_env` (this repo) |
| Hosted Space | `meta-hf-hackathon-updated-policy.hf.space` |
| LoRA adapter | `daemongg/qwen2.5-7b-sre-grpo` |
| Merged model | `Yaswanth-Bolla/qwen-merged` |
| Trajectories | `sre_*_dataset.jsonl` (in repo) |
| Training scripts | `sft.py`, `training/grpo_train.py`, `merge.py` |
---