Teaching a 7B Model to Be On-Call
An OpenEnv benchmark and a four-stage GRPO pipeline that turns Qwen2.5-7B into a working SRE triage agent
TL;DR. We built `incident_env` — an OpenEnv POMDP where an LLM agent has to diagnose a live, evolving production incident and then attribute it to a specific commit in a small repo. Then we trained Qwen2.5-7B-Instruct through a four-stage curriculum (baseline rollouts → LoRA SFT → online GRPO with `r_cross` → merge). The post-trained model reaches a mean cumulative reward of ≈1.59 vs ≈0.49 for the base, at less than half the steps, with tighter variance and dominant CDF across the operating range.
🧭 One-line pitch. Most agent benchmarks freeze a repo and ask the model to fix it. Our environment refuses to sit still — memory climbs, alerts cascade, and the obvious symptom is almost never the cause.
1 · Why this benchmark didn't exist yet
Pick any list of agentic LLM benchmarks today and you'll see two clusters:
| Cluster | Examples | What they miss |
|---|---|---|
| Frozen-repo coding | SWE-bench, RepoBench, HumanEval | No evolving system, no observability, no alerts |
| Tool-use chains | AgentBench, ToolBench, τ-bench | Plenty of API calls, but no reactive simulator |
Neither cluster matches the workflow that consumes the most engineer-hours at any company running real systems: on-call triage. A pager fires. A graph is wrong. Three services look broken but only one is broken. Someone has to triangulate, propose a fix, and identify the offending commit — under SLA pressure, with partial information.
That gap is exactly what incident_env fills.
✦ Capability gap. Today's LLMs can read a static repo. They cannot yet diagnose a system whose state changes while they're looking at it.
2 · Environment at a glance
incident_env is an OpenEnv Environment — clean Gym-style reset() / step() / state plus a /score endpoint for the oracle-independent grader. Under the hood it is a reactive, partially-observable, two-phase simulator.
Topology — seven reactive services
┌─────────┐    ┌─────┐    ┌────────┐    ┌─────────┐
│ API GW  │───▶│Auth │───▶│ Orders │───▶│ Payment │
└────┬────┘    └─────┘    └───┬────┘    └────┬────┘
     ▼                        ▼              ▼
┌─────────┐              ┌─────────┐    ┌─────────┐
│  Cache  │              │   DB    │    │  Queue  │
└─────────┘              └─────────┘    └─────────┘
Each service has live metric history (CPU, memory, p50/p95/p99 latency, error rate, RPS), structured logs, deploy history, and a healthy | degraded | down status. Faults propagate along this graph each tick(). Restarting a downstream service buys minutes; rolling back the wrong deploy makes things worse.
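To make the propagation idea concrete, here is a minimal sketch. It assumes the edges read off the diagram above point from caller to callee, and that a service degrades when something it depends on is unhealthy. The service keys and the one-shot `propagate` helper are illustrative, not the environment's real `tick()` logic.

```python
from collections import deque

# Caller -> callees, read off the topology diagram (an assumption about edge direction).
CALLS = {
    "api_gw": ["auth", "cache"],
    "auth": ["orders"],
    "orders": ["payment", "db"],
    "payment": ["queue"],
    "cache": [],
    "db": [],
    "queue": [],
}


def callers_of(service):
    """Services that depend directly on `service`."""
    return [name for name, deps in CALLS.items() if service in deps]


def propagate(root_cause):
    """Return {service: status} once the fault has spread to every upstream caller."""
    status = {name: "healthy" for name in CALLS}
    status[root_cause] = "down"
    frontier = deque([root_cause])
    while frontier:
        current = frontier.popleft()
        for caller in callers_of(current):
            if status[caller] == "healthy":
                status[caller] = "degraded"
                frontier.append(caller)
    return status


status = propagate("db")
```

With `db` as the root cause, three upstream services end up degraded while `db` itself is the only one that is down, which is the "three services look broken but only one is broken" situation from Section 1.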
The agent loop
Per-step execution is validate → mutate → tick → observe → reward. Two facts make the loop interesting:
- The observation never exposes `fault_type`, the `is_bad` deploy flag, or any internal simulation state. The agent infers from symptoms.
- The action space is hierarchical and masked. `valid_actions[]` is recomputed every step, so illegal actions (e.g. rollback on a service with no deploy history) are flagged with a `-0.05` penalty.
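The validate → mutate → tick → observe → reward order and the action mask can be sketched with a toy class. `ToyIncidentEnv` is a hypothetical stand-in, not the real `incident_env` implementation. Whether an invalid action still advances the simulation is not stated in this post, so the sketch assumes it does not.

```python
INVALID_PENALTY = -0.05  # penalty for illegal actions quoted above
STEP_COST = -0.02        # per-step efficiency cost from the reward table in Section 4


class ToyIncidentEnv:
    """Hypothetical toy environment; only illustrates the step order and the mask."""

    def __init__(self):
        # Hidden state: deploy history per service (never shown to the agent directly).
        self.deploys = {"orders": ["deploy-1"], "cache": []}
        self.ticks = 0

    def valid_actions(self):
        actions = [("check_metrics", svc) for svc in self.deploys]
        # Rollback is only legal where there is something to roll back.
        actions += [("rollback_deploy", svc) for svc, hist in self.deploys.items() if hist]
        return actions

    def step(self, action):
        # 1. validate against the current mask
        if action not in self.valid_actions():
            return self._observe(), INVALID_PENALTY
        # 2. mutate
        name, service = action
        if name == "rollback_deploy":
            self.deploys[service].pop()
        # 3. tick the simulation clock
        self.ticks += 1
        # 4. observe, 5. reward
        return self._observe(), STEP_COST

    def _observe(self):
        # The mask is recomputed on every observation.
        return {"tick": self.ticks, "valid_actions": self.valid_actions()}


env = ToyIncidentEnv()
_, bad_reward = env.step(("rollback_deploy", "cache"))   # no deploy history: illegal
obs, ok_reward = env.step(("rollback_deploy", "orders"))  # legal exactly once
```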
3 · Two-phase action design (this is the novel bit)
Most environments give the agent one type of tool. Ours gives it two — and forces a deliberate transition between them.
stateDiagram-v2
[*] --> Phase1
state Phase1 {
[*] --> Investigating
Investigating --> Investigating : view_alerts / query_logs / check_metrics<br/>check_dependencies / check_deploy_history<br/>run_health_check
Investigating --> Remediating : restart_service / rollback_deploy / scale_service
Remediating --> Investigating
Investigating --> Declared : declare_root_cause
}
Phase1 --> Phase2 : transition_to_phase2(belief)
state Phase2 {
[*] --> Exploring
Exploring --> Exploring : list_dir / read_file / search_code<br/>get_git_log / get_file_diff
Exploring --> Patched : propose_patch / declare_no_change
}
Patched --> [*]
Declared --> [*]
Phase 1 — ops investigation
The same tools an SRE has at 3 AM, plus a transition_to_phase2 control action that hands a structured BeliefState over to Phase 2:
| Action | Category | Purpose |
|---|---|---|
| `view_alerts` | diagnostic | List firing alerts |
| `query_logs` | diagnostic | Filter by service/level/keyword |
| `check_metrics` | diagnostic | 30-min time series |
| `check_dependencies` | diagnostic | Up/downstream graph |
| `check_deploy_history` | diagnostic | Recent deploys |
| `run_health_check` | diagnostic | Ping a service |
| `restart_service` | remediation | Temporary fix |
| `rollback_deploy` | remediation | Real fix if root cause |
| `scale_service` | remediation | More replicas |
| `declare_root_cause` | terminal | Diagnosis string |
| `transition_to_phase2` | control | Hand off to code attribution |
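One way the structured handoff could look in code is sketched below. The four field names come from this post; the field types, the example values, and the shape of the `handoff` dict are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class BeliefState:
    """Field names follow the post; the types are assumptions."""
    suspected_service: str
    fault_class: str
    confidences: dict = field(default_factory=dict)
    evidence_gaps: list = field(default_factory=list)


belief = BeliefState(
    suspected_service="orders",
    fault_class="memory_leak",
    confidences={"suspected_service": 0.7, "fault_class": 0.5},
    evidence_gaps=["deploy history for db not checked"],
)

# Hypothetical action payload for the control action.
handoff = {"action": "transition_to_phase2", "belief": asdict(belief)}
```

Keeping the belief as plain data means it can be logged next to the trajectory and inspected later.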
Phase 2 — code attribution
When a scenario has a code_context, the env spins up a sandboxed CodeWorkspace over a bundled mini-repo:
snapshots/<scenario>/
tree/ ← actual source files
git_log.json ← commits (sha, author, date, msg, files)
diffs/<sha>.patch ← unified diff per commit
Seven new actions appear (five read-only exploration tools plus two terminal actions), all sandboxed (no `..`, no symlinks, no real subprocess):
| Action | What it returns |
|---|---|
| `list_dir` | files + subdirs at a relative path |
| `read_file` | up to 64 KB of file contents |
| `search_code` | grep across the tree, capped at 50 hits |
| `get_git_log` | commit metadata for a path |
| `get_file_diff` | unified diff for (commit_sha, path) |
| `propose_patch` | terminal — submit a unified diff |
| `declare_no_change` | terminal — for spurious-issue scenarios |
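The sandbox rules (no `..`, no symlinks, 64 KB read cap) can be approximated with `pathlib`. This is a sketch under those stated constraints, not the actual `CodeWorkspace` code. `safe_resolve` and this `read_file` are hypothetical helpers, and the demo file contents are made up.

```python
import tempfile
from pathlib import Path

MAX_READ_BYTES = 64 * 1024  # read_file cap from the table above


def safe_resolve(root: Path, relative: str) -> Path:
    """Map a relative path into the sandbox, rejecting '..', absolute paths and symlinks."""
    root = root.resolve()
    rel = Path(relative)
    if rel.is_absolute() or ".." in rel.parts:
        raise PermissionError(f"path escapes sandbox: {relative}")
    current = root
    for part in rel.parts:
        current = current / part
        if current.is_symlink():
            raise PermissionError(f"symlinks are not allowed: {relative}")
    return current


def read_file(root: Path, relative: str) -> str:
    """Return at most MAX_READ_BYTES of a file inside the sandbox."""
    path = safe_resolve(root, relative)
    with path.open("rb") as handle:
        data = handle.read(MAX_READ_BYTES)
    return data.decode("utf-8", errors="replace")


with tempfile.TemporaryDirectory() as tmp:
    tree = Path(tmp)
    (tree / "orders").mkdir()
    (tree / "orders" / "auth_client.py").write_bytes(b"TIMEOUT = 5\n")
    contents = read_file(tree, "orders/auth_client.py")
    try:
        read_file(tree, "../etc/passwd")
        escaped = True
    except PermissionError:
        escaped = False
```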
✦ Why two phases? Real triage is two phases. Mixing them in one action soup forces the agent to learn a strategy: gather enough Phase-1 evidence to make Phase-2 cheap, but don't dawdle. This single design decision is what gives `r_cross` (Section 4) something meaningful to reward.
4 · Reward design — two layers, kept separate by design
┌───────────────────────────────────────────────────────────────┐
│ LAYER 1 · Per-step shaped reward (TRAINING ONLY) │
│ peeks at hidden state to give a useful gradient │
├───────────────────────────────────────────────────────────────┤
│ diagnostic on involved svc +0.15 │
│ diagnostic on uninvolved svc +0.05 │
│ remediation on root-cause svc +0.30 │
│ correct root cause declaration +0.40 │
│ per-step efficiency cost −0.02 │
│ repeat / invalid −0.05 │
│ wrong-target remediation −0.15 │
└───────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────┐
│ LAYER 2 · Oracle-independent grader (EVALUATION) │
│ sees only the trajectory + declared patch │
├───────────────────────────────────────────────────────────────┤
│ p1_rca 25 % keyword/AST match │
│ p1_efficiency 15 % fewer steps to declare │
│ patch_quality 35 % file overlap + AST + syntax │
│ no_change_detection 25 % spurious-issue scenarios │
│ p2_efficiency 25 % used when valid issue │
└───────────────────────────────────────────────────────────────┘
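Read as code, the Layer 1 table in the diagram above is a lookup plus a per-step cost. The sketch assumes the efficiency cost is added to every step on top of the event bonus; the event keys are hypothetical labels for the table rows.

```python
# Values copied from the Layer 1 table; the key names are illustrative.
SHAPED = {
    "diagnostic_involved": 0.15,
    "diagnostic_uninvolved": 0.05,
    "remediation_root_cause": 0.30,
    "correct_declaration": 0.40,
    "repeat_or_invalid": -0.05,
    "wrong_target_remediation": -0.15,
}
STEP_COST = -0.02


def shaped_reward(event: str) -> float:
    """Event bonus or penalty plus the per-step cost (additivity is an assumption)."""
    return round(SHAPED[event] + STEP_COST, 4)


episode = [
    "diagnostic_uninvolved",
    "diagnostic_involved",
    "remediation_root_cause",
    "correct_declaration",
]
total = round(sum(shaped_reward(event) for event in episode), 4)
```

Under that assumption, this four-step episode totals 0.82: the bonuses sum to 0.90 and the four step costs remove 0.08.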
Patch quality has three tiers: file overlap (Jaccard), AST-level hunk similarity, and syntax validity — none of which read hidden state. Saved trajectories can be re-graded months later from a JSONL file alone.
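A rough sketch of the grader arithmetic follows. The Jaccard overlap is standard. The weighting assumes `no_change_detection` and `p2_efficiency` are alternates that share the last 25 %, which is one reading of the Layer 2 table. For brevity the example uses file overlap alone as a stand-in for the three-tier `patch_quality`, and `orders/api.py` is a made-up path.

```python
def file_overlap(proposed_files, reference_files) -> float:
    """Jaccard similarity between the files a patch touches and the reference files."""
    proposed, reference = set(proposed_files), set(reference_files)
    if not proposed and not reference:
        return 1.0
    return len(proposed & reference) / len(proposed | reference)


def final_score(components: dict, spurious: bool) -> float:
    """Weighted sum of grader components; the choice of fourth term is an assumption."""
    fourth = "no_change_detection" if spurious else "p2_efficiency"
    weights = {"p1_rca": 0.25, "p1_efficiency": 0.15, "patch_quality": 0.35, fourth: 0.25}
    return sum(weight * components.get(name, 0.0) for name, weight in weights.items())


overlap = file_overlap(
    ["orders/auth_client.py", "orders/api.py"],  # files touched by the proposed patch
    ["orders/auth_client.py"],                    # files touched by the reference fix
)
score = final_score(
    {"p1_rca": 1.0, "p1_efficiency": 0.5, "patch_quality": overlap, "p2_efficiency": 0.5},
    spurious=False,
)
```

Nothing here reads simulator state, only the trajectory's declared outputs, which is the property that lets a saved JSONL file be re-graded later.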
r_cross — the counterfactual that makes joint training work
r_cross(τ) = max(0, r_code(τ_2 | context(τ_1)) − r_code(τ_2 | ∅))
Where:
| Symbol | Meaning |
|---|---|
| τ (tau) | A full episode trajectory (a sequence of observation–action–reward steps). |
| τ_1 | The Phase-1 sub-trajectory of τ (ops investigation steps only). |
| τ_2 | The Phase-2 sub-trajectory of τ (code-attribution steps only). |
| `r_code(...)` | The Phase-2 grader score (patch quality + no-change detection), in [0, 1]. |
| `context(τ_1)` | The structured belief handed off from Phase 1 to Phase 2 (suspected service, fault class, confidences, evidence gaps). |
| ∅ (null context) | An empty handoff — Phase 2 starts with no Phase-1 evidence. Score measured separately on Pool B. |
| `max(0, ·)` | Clamp to non-negative; we never punish Phase 1 for inherently hard bugs. |
| − | Counterfactual difference: how much did Phase 1 actually help? |
In English: how much did Phase 1's investigation actually help the code agent vs. starting from a null context? r_cross is what makes the joint training signal meaningful — without it, Phase 1 has no incentive to produce a useful handoff, only a plausible one. We will show in the ablations that turning r_cross off collapses ~80 % of the lift.
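As a worked instance of the formula, the clamp and the difference fit in a few lines. The two scores passed in are made-up numbers; in the post's setup the null-context score comes from separate Pool B measurements.

```python
def r_cross(r_code_with_context: float, r_code_null: float) -> float:
    """Counterfactual lift of Phase 2 from the Phase-1 handoff, clamped at zero."""
    return max(0.0, r_code_with_context - r_code_null)


helped = r_cross(0.70, 0.40)   # the handoff improved the patch score
no_help = r_cross(0.20, 0.50)  # a hard bug: clamped, so Phase 1 is not punished
```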
5 · Scenario flavours
| Task | Hidden lesson |
|---|---|
| `memory_leak` | Single service, noisy metric — restart only buys minutes |
| `cascading_failure` | Loud services aren't the cause — must walk the dep graph |
| `distributed_deadlock` | Three remediation actions, in a specific order |
| `aliased_fault` | Queue worker leaks like a memory leak — symptoms alias |
| `severity_inversion` | SEV1 page, two-line fix in `orders/auth_client.py` |
| `confidence_inversion` | Loud alerts on the wrong service; real bug is a lock-ordering issue |
| `info_ordering` | Decisive evidence shows up late — early committers lose |
| `circuit_breaker_noop` | Spurious issue; the right answer is `declare_no_change` |
| `heldout_*` (×2) | Compounds of the above; never seen during training |
6 · The training pipeline
Architecture — what GRPO is actually optimising
Before the stage-by-stage detail, here is the architectural view: a three-level hierarchy with an orchestrator routing policy on top, two specialised subagents below it, and segment-level GRPO with cross-phase reward propagation underneath both.
Three things to notice in this picture:
- The orchestrator owns the stopping criterion. Deciding when Phase 1 has gathered enough evidence to hand off is a learned policy, not a rule. The orchestrator emits a structured `BeliefState` (`suspected_service`, `fault_class`, confidences, `evidence_gaps`) at every transition decision — making the criterion auditable and supervisable.
- The subagents are specialised but share weights. P1 (ops) and P2 (code) are the same Qwen2.5-7B-Instruct LoRA adapter prompted differently per phase. We train them in pool-isolated stages first, then jointly with `r_cross` switched on.
- The reward signal is segment-level, not trajectory-level. Episodes are 8–16 k tokens; one scalar reward over the whole thing dilutes credit. Each phase becomes its own GRPO group; `r_cross` is added to the Phase-1 group return with stop-gradient on the Phase-2 path (`training/segment_grpo.py`). That single architectural choice is what lets joint training avoid poisoning Phase-1 gradients with Phase-2 noise.
The big picture (rendered SVG at the top of the post) shows the data flow Base → SFT → GRPO → Merge. The diagram above shows the gradient flow that lives inside the GRPO box. Stage-by-stage detail below — kept tight.
Stage 1 · Baseline rollouts
sre_finetune_collector.py drives the deployed environment over the HuggingFace Inference API (Qwen/Qwen2.5-7B-Instruct:fastest). Episodes are sampled across all four pools with weights A=0.35, B=0.20, C=0.35, D=0.10. Negative-reward episodes are kept as hard negatives — there's no quality filter on rollouts.
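The pool mix can be reproduced with a weighted draw. This is a sketch of the sampling idea only; `sample_pools` and the fixed seed are illustrative and not taken from `sre_finetune_collector.py`.

```python
import random
from collections import Counter

POOL_WEIGHTS = {"A": 0.35, "B": 0.20, "C": 0.35, "D": 0.10}  # weights quoted above


def sample_pools(n: int, seed: int = 0) -> list:
    """Draw n pool labels according to POOL_WEIGHTS (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.choices(list(POOL_WEIGHTS), weights=list(POOL_WEIGHTS.values()), k=n)


counts = Counter(sample_pools(10_000))
share_a = counts["A"] / 10_000
```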
Three artefacts written incrementally:
- `sre_raw_trajectories.jsonl` — full episode + score breakdown
- `sre_sft_dataset.jsonl` — one row per (observation, action) step
- `sre_grpo_dataset.jsonl` — (prompt, chosen, rejected) preference pairs
Stage 2 · LoRA SFT (TRL)
Built on TRL's SFTTrainer with PEFT/LoRA — the minimum-requirements training stack named in RULES.md.
# sft.py
trainer = SFTTrainer(
model = model, # Qwen2.5-7B-Instruct
args = training_args, # bf16, packing on
train_dataset = dataset[script_args.dataset_train_split],
eval_dataset = dataset[script_args.dataset_test_split],
peft_config = get_peft_config(model_args), # LoRA: r=32, α=16
)
trainer.train()
| Setting | Value |
|---|---|
| Base | Qwen/Qwen2.5-7B-Instruct |
| LoRA | r=32, α=16, dropout=0.05 on {q,k,v,o}_proj |
| LR / epochs | 2e-4 / 1 |
| Effective batch | 2 × 8 accum = 16 |
| Precision | bf16 + packing |
| Hardware | 1× A100-40GB |
LoRA notation. `r` is the rank of the low-rank update matrices `A ∈ ℝ^{r×d}`, `B ∈ ℝ^{d×r}` injected into each target linear; the effective weight delta is `ΔW = (α/r) · B A` (a `d×d` matrix), so `α` is a scaling coefficient (not a learning rate). `dropout` is applied to the input of `A` during training. Target modules `{q,k,v,o}_proj` are the four attention-projection linears in each transformer block.
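A quick arithmetic check of the notation: the scaling factor is `α/r`, and a rank-`r` adapter on a `d_out × d_in` linear adds `r · (d_in + d_out)` trainable parameters. The width 4096 below is purely illustrative and is not read from the Qwen2.5 config.

```python
def lora_scaling(r: int, alpha: int) -> float:
    """Multiplier applied to B @ A before it is added to the frozen weight."""
    return alpha / r


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """A has shape (r, d_in) and B has shape (d_out, r)."""
    return r * d_in + d_out * r


sft_scale = lora_scaling(r=32, alpha=16)    # Stage 2 settings from the table above
grpo_scale = lora_scaling(r=16, alpha=32)   # GRPO settings listed under Stage 4
adapter_params = lora_trainable_params(d_in=4096, d_out=4096, r=32)
full_params = 4096 * 4096
fraction = adapter_params / full_params
```

For this illustrative square layer the adapter is about 1.6 % of the dense weight count.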
Stage 3 · Post-SFT trajectories
Because the SFT model is ours, we provisioned an A100 manually and ran inference via plain transformers — no API. This produced the n=64 Pool C trajectories used as the GRPO warm-start corpus and the SFT reference distribution in the CDF (Section 7, blue curve).
Stage 4 · Online GRPO
training/grpo_train.py implements on-policy GRPO (Group Relative Policy Optimisation): K=4 rollouts per prompt with the current policy → within-group reward standardisation → clipped PPO-style loss with a KL penalty against a frozen reference model.
# training/grpo_train.py — the actual update
ratio = torch.exp(plp - rlp.detach())
unclipped = ratio * adv
clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * adv
pg_loss = -torch.min(unclipped, clipped)
kl_loss = beta * (rlp.detach() - plp)
loss = (pg_loss + kl_loss).sum() / n_tokens
Where:
| Symbol | Meaning |
|---|---|
| `plp` | Per-token log-probability of the recorded assistant turn under the policy (current, trainable model). |
| `rlp` | Same per-token log-probability under the reference model (frozen base; `.detach()` blocks gradient). |
| `ratio = exp(plp − rlp)` | Importance-sampling ratio of policy / reference — equals 1.0 when they agree. |
| `adv` | The advantage for the segment, computed from the within-group return: `A_i = (R_i − μ_R) / (σ_R + ε)` where `R_i = terminal_reward + r_cross_i`, `μ_R`, `σ_R` are the mean/stdev of returns inside the K-rollout group, and `ε = 1e-6` for numerical stability. |
| `clip` (PPO ε) | Trust-region width: 0.2. Caps how far `ratio` can move before the gradient is clipped. |
| `pg_loss` | Clipped policy-gradient loss (negative because we minimise). |
| `beta` (β) | KL penalty coefficient: 0.04. Trades exploration vs. drift from the reference. |
| `kl_loss` | Per-token forward-KL approximation `β · (rlp − plp)`, pulling the policy toward the reference. |
| `n_tokens` | Total assistant tokens in the group — normalises so loss magnitude is independent of generation length. |
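The within-group standardisation behind `adv` can be sketched in plain Python. The formula follows the table above; using the population standard deviation is an assumption, since the post does not say which estimator `training/grpo_train.py` uses, and the rewards are made up.

```python
from statistics import fmean, pstdev

EPS = 1e-6  # epsilon from the table above


def group_advantages(terminal_rewards, r_cross_values):
    """A_i = (R_i - mean) / (stdev + eps), with R_i = terminal_reward + r_cross_i."""
    returns = [t + c for t, c in zip(terminal_rewards, r_cross_values)]
    mu = fmean(returns)
    sigma = pstdev(returns)  # population stdev: an assumption
    return [(r - mu) / (sigma + EPS) for r in returns]


# One K=4 group of rollouts for the same prompt.
adv = group_advantages([0.2, 0.4, 0.6, 0.8], [0.0, 0.1, 0.0, 0.1])
flat = group_advantages([0.5, 0.5, 0.5, 0.5], [0.0, 0.0, 0.0, 0.0])
```

A group whose rollouts all score the same yields zero advantage everywhere, so it contributes no policy-gradient signal.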
Curriculum:
| Stage | Pool | Mode | What gets trained |
|---|---|---|---|
| 2 | A | `p1_only` | Ops policy only |
| 3 | B | `p2_only` | Code policy only (oracle handoff) |
| 4 | C | `joint` | Full P1 → P2 with `r_cross` on |
Two safety scaffolds in training/variance_gate.py:
- Variance gate — Stage 4 doesn't open until ≥4 tasks show stable `r_code` variance (stdev ≤ 0.15 over 64 samples).
- `r_cross` warmup — linear ramp 0 → 1 over the first 500 Stage-4 steps.
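Both scaffolds reduce to small pure functions. The thresholds are the ones quoted above; the function names, the dict-of-lists input, and the use of population standard deviation are assumptions, not the contents of `training/variance_gate.py`.

```python
from statistics import pstdev


def gate_open(r_code_by_task: dict, max_stdev=0.15, min_tasks=4, window=64) -> bool:
    """True once at least `min_tasks` tasks have a full window of low-variance r_code."""
    stable = 0
    for samples in r_code_by_task.values():
        recent = samples[-window:]
        if len(recent) == window and pstdev(recent) <= max_stdev:
            stable += 1
    return stable >= min_tasks


def r_cross_weight(step: int, warmup_steps: int = 500) -> float:
    """Linear ramp from 0 to 1 over the first `warmup_steps` Stage-4 steps."""
    return min(1.0, max(0.0, step / warmup_steps))


steady = {f"task_{i}": [0.5] * 64 for i in range(4)}  # made-up, perfectly stable scores
too_few = {f"task_{i}": [0.5] * 64 for i in range(3)}
```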
| Setting | Value | What it controls |
|---|---|---|
| LoRA | r=16, α=32, dropout=0.05 on `{q,k,v,o}_proj` | Trainable adapter capacity (see Stage 2 box). |
| Learning rate | 1e-5 | AdamW step size on LoRA params only. |
| β (KL coeff) | 0.04 | Penalty pulling policy toward frozen reference; larger = more conservative. |
| clip (PPO ε) | 0.2 | Width of the trust region in the clipped surrogate. |
| Group size K | 4 | Rollouts per prompt used to compute within-group advantage. |
| Episodes / task | 64 | Per stage; split across the K-rollout groups. |
Stage 5 · Merge
The smallest file in the repo and the one that makes everything deployable:
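# merge.py
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = "Qwen/Qwen2.5-7B-Instruct"
lora_model = "daemongg/qwen2.5-7b-sre-grpo"
output_repo = "Yaswanth-Bolla/qwen-merged"

# Load the base weights in fp16, then attach the trained LoRA adapter.
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, lora_model)
# Fold the adapter deltas into the base weights and drop the PEFT wrappers.
model = model.merge_and_unload()
model.push_to_hub(output_repo)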
The output is a vanilla causal LM that vLLM, TGI, or plain transformers can load with no idea it had adapters.
7 · Results
Figure 1 — Reward distribution (CDF)
Empirical CDF of cumulative reward — lower curve = better (more probability mass at high reward).
- Baseline (green dashed, n=80): long left tail; ~40 % of rollouts under 0.75.
- SFT (blue, n=64): consistent — fewer catastrophes, modest median.
- Posttrained RL (red, n=100): dominates across nearly every quantile, with the steepest climb between 0.4 and 0.75 — that's where GRPO concentrated mass.
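For readers who want to rebuild a curve like this from saved trajectories, an empirical CDF is a sort and a running fraction. The rewards below are made-up numbers, not the post's data.

```python
def ecdf(rewards):
    """Return (reward, cumulative fraction) pairs for an empirical CDF."""
    ordered = sorted(rewards)
    n = len(ordered)
    return [(value, (index + 1) / n) for index, value in enumerate(ordered)]


def fraction_below(rewards, threshold: float) -> float:
    """Share of rollouts whose cumulative reward is under `threshold`."""
    return sum(reward < threshold for reward in rewards) / len(rewards)


rewards = [0.1, 0.3, 0.6, 0.9, 1.2]  # illustrative only
curve = ecdf(rewards)
left_tail = fraction_below(rewards, 0.75)
```

A statement like "~40 % of rollouts under 0.75" corresponds to `fraction_below(rewards, 0.75)` evaluated on the real baseline rewards.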
Figure 2 — Efficiency curve (reward vs. steps)
| Model | Mean reward by ~30 steps | Steps to plateau | σ at plateau |
|---|---|---|---|
| Baseline | ~0.20 | never within 60 steps | wide |
| SFT | ~0.95 | ~50 steps | medium |
| Posttrained RL | ~1.59 | ~25 steps | tight |
✦ The operationally meaningful number isn't the +1.10 reward — it's that the post-trained model gets there in half the wall-clock steps. Fewer pages, less time-to-resolution.
Component breakdown — Pool C (oracle-independent grader, n ≈ 100)
| Metric | Base | RL | Δ |
|---|---|---|---|
| `mean_final` | 0.4495 | 0.4537 | ▲ 0.0042 |
| `mean_p1_steps` | 16.62 | 15.75 | ▼ 0.87 |
| `mean_p2_steps` | 5.62 | 6.50 | ▲ 0.88 |
| `mean_r_cross` | 0.4412 | 0.4662 | ▲ 0.025 |
The oracle-independent grader's `mean_final` moves only marginally on Pool C — the visible win is in cumulative reward, CDF dominance, and `r_cross` (+0.025), which is the actual training signal we cared about. The +0.88 P2-steps shift is intentional: the RL model learned to use the code workspace before patching, instead of one-shotting a wrong diff.
Held-out — Pool D (n ≈ 16)
| Metric | Base | RL | Δ |
|---|---|---|---|
| `mean_final` | 0.5565 | 0.5284 | ▼ 0.0281 |
| Pearson r (P2 breadth) | +0.4951 | −0.3637 | ▼ 0.8588 |
⚠ We're flagging this honestly. On the two compositional held-out scenarios, RL is slightly worse than baseline. The strong negative Pearson on P2 breadth tells us why: the RL model commits to a narrow code search early; on truly novel compounds, the base model's naïve breadth-first browsing is a better strategy. Fix path is in §9.
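The Pearson figure can be recomputed from per-episode pairs. The sketch assumes "P2 breadth" means something like the number of distinct files explored in Phase 2, paired with that episode's final score. That reading is an assumption, and the data points are made up.

```python
from math import sqrt


def pearson_r(xs, ys) -> float:
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


breadth = [1, 2, 3, 4]                     # hypothetical files explored per episode
score_up = [0.2, 0.4, 0.6, 0.8]            # breadth helps
score_down = [0.8, 0.6, 0.4, 0.2]          # breadth hurts
r_positive = pearson_r(breadth, score_up)
r_negative = pearson_r(breadth, score_down)
```

With only about 16 held-out episodes, a coefficient computed this way is noisy, which is one reason to treat the Pool D numbers as directional.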
8 · Ablations
A · r_cross on vs. off — the most informative knob
| Condition | Δ `mean_final` (FT − Base) | Δ `mean_r_cross` |
|---|---|---|
| `r_cross_on` | ▲ 0.0256 | ▲ 0.169 |
| `r_cross_off` | ▲ 0.0054 | 0 |
Without the counterfactual reward, the fine-tuning gap shrinks ~80 %. Phase 1 has no incentive to produce a useful belief if you don't reward Phase 2 for using it.
B · Stopping behaviour shifts by allocation, not total
The fine-tuned model transitions to Phase 2 0.87 steps earlier and spends 0.88 steps more inside Phase 2. Net step count is roughly flat — but the budget allocation improved. Less dashboard, more code.
C · Source-type contribution
| Source removed | Δ mean_final (Pool C) |
|---|---|
| Logs only | ▼ 0.04 |
| Metrics only | ▼ 0.07 |
| Git log + diffs | ▼ 0.13 |
| Mini-repo file tree | ▼ 0.18 |
Code attribution is the single biggest contributor. Take away the repo and the agent loses ~40 % of its lift.
D · Convergence proxy
| Metric | Fine-tuned | Base |
|---|---|---|
| Early-window mean_final | 0.7475 | 0.6425 |
| Late-window mean_final | 0.4255 | 0.4620 |
The fine-tuned model starts hotter and then decays, which suggests it has memorised some training-distribution heuristics. That is consistent with the Pool D regression, and it is the clearest place to push next.
9 · Limitations & honest caveats
- Pool D regression. RL underperforms base by 0.028 on held-out compounds. Fix: Pool-D-shaped curriculum data + entropy bonus.
- Calibration regresses. ECE 0.58 → 0.81 — RL is more confident without being more correct. The `BeliefState` aux-loss in `training/belief_aux_loss.py` is the place to wire it back in.
- Sample sizes are honest, not heroic. Baseline n=80, SFT n=64, RL n=100; held-out n=16. Take the held-out number as directional.
- No code execution. Phase 2 is read-only. Adding a sandboxed `pytest` action would close the largest fraction of remaining capability gap.
- Minimal system prompt. A more elaborate scratchpad/belief-state prompt likely closes the SFT→RL gap further. We'd consider that a positive signal for the environment.
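For reference, the calibration metric in the list above (ECE) is usually computed by binning predictions by confidence and averaging the gap between confidence and accuracy. The sketch uses ten equal-width bins, which is a common default and an assumption here; the post does not say how its ECE was binned, and the inputs are made up.

```python
def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Weighted mean of |average confidence - accuracy| over equal-width bins."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Overconfident: 90 % stated confidence, 25 % actually correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
calibrated = expected_calibration_error([1.0, 0.0], [True, False])
```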
10 · Closing
We set out to answer one question: can a small open model, trained against a faithful incident-response simulator, become competitively useful at SRE triage?
On the training distribution: yes, clearly. On novel compounds: not yet, but the training signal we built (r_cross) and the curriculum that uses it are correctly oriented toward fixing that. And the most durable artefact from this submission isn't the score — it's the stack:
| Artefact | Where |
|---|---|
| OpenEnv environment | incident_env (this repo) |
| Hosted Space | meta-hf-hackathon-updated-policy.hf.space |
| LoRA adapter | daemongg/qwen2.5-7b-sre-grpo |
| Merged model | Yaswanth-Bolla/qwen-merged |
| Trajectories | sre_*_dataset.jsonl (in repo) |
| Training scripts | sft.py, training/grpo_train.py, merge.py |




