Teaching a 7B Model to Be On-Call

An OpenEnv benchmark and a four-stage GRPO pipeline that turns Qwen2.5-7B into a working SRE triage agent

TL;DR. We built incident_env, an OpenEnv POMDP in which an LLM agent has to diagnose a live, evolving production incident and then attribute it to a specific commit in a small repo. We then trained Qwen2.5-7B-Instruct through a four-stage curriculum (baseline rollouts → LoRA SFT → online GRPO with r_cross → merge). The post-trained model reaches a mean cumulative reward of ≈1.59 vs. ≈0.49 for the base model, in less than half the steps, with tighter variance and a reward CDF that dominates the baseline at nearly every quantile.

🧭 One-line pitch. Most agent benchmarks freeze a repo and ask the model to fix it. Our environment refuses to sit still — memory climbs, alerts cascade, and the obvious symptom is almost never the cause.

1 · Why this benchmark didn't exist yet

Pick any list of agentic LLM benchmarks today and you'll see two clusters:

Cluster	Examples	What they miss
Frozen-repo coding	SWE-bench, RepoBench, HumanEval	No evolving system, no observability, no alerts
Tool-use chains	AgentBench, ToolBench, τ-bench	Plenty of API calls, but no reactive simulator

Neither cluster matches the workflow that consumes the most engineer-hours at any company running real systems: on-call triage. A pager fires. A graph is wrong. Three services look broken but only one is broken. Someone has to triangulate, propose a fix, and identify the offending commit — under SLA pressure, with partial information.

That gap is exactly what incident_env fills.

✦ Capability gap. Today's LLMs can read a static repo. They cannot yet diagnose a system whose state changes while they're looking at it.

2 · Environment at a glance

incident_env is an OpenEnv Environment — clean Gym-style reset() / step() / state plus a /score endpoint for the oracle-independent grader. Under the hood it is a reactive, partially-observable, two-phase simulator.

Topology — seven reactive services

     ┌─────────┐    ┌─────┐    ┌────────┐    ┌─────────┐
     │ API GW  │───▶│Auth │───▶│ Orders │───▶│ Payment │
     └────┬────┘    └─────┘    └───┬────┘    └────┬────┘
          ▼                        ▼              ▼
     ┌─────────┐              ┌─────────┐    ┌─────────┐
     │  Cache  │              │   DB    │    │  Queue  │
     └─────────┘              └─────────┘    └─────────┘

Each service has live metric history (CPU, memory, p50/p95/p99 latency, error rate, RPS), structured logs, deploy history, and a healthy | degraded | down status. Faults propagate along this graph each tick(). Restarting a downstream service buys minutes; rolling back the wrong deploy makes things worse.
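To make the propagation idea concrete, here is a minimal sketch of one possible tick rule. The edge list is read off the diagram above; the one-hop-per-tick rule, the function name, and the status escalation are illustrative assumptions, not the environment's actual simulator.

```python
# Edges read off the topology diagram: caller -> services it depends on.
DEPENDS_ON = {
    "api_gw": ["auth", "cache"],
    "auth": ["orders"],
    "orders": ["payment", "db"],
    "payment": ["queue"],
    "cache": [],
    "db": [],
    "queue": [],
}

def tick(status: dict[str, str]) -> dict[str, str]:
    """Advance one tick: a healthy service with an unhealthy dependency degrades.

    Assumption: faults travel one hop per tick, from dependency to caller.
    """
    nxt = dict(status)
    for svc, deps in DEPENDS_ON.items():
        if status[svc] == "healthy" and any(status[d] != "healthy" for d in deps):
            nxt[svc] = "degraded"
    return nxt

status = {svc: "healthy" for svc in DEPENDS_ON}
status["db"] = "down"           # hidden root cause
after_one = tick(status)        # orders degrades first
after_two = tick(after_one)     # then auth, one hop further from the cause
after_three = tick(after_two)   # finally the gateway
```

Under this rule the gateway is the last service to show symptoms and the first one a user notices, which is the kind of misleading signal the scenarios are built around.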

The agent loop

Per-step execution is validate → mutate → tick → observe → reward. Two facts make the loop interesting:

The observation never exposes fault_type, the is_bad deploy flag, or any internal simulation state. The agent infers from symptoms.
The action space is hierarchical and masked. valid_actions[] is recomputed every step, so illegal actions (e.g. rollback on a service with no deploy history) are flagged with a -0.05 penalty.
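The loop and the masking rule can be sketched with a toy environment. `ToyIncidentEnv`, its fields, and the choice to still advance the clock on an invalid action are hypothetical; only the validate → mutate → tick → observe → reward order and the −0.05 penalty come from the description above, and all other shaped-reward terms are omitted.

```python
from dataclasses import dataclass, field

INVALID_PENALTY = -0.05  # penalty quoted in the post for illegal actions

@dataclass
class ToyIncidentEnv:
    """Hypothetical stand-in for the real environment."""
    deploys: dict = field(default_factory=lambda: {"orders": ["a1b2c3"], "cache": []})
    clock: int = 0

    def valid_actions(self) -> list:
        # Recomputed on every call: rollback is legal only where a deploy exists.
        actions = [("check_metrics", svc) for svc in self.deploys]
        actions += [("rollback_deploy", svc) for svc, hist in self.deploys.items() if hist]
        return actions

    def step(self, action: tuple):
        valid = action in self.valid_actions()            # validate
        if valid and action[0] == "rollback_deploy":      # mutate
            self.deploys[action[1]].pop()
        self.clock += 1                                   # tick
        obs = {"clock": self.clock,                       # observe (no hidden state)
               "valid_actions": self.valid_actions()}
        reward = 0.0 if valid else INVALID_PENALTY        # reward
        return obs, reward

env = ToyIncidentEnv()
_, r_bad = env.step(("rollback_deploy", "cache"))    # no deploy history: masked out
obs, r_ok = env.step(("rollback_deploy", "orders"))  # legal once, then masked
```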

3 · Two-phase action design (this is the novel bit)

Most environments give the agent one type of tool. Ours gives it two — and forces a deliberate transition between them.

stateDiagram-v2
    [*] --> Phase1
    state Phase1 {
        [*] --> Investigating
        Investigating --> Investigating : view_alerts / query_logs / check_metrics<br/>check_dependencies / check_deploy_history<br/>run_health_check
        Investigating --> Remediating  : restart_service / rollback_deploy / scale_service
        Remediating --> Investigating
        Investigating --> Declared     : declare_root_cause
    }
    Phase1 --> Phase2 : transition_to_phase2(belief)
    state Phase2 {
        [*] --> Exploring
        Exploring --> Exploring : list_dir / read_file / search_code<br/>get_git_log / get_file_diff
        Exploring --> Patched   : propose_patch / declare_no_change
    }
    Patched --> [*]
    Declared --> [*]

Phase 1 — ops investigation

The same tools an SRE has at 3 AM, plus a transition_to_phase2 control action that hands a structured BeliefState over to Phase 2:

Action	Category	Purpose
`view_alerts`	diagnostic	List firing alerts
`query_logs`	diagnostic	Filter by service/level/keyword
`check_metrics`	diagnostic	30-min time series
`check_dependencies`	diagnostic	Up/downstream graph
`check_deploy_history`	diagnostic	Recent deploys
`run_health_check`	diagnostic	Ping a service
`restart_service`	remediation	Temporary fix
`rollback_deploy`	remediation	Real fix if root cause
`scale_service`	remediation	More replicas
`declare_root_cause`	terminal	Diagnosis string
`transition_to_phase2`	control	Hand off to code attribution

Phase 2 — code attribution

When a scenario has a code_context, the env spins up a sandboxed CodeWorkspace over a bundled mini-repo:

snapshots/<scenario>/
    tree/                 ← actual source files
    git_log.json          ← commits (sha, author, date, msg, files)
    diffs/<sha>.patch     ← unified diff per commit

Seven new actions appear (five read-only exploration tools plus two terminal actions), all sandboxed (no .., no symlinks, no real subprocess):

Action	What it returns
`list_dir`	files + subdirs at a relative path
`read_file`	up to 64 KB of file contents
`search_code`	grep across the tree, capped at 50 hits
`get_git_log`	commit metadata for a path
`get_file_diff`	unified diff for `(commit_sha, path)`
`propose_patch`	terminal — submit a unified diff
`declare_no_change`	terminal — for spurious-issue scenarios
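The sandbox rules (no `..`, no symlinks, a 64 KB read cap) can be sketched with `pathlib` alone. This is an illustration under assumptions: `resolve_in_sandbox` and this `read_file` signature are hypothetical names, not the `CodeWorkspace` API.

```python
import tempfile
from pathlib import Path

MAX_READ_BYTES = 64 * 1024  # the post caps read_file at 64 KB

def resolve_in_sandbox(root: Path, rel: str) -> Path:
    """Resolve `rel` under `root`, rejecting absolute paths, '..' and symlinks."""
    rel_path = Path(rel)
    if rel_path.is_absolute() or ".." in rel_path.parts:
        raise PermissionError(f"path escapes sandbox: {rel!r}")
    candidate = root.resolve()
    for part in rel_path.parts:
        candidate = candidate / part
        if candidate.is_symlink():
            raise PermissionError(f"symlink not allowed: {candidate}")
    return candidate

def read_file(root: Path, rel: str) -> str:
    """Return at most MAX_READ_BYTES of a sandboxed file, decoded as UTF-8."""
    with resolve_in_sandbox(root, rel).open("rb") as fh:
        return fh.read(MAX_READ_BYTES).decode("utf-8", errors="replace")

# Usage on a throwaway directory standing in for snapshots/<scenario>/tree/.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "orders").mkdir()
    (root / "orders" / "auth_client.py").write_bytes(b"x = 1\n")
    contents = read_file(root, "orders/auth_client.py")
    try:
        resolve_in_sandbox(root, "../secrets.txt")
        escaped = True
    except PermissionError:
        escaped = False
```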

✦ Why two phases? Real triage is two phases. Mixing them in one action soup forces the agent to learn a strategy: gather enough Phase-1 evidence to make Phase-2 cheap, but don't dawdle. This single design decision is what gives r_cross (Section 4) something meaningful to reward.

4 · Reward design — two layers, kept separate by design

┌───────────────────────────────────────────────────────────────┐
│ LAYER 1 ·  Per-step shaped reward (TRAINING ONLY)            │
│   peeks at hidden state to give a useful gradient            │
├───────────────────────────────────────────────────────────────┤
│   diagnostic on involved svc           +0.15                 │
│   diagnostic on uninvolved svc         +0.05                 │
│   remediation on root-cause svc        +0.30                 │
│   correct root cause declaration       +0.40                 │
│   per-step efficiency cost             −0.02                 │
│   repeat / invalid                     −0.05                 │
│   wrong-target remediation             −0.15                 │
└───────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────────────┐
│ LAYER 2 ·  Oracle-independent grader (EVALUATION)            │
│   sees only the trajectory + declared patch                  │
├───────────────────────────────────────────────────────────────┤
│   p1_rca               25 %    keyword/AST match             │
│   p1_efficiency        15 %    fewer steps to declare        │
│   patch_quality        35 %    file overlap + AST + syntax   │
│   no_change_detection  25 %    spurious-issue scenarios      │
│   p2_efficiency        25 %    used when valid issue         │
└───────────────────────────────────────────────────────────────┘

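Patch quality has three tiers: file overlap (Jaccard), AST-level hunk similarity, and syntax validity. None of them reads hidden state, so saved trajectories can be re-graded months later from a JSONL file alone. The last two rows of the grader are alternatives (no_change_detection for spurious-issue scenarios, p2_efficiency when the issue is valid), so the weights applied to any one episode sum to 100 %.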
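As a rough illustration of the file-overlap tier and of how the Layer 2 weights might combine, here is a small sketch. The diff parsing, the `final_score` helper, the second file `orders/retry.py`, and the reading that `no_change_detection` and `p2_efficiency` are mutually exclusive are all assumptions for the example, not the grader's code.

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, taken as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def files_in_diff(diff_text: str) -> set:
    """Collect target paths from the '+++ b/<path>' headers of a unified diff."""
    files = set()
    for line in diff_text.splitlines():
        if line.startswith("+++ ") and not line.startswith("+++ /dev/null"):
            path = line[4:].strip()
            files.add(path[2:] if path.startswith("b/") else path)
    return files

# Shared weights from the Layer 2 table; the fourth 25 % slot depends on scenario type.
WEIGHTS = {"p1_rca": 0.25, "p1_efficiency": 0.15, "patch_quality": 0.35}

def final_score(components: dict, spurious: bool) -> float:
    last = "no_change_detection" if spurious else "p2_efficiency"
    return sum(w * components[k] for k, w in WEIGHTS.items()) + 0.25 * components[last]

proposed = "--- a/orders/auth_client.py\n+++ b/orders/auth_client.py\n@@ -1 +1 @@\n-x\n+y\n"
reference = (
    "--- a/orders/auth_client.py\n+++ b/orders/auth_client.py\n@@ -1 +1 @@\n-x\n+z\n"
    "--- a/orders/retry.py\n+++ b/orders/retry.py\n@@ -1 +1 @@\n-a\n+b\n"
)
overlap = jaccard(files_in_diff(proposed), files_in_diff(reference))  # 1 shared of 2 files

perfect = final_score(
    {"p1_rca": 1.0, "p1_efficiency": 1.0, "patch_quality": 1.0, "p2_efficiency": 1.0},
    spurious=False,
)
```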

`r_cross` — the counterfactual that makes joint training work

r_cross(τ) = max(0, r_code(τ_2 | context(τ_1)) − r_code(τ_2 | ∅))

Where:

Symbol	Meaning
`τ` (tau)	A full episode trajectory (a sequence of observation–action–reward steps).
`τ_1`	The Phase-1 sub-trajectory of `τ` (ops investigation steps only).
`τ_2`	The Phase-2 sub-trajectory of `τ` (code-attribution steps only).
`r_code(...)`	The Phase-2 grader score (patch quality + no-change detection), in `[0, 1]`.
`context(τ_1)`	The structured belief handed off from Phase 1 to Phase 2 (suspected service, fault class, confidences, evidence gaps).
`∅` (null context)	An empty handoff — Phase 2 starts with no Phase-1 evidence. Score measured separately on Pool B.
`max(0, ·)`	Clamp to non-negative; we never punish Phase 1 for inherently hard bugs.
`−`	Counterfactual difference: how much did Phase 1 actually help?

In English: how much did Phase 1's investigation actually help the code agent vs. starting from a null context? r_cross is what makes the joint training signal meaningful — without it, Phase 1 has no incentive to produce a useful handoff, only a plausible one. We will show in the ablations that turning r_cross off collapses ~80 % of the lift.
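The formula is small enough to write out directly. The function below is a sketch of the definition above; the argument names and the example scores are made up for illustration.

```python
def r_cross(r_code_with_context: float, r_code_null: float) -> float:
    """Counterfactual lift of the Phase-1 handoff, clamped at zero."""
    return max(0.0, r_code_with_context - r_code_null)

helped = r_cross(0.75, 0.25)  # handoff lifted the Phase-2 score by 0.5
hurt = r_cross(0.25, 0.5)     # handoff did not help: clamped to 0.0, never negative
```

The clamp means a misleading handoff is not punished through r_cross itself; it simply earns nothing.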

5 · Scenario flavours

Task	Hidden lesson
`memory_leak`	Single service, noisy metric — restart only buys minutes
`cascading_failure`	Loud services aren't the cause — must walk the dep graph
`distributed_deadlock`	Three remediation actions, in a specific order
`aliased_fault`	Queue-worker fault looks like a memory leak; the symptoms alias
`severity_inversion`	SEV1 page, two-line fix in `orders/auth_client.py`
`confidence_inversion`	Loud alerts on the wrong service; real bug is a lock-ordering issue
`info_ordering`	Decisive evidence shows up late — early committers lose
`circuit_breaker_noop`	Spurious issue; the right answer is `declare_no_change`
`heldout_*` (×2)	Compounds of the above; never seen during training

6 · The training pipeline

Architecture — what GRPO is actually optimising

Before the stage-by-stage detail, here is the architectural view: a three-level hierarchy with an orchestrator routing policy on top, two specialised subagents below it, and segment-level GRPO with cross-phase reward propagation underneath both.

Three things to notice in this picture:

The orchestrator owns the stopping criterion. Deciding when Phase 1 has gathered enough evidence to hand off is a learned policy, not a rule. The orchestrator emits a structured BeliefState (suspected_service, fault_class, confidences, evidence_gaps) at every transition decision — making the criterion auditable and supervisable.
The subagents are specialised but share weights. P1 (ops) and P2 (code) are the same Qwen2.5-7B-Instruct LoRA adapter prompted differently per phase. We train them in pool-isolated stages first, then jointly with r_cross switched on.
The reward signal is segment-level, not trajectory-level. Episodes are 8–16 k tokens; one scalar reward over the whole thing dilutes credit. Each phase becomes its own GRPO group; r_cross is added to the Phase-1 group return with stop-gradient on the Phase-2 path (training/segment_grpo.py). That single architectural choice is what lets joint training avoid poisoning Phase-1 gradients with Phase-2 noise.

The big picture (rendered SVG at the top of the post) shows the data flow Base → SFT → GRPO → Merge. The diagram above shows the gradient flow that lives inside the GRPO box. Stage-by-stage detail below — kept tight.

Stage 1 · Baseline rollouts

sre_finetune_collector.py drives the deployed environment over the HuggingFace Inference API (Qwen/Qwen2.5-7B-Instruct:fastest). Episodes are sampled across all four pools with weights A=0.35, B=0.20, C=0.35, D=0.10. Negative-reward episodes are kept as hard negatives — there's no quality filter on rollouts.
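Weighted pool sampling of this kind can be sketched with the standard library. The weights are the ones quoted above; `sample_pools` and the seeding are hypothetical, and the real collector may sample differently.

```python
import random
from collections import Counter

POOL_WEIGHTS = {"A": 0.35, "B": 0.20, "C": 0.35, "D": 0.10}

def sample_pools(n: int, seed: int = 0) -> list:
    """Draw n pool labels with the stated weights (seeded for reproducibility)."""
    rng = random.Random(seed)
    pools = list(POOL_WEIGHTS)
    return rng.choices(pools, weights=[POOL_WEIGHTS[p] for p in pools], k=n)

draws = sample_pools(10_000)
counts = Counter(draws)
share_a = counts["A"] / len(draws)  # should land close to 0.35
share_d = counts["D"] / len(draws)  # should land close to 0.10
```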

Three artefacts written incrementally:

sre_raw_trajectories.jsonl   — full episode + score breakdown
sre_sft_dataset.jsonl        — one row per (observation, action) step
sre_grpo_dataset.jsonl       — (prompt, chosen, rejected) preference pairs

Stage 2 · LoRA SFT (TRL)

Built on TRL's SFTTrainer with PEFT/LoRA — the minimum-requirements training stack named in RULES.md.

# sft.py (excerpt: model, dataset, training_args, script_args and model_args are built earlier in the script)
from trl import SFTTrainer, get_peft_config

trainer = SFTTrainer(
    model         = model,                                   # Qwen2.5-7B-Instruct
    args          = training_args,                           # bf16, packing on
    train_dataset = dataset[script_args.dataset_train_split],
    eval_dataset  = dataset[script_args.dataset_test_split],
    peft_config   = get_peft_config(model_args),             # LoRA: r=32, α=16
)
trainer.train()

Setting	Value
Base	`Qwen/Qwen2.5-7B-Instruct`
LoRA	`r=32, α=16, dropout=0.05` on `{q,k,v,o}_proj`
LR / epochs	`2e-4` / 1
Effective batch	`2 × 8` accum = 16
Precision	`bf16` + packing
Hardware	1× A100-40GB

LoRA notation. r is the rank of the low-rank update matrices B ∈ ℝ^{d_out×r} and A ∈ ℝ^{r×d_in} injected into each target linear layer of shape d_out×d_in; the effective weight delta is ΔW = (α/r) · B A, so α is a scaling coefficient (not a learning rate). dropout is applied to the input of A during training. The target modules {q,k,v,o}_proj are the four attention-projection linears in each transformer block.
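A toy calculation shows the shapes and the scaling at work. It uses the standard LoRA convention (B is d_out × r, A is r × d_in), plain Python lists rather than tensors, and tiny made-up dimensions chosen so that α/r = 0.5, the same ratio as the r=32, α=16 setting above.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_delta(B, A, alpha, r):
    """Effective weight update ΔW = (alpha / r) · B A."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]

d_out, d_in, r, alpha = 4, 3, 2, 1      # toy sizes; alpha / r = 0.5
B = [[1.0] * r for _ in range(d_out)]   # d_out x r
A = [[0.5] * d_in for _ in range(r)]    # r x d_in
delta = lora_delta(B, A, alpha, r)      # d_out x d_in, same shape as the frozen weight

full_params = d_out * d_in              # parameters in the frozen linear layer
lora_params = r * (d_in + d_out)        # trainable parameters in A and B
```

At these toy sizes the adapter is larger than the layer it adapts; the saving only appears when r is much smaller than d_in and d_out, as it is for a 7B model's projection matrices.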

Stage 3 · Post-SFT trajectories

Because the SFT model is ours, we provisioned an A100 manually and ran inference via plain transformers — no API. This produced the n=64 Pool C trajectories used as the GRPO warm-start corpus and the SFT reference distribution in the CDF (Section 7, blue curve).

Stage 4 · Online GRPO

training/grpo_train.py implements on-policy GRPO (Group Relative Policy Optimisation): K=4 rollouts per prompt with the current policy → within-group reward standardisation → clipped PPO-style loss with a KL penalty against a frozen reference model.

# training/grpo_train.py — the actual update
ratio     = torch.exp(plp - rlp.detach())
unclipped = ratio * adv
clipped   = torch.clamp(ratio, 1 - clip, 1 + clip) * adv
pg_loss   = -torch.min(unclipped, clipped)
kl_loss   = beta * (rlp.detach() - plp)
loss      = (pg_loss + kl_loss).sum() / n_tokens

Where:

Symbol	Meaning
`plp`	Per-token log-probability of the recorded assistant turn under the policy (current, trainable model).
`rlp`	Same per-token log-probability under the reference model (frozen base; `.detach()` blocks gradient).
`ratio = exp(plp − rlp)`	Importance-sampling ratio of policy / reference — equals `1.0` when they agree.
`adv`	The advantage for the segment, computed from the within-group return: `A_i = (R_i − μ_R) / (σ_R + ε)` where `R_i = terminal_reward + r_cross_i`, `μ_R, σ_R` are the mean/stdev of returns inside the K-rollout group, and `ε = 1e-6` for numerical stability.
`clip` (PPO ε)	Trust-region width: `0.2`. Caps how far `ratio` can move before the gradient is clipped.
`pg_loss`	Clipped policy-gradient loss (negative because we minimise).
`beta` (`β`)	KL penalty coefficient: `0.04`. Trades exploration vs. drift from the reference.
`kl_loss`	Per-token forward-KL approximation `β · (rlp − plp)`, pulling the policy toward the reference.
`n_tokens`	Total assistant tokens in the group — normalises so loss magnitude is independent of generation length.
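The symbols above can be exercised without a GPU. The sketch below mirrors the update in scalar form for a single token and computes group advantages from made-up returns. Using the population standard deviation is an assumption, since the post does not say which estimator the training script uses.

```python
import math
from statistics import mean, pstdev

EPS = 1e-6  # numerical-stability term from the advantage formula

def group_advantages(returns: list) -> list:
    """A_i = (R_i - mean) / (stdev + eps) within one K-rollout group."""
    mu, sigma = mean(returns), pstdev(returns)
    return [(R - mu) / (sigma + EPS) for R in returns]

def token_loss(plp: float, rlp: float, adv: float,
               clip: float = 0.2, beta: float = 0.04) -> float:
    """Scalar version of the clipped surrogate plus KL term for one token."""
    ratio = math.exp(plp - rlp)
    unclipped = ratio * adv
    clipped = max(1 - clip, min(1 + clip, ratio)) * adv
    pg_loss = -min(unclipped, clipped)
    kl_loss = beta * (rlp - plp)
    return pg_loss + kl_loss

# K = 4 rollouts; R_i = terminal_reward + r_cross_i (values are illustrative only).
terminal = [0.2, 0.5, 0.4, 0.8]
r_cross_vals = [0.0, 0.1, 0.1, 0.1]
returns = [t + c for t, c in zip(terminal, r_cross_vals)]
advs = group_advantages(returns)

same_policy = token_loss(plp=-1.0, rlp=-1.0, adv=1.0)  # ratio = 1, so loss = -adv
clipped_case = token_loss(plp=0.0, rlp=-1.0, adv=1.0)  # ratio = e, capped at 1.2
```

In the clipped case the surrogate contributes −1.2 rather than −e, which is the trust-region behaviour the `clip` row describes.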

Curriculum (the stage numbers below index the GRPO curriculum, not the pipeline stage headings):

Stage	Pool	Mode	What gets trained
2	A	`p1_only`	Ops policy only
3	B	`p2_only`	Code policy only (oracle handoff)
4	C	`joint`	Full P1 → P2 with `r_cross` on

Two safety scaffolds in training/variance_gate.py:

Variance gate — Stage 4 doesn't open until ≥4 tasks show stable r_code variance (stdev ≤ 0.15 over 64 samples).
r_cross warmup — linear ramp 0 → 1 over the first 500 Stage-4 steps.
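Both scaffolds reduce to a few lines. This is a sketch of the two rules as stated, not the contents of `training/variance_gate.py`; the function names, the use of population standard deviation, and the requirement of a full 64-sample window are assumptions.

```python
from statistics import pstdev

def variance_gate_open(r_code_by_task: dict, min_tasks: int = 4,
                       max_stdev: float = 0.15, window: int = 64) -> bool:
    """True once at least `min_tasks` tasks have a stable r_code over the last `window` samples."""
    stable = 0
    for samples in r_code_by_task.values():
        recent = samples[-window:]
        if len(recent) == window and pstdev(recent) <= max_stdev:
            stable += 1
    return stable >= min_tasks

def r_cross_weight(step: int, warmup_steps: int = 500) -> float:
    """Linear ramp from 0 to 1 over the first `warmup_steps` joint-training steps."""
    return min(1.0, max(0.0, step / warmup_steps))

steady = {f"task_{i}": [0.5] * 64 for i in range(4)}                  # four stable tasks
shaky = {**{f"task_{i}": [0.5] * 64 for i in range(3)},
         "task_3": [0.0, 1.0] * 32}                                   # fourth task too noisy
gate_steady = variance_gate_open(steady)
gate_shaky = variance_gate_open(shaky)
```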

Setting	Value	What it controls
LoRA	`r=16, α=32, dropout=0.05` on `{q,k,v,o}_proj`	Trainable adapter capacity (see Stage 2 box).
Learning rate	`1e-5`	AdamW step size on LoRA params only.
`β` (KL coeff)	`0.04`	Penalty pulling policy toward frozen reference; larger = more conservative.
`clip` (PPO ε)	`0.2`	Width of the trust region in the clipped surrogate.
Group size `K`	`4`	Rollouts per prompt used to compute within-group advantage.
Episodes / task	`64`	Per stage; split across the K-rollout groups.

Stage 5 · Merge

The smallest file in the repo and the one that makes everything deployable:

# merge.py
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model  = "Qwen/Qwen2.5-7B-Instruct"
lora_model  = "daemongg/qwen2.5-7b-sre-grpo"
output_repo = "Yaswanth-Bolla/qwen-merged"

# Load the base weights, attach the LoRA adapter, then fold it into the base.
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, lora_model)
model = model.merge_and_unload()
model.push_to_hub(output_repo)

The output is a vanilla causal LM that vLLM, TGI, or plain transformers can load with no idea it had adapters.

7 · Results

Figure 1 — Reward distribution (CDF)

Empirical CDF of cumulative reward — lower curve = better (more probability mass at high reward).

Baseline (green dashed, n=80): long left tail; ~40 % of rollouts under 0.75.
SFT (blue, n=64): consistent — fewer catastrophes, modest median.
Posttrained RL (red, n=100): dominates across nearly every quantile, with the steepest climb between 0.4 and 0.75 — that's where GRPO concentrated mass.
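For readers who want to reproduce this kind of comparison on their own rollouts, here is a sketch of an empirical CDF and a pointwise dominance check. The sample values are toy numbers, not the data behind Figure 1, and `dominates` is a hypothetical helper.

```python
from bisect import bisect_right

def ecdf(samples: list):
    """Return F where F(x) is the fraction of samples <= x."""
    xs = sorted(samples)
    n = len(xs)
    return lambda x: bisect_right(xs, x) / n

def dominates(better: list, worse: list, grid: list) -> bool:
    """True if `better`'s CDF is at or below `worse`'s at every grid point."""
    f_better, f_worse = ecdf(better), ecdf(worse)
    return all(f_better(x) <= f_worse(x) for x in grid)

base_rewards = [0.1, 0.3, 0.5, 0.9]   # toy values
rl_rewards = [0.6, 1.2, 1.5, 1.8]     # toy values
grid = [i / 10 for i in range(21)]    # reward thresholds 0.0 .. 2.0

base_at_half = ecdf(base_rewards)(0.5)                   # share of base rollouts at or below 0.5
rl_dominates = dominates(rl_rewards, base_rewards, grid)
base_dominates = dominates(base_rewards, rl_rewards, grid)
```

A lower CDF at a threshold means fewer rollouts fell at or below that reward, which is why the lower curve is the better one.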

Figure 2 — Efficiency curve (reward vs. steps)

Model	Mean reward by ~30 steps	Steps to plateau	σ at plateau
Baseline	~0.20	never within 60 steps	wide
SFT	~0.95	~50 steps	medium
Posttrained RL	~1.59	~25 steps	tight

✦ The operationally meaningful number isn't the +1.10 reward gain; it's that the post-trained model plateaus in roughly half the steps (~25 vs. ~50 for SFT, while the baseline never plateaus within 60). Fewer pages, less time-to-resolution.

Component breakdown — Pool C (oracle-independent grader, n ≈ 100)

Metric	Base	RL	Δ
`mean_final`	0.4495	0.4537	▲ 0.0042
`mean_p1_steps`	16.62	15.75	▼ 0.87
`mean_p2_steps`	5.62	6.50	▲ 0.88
`mean_r_cross`	0.4412	0.4662	▲ 0.025

The oracle-independent grader's mean_final moves only marginally on Pool C (+0.0042). The visible win is in cumulative reward, CDF dominance, and r_cross (+0.025), which is the training signal we actually cared about. The +0.88 shift in P2 steps is intentional: the RL model learned to use the code workspace before patching, instead of one-shotting a wrong diff.

Held-out — Pool D (n ≈ 16)

Metric	Base	RL	Δ
`mean_final`	0.5565	0.5284	▼ 0.0281
Pearson r (P2 breadth)	+0.4951	−0.3637	▼ 0.8588

⚠ We're flagging this honestly. On the two compositional held-out scenarios, RL is slightly worse than baseline. The strong negative Pearson on P2 breadth tells us why: the RL model commits to a narrow code search early; on truly novel compounds, the base model's naïve breadth-first browsing is a better strategy. Fix path is in §9.
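The Pearson figures relate a per-episode breadth measure to the final score. The post does not define "P2 breadth"; the sketch below assumes it is the number of distinct files opened in Phase 2, and the data points are invented to show what a positive and a negative correlation look like.

```python
import math

def pearson_r(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

breadth = [2, 3, 5, 6]                 # hypothetical: distinct files read per episode
score_breadth_helps = [0.3, 0.4, 0.6, 0.7]
score_breadth_hurts = [0.7, 0.6, 0.4, 0.3]

r_pos = pearson_r(breadth, score_breadth_helps)  # browsing more goes with higher scores
r_neg = pearson_r(breadth, score_breadth_hurts)  # browsing more goes with lower scores
```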

8 · Ablations

A · `r_cross` on vs. off — the most informative knob

Condition	Δ `mean_final` (FT − Base)	Δ `mean_r_cross`
`r_cross_on`	▲ 0.0256	▲ 0.169
`r_cross_off`	▲ 0.0054	0

Without the counterfactual reward, the fine-tuning gap shrinks by ~80 % (0.0256 → 0.0054). Phase 1 has no incentive to produce a useful belief unless it is rewarded for how much that belief helps Phase 2.

B · Stopping behaviour shifts by allocation, not total

The fine-tuned model transitions to Phase 2 0.87 steps earlier and spends 0.88 steps more inside Phase 2. Net step count is roughly flat — but the budget allocation improved. Less dashboard, more code.

C · Source-type contribution

Source removed	Δ `mean_final` (Pool C)
Logs only	▼ 0.04
Metrics only	▼ 0.07
Git log + diffs	▼ 0.13
Mini-repo file tree	▼ 0.18

Code attribution is the single biggest contributor. Take away the repo and the agent loses 0.18 of mean_final, roughly 40 % of its Pool C score (≈0.45).

D · Convergence proxy

Metric	Fine-tuned	Base
Early-window mean_final	0.7475	0.6425
Late-window mean_final	0.4255	0.4620

The fine-tuned model starts hotter and then decays below the base model in the late window, which suggests it has memorised some training-distribution heuristics. This is consistent with the Pool D regression and is the clearest place to push next.

9 · Limitations & honest caveats

Pool D regression. RL underperforms base by 0.028 on held-out compounds. Fix: Pool-D-shaped curriculum data + entropy bonus.
Calibration regresses. ECE 0.58 → 0.81 — RL is more confident without being more correct. The BeliefState aux-loss in training/belief_aux_loss.py is the place to wire it back in.
Sample sizes are honest, not heroic. Baseline n=80, SFT n=64, RL n=100; held-out n=16. Take the held-out number as directional.
No code execution. Phase 2 is read-only. Adding a sandboxed pytest action would close the largest fraction of remaining capability gap.
Minimal system prompt. A more elaborate scratchpad/belief-state prompt likely closes the SFT→RL gap further. We'd consider that a positive signal for the environment.
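For reference, the calibration caveat above relies on expected calibration error (ECE). The sketch below uses the standard equal-width binned definition with 10 bins; the post does not say which binning it used, and the confidence values are illustrative.

```python
def expected_calibration_error(confidences: list, correct: list, n_bins: int = 10) -> float:
    """Weighted mean of |avg confidence - accuracy| over equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

# Four declarations at 90 % confidence, only one of them right.
overconfident = expected_calibration_error([0.9] * 4, [True, False, False, False])
# Confident when right, unconfident when wrong.
calibrated = expected_calibration_error([1.0, 1.0, 0.0], [True, True, False])
```

A model that grows more confident without becoming more accurate moves toward the first case, which is the direction of the 0.58 → 0.81 shift.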

10 · Closing

We set out to answer one question: can a small open model, trained against a faithful incident-response simulator, become competitively useful at SRE triage?

On the training distribution: yes, clearly. On novel compounds: not yet, but the training signal we built (r_cross) and the curriculum that uses it are correctly oriented toward fixing that. And the most durable artefact from this submission isn't the score — it's the stack:

Artefact	Where
OpenEnv environment	`incident_env` (this repo)
Hosted Space	`meta-hf-hackathon-updated-policy.hf.space`
LoRA adapter	`daemongg/qwen2.5-7b-sre-grpo`
Merged model	`Yaswanth-Bolla/qwen-merged`
Trajectories	`sre_*_dataset.jsonl` (in repo)
Training scripts	`sft.py`, `training/grpo_train.py`, `merge.py`