	# Teaching a 7B Model to Be On-Call

	### An OpenEnv benchmark and a four-stage GRPO pipeline that turns Qwen2.5-7B into a working SRE triage agent

	---

	> TL;DR. We built `incident_env` — an OpenEnv POMDP where an LLM agent has to diagnose a live, evolving production incident and then attribute it to a specific commit in a small repo. Then we trained Qwen2.5-7B-Instruct through a four-stage curriculum (baseline rollouts → LoRA SFT → online GRPO with `r_cross` → merge). The post-trained model reaches a mean cumulative reward of ≈1.59 vs ≈0.49 for the base, at less than half the steps, with tighter variance and dominant CDF across the operating range.


	![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/bNlv5ywRBRCj3Al1BKi8R.png)

	> 🧭 One-line pitch. Most agent benchmarks freeze a repo and ask the model to fix it. Our environment refuses to sit still — memory climbs, alerts cascade, and the obvious symptom is almost never the cause.

	---

	## 1 · Why this benchmark didn't exist yet

	Pick any list of agentic LLM benchmarks today and you'll see two clusters:

	| Cluster | Examples | What they miss |
	| --- | --- | --- |
	| Frozen-repo coding | SWE-bench, RepoBench, HumanEval | No evolving system, no observability, no alerts |
	| Tool-use chains | AgentBench, ToolBench, τ-bench | Plenty of API calls, but no reactive simulator |

	Neither cluster matches the workflow that consumes the most engineer-hours at any company running real systems: on-call triage. A pager fires. A graph is wrong. Three services look broken but only one is broken. Someone has to triangulate, propose a fix, and identify the offending commit — under SLA pressure, with partial information.

	That gap is exactly what `incident_env` fills.

	> ✦ Capability gap. Today's LLMs can read a static repo. They cannot yet diagnose a system whose state changes while they're looking at it.

	---

	## 2 · Environment at a glance

	`incident_env` is an OpenEnv `Environment` — clean Gym-style `reset()` / `step()` / `state` plus a `/score` endpoint for the oracle-independent grader. Under the hood it is a reactive, partially-observable, two-phase simulator.

	### Topology — seven reactive services

	```
	┌─────────┐    ┌─────┐    ┌────────┐    ┌─────────┐
	│ API GW  │───▶│Auth │───▶│ Orders │───▶│ Payment │
	└────┬────┘    └─────┘    └───┬────┘    └────┬────┘
	     ▼                        ▼              ▼
	┌─────────┐              ┌─────────┐    ┌─────────┐
	│  Cache  │              │   DB    │    │  Queue  │
	└─────────┘              └─────────┘    └─────────┘
	```

	Each service has live metric history (CPU, memory, p50/p95/p99 latency, error rate, RPS), structured logs, deploy history, and a `healthy | degraded | down` status. Faults propagate along this graph each `tick()`. Restarting a downstream service buys minutes; rolling back the wrong deploy makes things worse.
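To make the propagation idea concrete, here is a minimal sketch of one `tick()` over the service graph. The edges are read off the topology diagram above; the propagation rule (a healthy caller degrades when something it calls is unhealthy) and every name in the block are assumptions for illustration, not the simulator's actual code.

```python
# Hypothetical sketch: one synchronous propagation step over the service graph.
# Edges mean "caller -> callee", read off the topology diagram.
DEPENDS_ON = {
    "api_gw": ["auth", "cache"],
    "auth": ["orders"],
    "orders": ["payment", "db"],
    "payment": ["queue"],
    "cache": [],
    "db": [],
    "queue": [],
}


def tick(status):
    """Return the next status map: a healthy caller becomes degraded
    if any service it calls is currently degraded or down."""
    nxt = dict(status)
    for svc, deps in DEPENDS_ON.items():
        if status[svc] == "healthy" and any(status[d] != "healthy" for d in deps):
            nxt[svc] = "degraded"
    return nxt


status = {svc: "healthy" for svc in DEPENDS_ON}
status["db"] = "down"        # inject a fault at the DB
after_one = tick(status)     # orders degrades first
after_two = tick(after_one)  # then auth, one hop further upstream
```

Because each tick reads only the previous status map, symptoms move one hop per step, which is why the loudest service at a given moment need not be the faulty one.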

	### The agent loop


	![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/sLaYeQmysnBDQw-VcmYsp.png)

	Per-step execution is `validate → mutate → tick → observe → reward`. Two facts make the loop interesting:

	1. The observation never exposes `fault_type`, the `is_bad` deploy flag, or any internal simulation state. The agent infers from symptoms.
	2. The action space is hierarchical and masked. `valid_actions[]` is recomputed every step, so illegal actions (e.g. rollback on a service with no deploy history) are flagged with a `-0.05` penalty.
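The loop and the masking rule can be sketched as follows. This is a toy stand-in, not the real `incident_env` class: `ToyIncidentEnv`, its fields, and the choice that an invalid action pays only the `-0.05` penalty (and not the step cost as well) are assumptions; the two reward constants come from the post.

```python
# Hypothetical sketch of validate -> mutate -> tick -> observe -> reward.
INVALID_PENALTY = -0.05   # invalid-action penalty quoted in the post
STEP_COST = -0.02         # per-step efficiency cost from the reward table


class ToyIncidentEnv:
    def __init__(self):
        # Only "orders" has a deploy that can be rolled back in this toy setup.
        self.deploys = {"orders": ["v42"], "cache": []}
        self.clock = 0

    def valid_actions(self):
        # Recomputed on every call, so the mask always tracks current state.
        actions = [("view_alerts", None)]
        actions += [("rollback_deploy", s) for s, d in self.deploys.items() if d]
        return actions

    def step(self, action):
        if action not in self.valid_actions():       # validate
            reward = INVALID_PENALTY
        else:
            name, target = action
            if name == "rollback_deploy":            # mutate
                self.deploys[target].pop()
            reward = STEP_COST
        self.clock += 1                              # tick
        obs = {"clock": self.clock,                  # observe
               "valid_actions": self.valid_actions()}
        return obs, reward                           # reward


env = ToyIncidentEnv()
_, r_invalid = env.step(("rollback_deploy", "cache"))   # no deploy history
obs, r_valid = env.step(("rollback_deploy", "orders"))  # legal exactly once
```

After the successful rollback, `rollback_deploy` on `orders` disappears from `valid_actions`, which is the behaviour the mask is meant to capture.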


	![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/UVcgYXYA0j-1ZN8RxN3Cd.png)

	---

	## 3 · Two-phase action design (this is the novel bit)

	Most environments give the agent one type of tool. Ours gives it two — and forces a deliberate transition between them.

	```mermaid
	stateDiagram-v2
	[*] --> Phase1
	state Phase1 {
	[*] --> Investigating
	Investigating --> Investigating : view_alerts / query_logs / check_metrics<br/>check_dependencies / check_deploy_history<br/>run_health_check
	Investigating --> Remediating : restart_service / rollback_deploy / scale_service
	Remediating --> Investigating
	Investigating --> Declared : declare_root_cause
	}
	Phase1 --> Phase2 : transition_to_phase2(belief)
	state Phase2 {
	[*] --> Exploring
	Exploring --> Exploring : list_dir / read_file / search_code<br/>get_git_log / get_file_diff
	Exploring --> Patched : propose_patch / declare_no_change
	}
	Patched --> [*]
	Declared --> [*]
	```

	### Phase 1 — ops investigation

	The same tools an SRE has at 3 AM, plus a `transition_to_phase2` control action that hands a structured `BeliefState` over to Phase 2:

	| Action | Category | Purpose |
	| --- | --- | --- |
	| `view_alerts` | diagnostic | List firing alerts |
	| `query_logs` | diagnostic | Filter by service/level/keyword |
	| `check_metrics` | diagnostic | 30-min time series |
	| `check_dependencies` | diagnostic | Up/downstream graph |
	| `check_deploy_history` | diagnostic | Recent deploys |
	| `run_health_check` | diagnostic | Ping a service |
	| `restart_service` | remediation | Temporary fix |
	| `rollback_deploy` | remediation | Real fix if root cause |
	| `scale_service` | remediation | More replicas |
	| `declare_root_cause` | terminal | Diagnosis string |
	| `transition_to_phase2` | control | Hand off to code attribution |

	### Phase 2 — code attribution

	When a scenario has a `code_context`, the env spins up a sandboxed `CodeWorkspace` over a bundled mini-repo:

	```
	snapshots/<scenario>/
	tree/ ← actual source files
	git_log.json ← commits (sha, author, date, msg, files)
	diffs/<sha>.patch ← unified diff per commit
	```

	Seven new actions appear: five read-only exploration actions and two terminal actions. All are sandboxed (no `..`, no symlinks, no real subprocess):

	| Action | What it returns |
	| --- | --- |
	| `list_dir` | files + subdirs at a relative path |
	| `read_file` | up to 64 KB of file contents |
	| `search_code` | grep across the tree, capped at 50 hits |
	| `get_git_log` | commit metadata for a path |
	| `get_file_diff` | unified diff for `(commit_sha, path)` |
	| `propose_patch` | terminal — submit a unified diff |
	| `declare_no_change` | terminal — for spurious-issue scenarios |
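A rough sketch of how the sandbox rules (no `..`, no symlinks, 64 KB read cap) could be enforced for `read_file`. The helper names and the file contents are hypothetical; only the constraints themselves and the path `orders/auth_client.py` appear in the post.

```python
import tempfile
from pathlib import Path


def resolve_in_sandbox(root: Path, rel: str) -> Path:
    """Map a relative path to a location under root, or raise ValueError."""
    candidate = Path(rel)
    if candidate.is_absolute() or ".." in candidate.parts:
        raise ValueError(f"path escapes sandbox: {rel!r}")
    # Walk each component so a symlink anywhere along the path is rejected.
    probe = root
    for part in candidate.parts:
        probe = probe / part
        if probe.is_symlink():
            raise ValueError(f"symlinks are not allowed: {rel!r}")
    return root / candidate


def read_file(root: Path, rel: str, limit: int = 64 * 1024) -> str:
    """Return at most `limit` bytes of the file, decoded leniently."""
    with open(resolve_in_sandbox(root, rel), "rb") as fh:
        return fh.read(limit).decode("utf-8", errors="replace")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "orders").mkdir()
    (root / "orders" / "auth_client.py").write_bytes(b"TIMEOUT = 5\n")
    content = read_file(root, "orders/auth_client.py")
    try:
        read_file(root, "../etc/passwd")
        escaped = True
    except ValueError:
        escaped = False
```

Rejecting `..` lexically, before touching the filesystem, keeps the check independent of what actually exists on disk.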

	> ✦ Why two phases? Real triage is two phases. Mixing them in one action soup forces the agent to learn a strategy: gather enough Phase-1 evidence to make Phase-2 cheap, but don't dawdle. This single design decision is what gives `r_cross` (Section 4) something meaningful to reward.

	---

	## 4 · Reward design — two layers, kept separate by design

	```
	┌────────────────────────────────────────────────────────────┐
	│ LAYER 1 · Per-step shaped reward (TRAINING ONLY)           │
	│ peeks at hidden state to give a useful gradient            │
	├────────────────────────────────────────────────────────────┤
	│ diagnostic on involved svc            +0.15                │
	│ diagnostic on uninvolved svc          +0.05                │
	│ remediation on root-cause svc         +0.30                │
	│ correct root cause declaration        +0.40                │
	│ per-step efficiency cost              −0.02                │
	│ repeat / invalid                      −0.05                │
	│ wrong-target remediation              −0.15                │
	└────────────────────────────────────────────────────────────┘
	                               │
	                               ▼
	┌────────────────────────────────────────────────────────────┐
	│ LAYER 2 · Oracle-independent grader (EVALUATION)           │
	│ sees only the trajectory + declared patch                  │
	├────────────────────────────────────────────────────────────┤
	│ p1_rca                25 %    keyword/AST match            │
	│ p1_efficiency         15 %    fewer steps to declare       │
	│ patch_quality         35 %    file overlap + AST + syntax  │
	│ no_change_detection   25 %    spurious-issue scenarios     │
	│ p2_efficiency         25 %    used when valid issue        │
	└────────────────────────────────────────────────────────────┘
	```

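	Patch quality has three tiers: file overlap (Jaccard), AST-level hunk similarity, and syntax validity. None of them reads hidden state, so saved trajectories can be re-graded months later from a JSONL file alone. The two bottom rows of Layer 2 are alternatives (`no_change_detection` for spurious-issue scenarios, `p2_efficiency` for valid issues), which is why a valid-issue scenario's weights sum to 25 + 15 + 35 + 25 = 100 %.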
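Two of the three patch-quality tiers are easy to sketch with the standard library: Jaccard overlap between the files touched by the proposed and reference diffs, and a syntax check on the patched source. The function names, the way files are pulled out of `+++ b/` lines, and the toy diffs are assumptions; the grader's real implementation may differ, and the AST-similarity tier is left out here.

```python
import ast


def files_in_diff(diff_text):
    """Collect target paths from the '+++ b/<path>' lines of a unified diff."""
    prefix = "+++ b/"
    return {
        line[len(prefix):].strip()
        for line in diff_text.splitlines()
        if line.startswith(prefix)
    }


def jaccard(a, b):
    """|a & b| / |a | b|, defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_valid_python(source):
    """Syntax tier: does the patched file still parse?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Toy diffs (hypothetical contents): the proposal touches one of two reference files.
proposed = (
    "--- a/orders/auth_client.py\n+++ b/orders/auth_client.py\n"
    "@@ -1 +1 @@\n-TIMEOUT = 5\n+TIMEOUT = 30\n"
)
reference = proposed + (
    "--- a/orders/retry.py\n+++ b/orders/retry.py\n"
    "@@ -1 +1 @@\n-N = 1\n+N = 3\n"
)
overlap = jaccard(files_in_diff(proposed), files_in_diff(reference))
```

Neither function needs the simulator's hidden state, only the submitted diff and a reference diff, which is the property that makes offline re-grading possible.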

	### `r_cross` — the counterfactual that makes joint training work

	```math
	r_{\text{cross}}(\tau) = \max\bigl(0,\; r_{\text{code}}(\tau_2 \mid \text{context}(\tau_1)) - r_{\text{code}}(\tau_2 \mid \varnothing)\bigr)
	```

	Where:

	| Symbol | Meaning |
	| --- | --- |
	| `τ` (tau) | A full episode trajectory (a sequence of observation–action–reward steps). |
	| `τ_1` | The Phase-1 sub-trajectory of `τ` (ops investigation steps only). |
	| `τ_2` | The Phase-2 sub-trajectory of `τ` (code-attribution steps only). |
	| `r_code(...)` | The Phase-2 grader score (patch quality + no-change detection), in `[0, 1]`. |
	| `context(τ_1)` | The structured belief handed off from Phase 1 to Phase 2 (suspected service, fault class, confidences, evidence gaps). |
	| `∅` (null context) | An empty handoff — Phase 2 starts with no Phase-1 evidence. Score measured separately on Pool B. |
	| `max(0, ·)` | Clamp to non-negative; we never punish Phase 1 for inherently hard bugs. |
	| `−` | Counterfactual difference: how much did Phase 1 actually help? |

	In English: how much did Phase 1's investigation actually help the code agent vs. starting from a null context? `r_cross` is what makes the joint training signal meaningful — without it, Phase 1 has no incentive to produce a useful handoff, only a plausible one. We will show in the ablations that turning `r_cross` off collapses ~80 % of the lift.
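The formula is small enough to write down directly. A minimal sketch, with hypothetical argument names and toy scores:

```python
def r_cross(r_code_with_context, r_code_null):
    """Counterfactual lift from the Phase-1 handoff, clamped at zero."""
    return max(0.0, r_code_with_context - r_code_null)


helpful = r_cross(0.75, 0.25)    # the handoff improved the Phase-2 score
unhelpful = r_cross(0.25, 0.5)   # the handoff hurt; the clamp keeps this at 0
```

The clamp means a misleading handoff earns nothing rather than a penalty, matching the `max(0, ·)` row in the table above.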

	---

	## 5 · Scenario flavours

	| Task | Hidden lesson |
	| --- | --- |
	| `memory_leak` | Single service, noisy metric — restart only buys minutes |
	| `cascading_failure` | Loud services aren't the cause — must walk the dep graph |
	| `distributed_deadlock` | Three remediation actions, in a specific order |
	| `aliased_fault` | Queue worker leaks like a memory leak — symptoms alias |
	| `severity_inversion` | SEV1 page, two-line fix in `orders/auth_client.py` |
	| `confidence_inversion` | Loud alerts on the wrong service; real bug is a lock-ordering issue |
	| `info_ordering` | Decisive evidence shows up late — early committers lose |
	| `circuit_breaker_noop` | Spurious issue; the right answer is `declare_no_change` |
	| `heldout_*` (×2) | Compounds of the above; never seen during training |

	---

	## 6 · The training pipeline

	### Architecture — what GRPO is actually optimising

	Before the stage-by-stage detail, here is the architectural view: a three-level hierarchy with an orchestrator routing policy on top, two specialised subagents below it, and segment-level GRPO with cross-phase reward propagation underneath both.

	![Hierarchical RL architecture — orchestrator + specialized subagents + segment-level GRPO with r_cross](./assets/hierarchical_rl_architecture.svg)

	Three things to notice in this picture:

	- The orchestrator owns the stopping criterion. Deciding when Phase 1 has gathered enough evidence to hand off is a learned policy, not a rule. The orchestrator emits a structured `BeliefState` (`suspected_service`, `fault_class`, confidences, `evidence_gaps`) at every transition decision — making the criterion auditable and supervisable.
	- The subagents are specialised but share weights. P1 (ops) and P2 (code) are the same Qwen2.5-7B-Instruct LoRA adapter prompted differently per phase. We train them in pool-isolated stages first, then jointly with `r_cross` switched on.
	- The reward signal is segment-level, not trajectory-level. Episodes are 8–16 k tokens; one scalar reward over the whole thing dilutes credit. Each phase becomes its own GRPO group; `r_cross` is added to the Phase-1 group return with stop-gradient on the Phase-2 path (`training/segment_grpo.py`). That single architectural choice is what lets joint training avoid poisoning Phase-1 gradients with Phase-2 noise.

	The big picture (rendered SVG at the top of the post) shows the data flow Base → SFT → GRPO → Merge. The diagram above shows the gradient flow that lives inside the GRPO box. Stage-by-stage detail below — kept tight.

	### Stage 1 · Baseline rollouts

	`sre_finetune_collector.py` drives the deployed environment over the HuggingFace Inference API (`Qwen/Qwen2.5-7B-Instruct:fastest`). Episodes are sampled across all four pools with weights `A=0.35, B=0.20, C=0.35, D=0.10`. Negative-reward episodes are kept as hard negatives — there's no quality filter on rollouts.
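Weighted pool sampling of this kind can be sketched with the standard library. The weights are the ones quoted above; the function name, the seed, and the use of `random.choices` are assumptions about how one might do it, not a description of `sre_finetune_collector.py`.

```python
import random
from collections import Counter

POOL_WEIGHTS = {"A": 0.35, "B": 0.20, "C": 0.35, "D": 0.10}  # from the post


def sample_pools(n_episodes, seed=0):
    """Draw one pool label per episode according to POOL_WEIGHTS."""
    rng = random.Random(seed)
    pools = list(POOL_WEIGHTS)
    weights = [POOL_WEIGHTS[p] for p in pools]
    return rng.choices(pools, weights=weights, k=n_episodes)


counts = Counter(sample_pools(10_000))
```

With a few thousand draws the empirical shares settle close to the configured weights, so Pool D stays a small but non-zero slice of the baseline corpus.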

	Three artefacts written incrementally:

	```
	sre_raw_trajectories.jsonl — full episode + score breakdown
	sre_sft_dataset.jsonl — one row per (observation, action) step
	sre_grpo_dataset.jsonl — (prompt, chosen, rejected) preference pairs
	```

	### Stage 2 · LoRA SFT (TRL)

	Built on TRL's `SFTTrainer` with PEFT/LoRA — the minimum-requirements training stack named in RULES.md.

	```python
	# sft.py (excerpt: model, training_args, dataset, script_args and model_args are defined elsewhere in the script)
	from trl import SFTTrainer, get_peft_config

	trainer = SFTTrainer(
	    model=model,                                # Qwen2.5-7B-Instruct
	    args=training_args,                         # bf16, packing on
	    train_dataset=dataset[script_args.dataset_train_split],
	    eval_dataset=dataset[script_args.dataset_test_split],
	    peft_config=get_peft_config(model_args),    # LoRA: r=32, α=16
	)
	trainer.train()
	```

	| Setting | Value |
	| --- | --- |
	| Base | `Qwen/Qwen2.5-7B-Instruct` |
	| LoRA | `r=32, α=16, dropout=0.05` on `{q,k,v,o}_proj` |
	| LR / epochs | `2e-4` / 1 |
	| Effective batch | `2 × 8` accum = 16 |
	| Precision | `bf16` + packing |
	| Hardware | 1× A100-40GB |

	> LoRA notation. `r` is the rank of the low-rank update matrices `B ∈ ℝ^{d×r}, A ∈ ℝ^{r×d}` injected into each target linear; the effective weight delta is `ΔW = (α/r) · B A`, so `α` is a scaling coefficient (not a learning rate). `dropout` is applied to the input of `A` during training. Target modules `{q,k,v,o}_proj` are the four attention-projection linears in each transformer block.
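A tiny numeric illustration of `ΔW = (α/r) · B A`, using the standard LoRA shape convention in which `B` is d×r and `A` is r×d, so that the product `B A` has the same d×d shape as the weight it updates. The toy dimensions and helper names are assumptions chosen to keep the arithmetic checkable by hand.

```python
def matmul(X, Y):
    """Plain-Python matrix product for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_delta(B, A, alpha):
    """Delta-W = (alpha / r) * B A, with B of shape d x r and A of shape r x d."""
    r = len(A)                  # rank = number of rows of A
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]


# Toy shapes: d = 3, r = 2 (the SFT run in the post uses r = 32).
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 x 2
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]     # 2 x 3
delta = lora_delta(B, A, alpha=1.0)        # scale = 1 / 2
```

With `r=32, α=16` the same scale factor of 0.5 applies; only `2·d·r` adapter parameters are trained per target matrix instead of `d²`.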

	### Stage 3 · Post-SFT trajectories

	Because the SFT model is ours, we provisioned an A100 manually and ran inference via plain `transformers` — no API. This produced the n=64 Pool C trajectories used as the GRPO warm-start corpus and the SFT reference distribution in the CDF (Section 7, blue curve).

	### Stage 4 · Online GRPO

	`training/grpo_train.py` implements on-policy GRPO (Group Relative Policy Optimisation): K=4 rollouts per prompt with the current policy → within-group reward standardisation → clipped PPO-style loss with a KL penalty against a frozen reference model.

	```python
	# training/grpo_train.py — the actual update
	ratio = torch.exp(plp - rlp.detach())
	unclipped = ratio * adv
	clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * adv
	pg_loss = -torch.min(unclipped, clipped)
	kl_loss = beta * (rlp.detach() - plp)
	loss = (pg_loss + kl_loss).sum() / n_tokens
	```

	Where:

	| Symbol | Meaning |
	| --- | --- |
	| `plp` | Per-token log-probability of the recorded assistant turn under the policy (current, trainable model). |
	| `rlp` | Same per-token log-probability under the reference model (frozen base; `.detach()` blocks gradient). |
	| `ratio = exp(plp − rlp)` | Importance-sampling ratio of policy / reference — equals `1.0` when they agree. |
	| `adv` | The advantage for the segment, computed from the within-group return: `A_i = (R_i − μ_R) / (σ_R + ε)` where `R_i = terminal_reward + r_cross_i`, `μ_R, σ_R` are the mean/stdev of returns inside the K-rollout group, and `ε = 1e-6` for numerical stability. |
	| `clip` (PPO ε) | Trust-region width: `0.2`. Caps how far `ratio` can move before the gradient is clipped. |
	| `pg_loss` | Clipped policy-gradient loss (negative because we minimise). |
	| `beta` (`β`) | KL penalty coefficient: `0.04`. Trades exploration vs. drift from the reference. |
	| `kl_loss` | Per-token forward-KL approximation `β · (rlp − plp)`, pulling the policy toward the reference. |
	| `n_tokens` | Total assistant tokens in the group — normalises so loss magnitude is independent of generation length. |
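To see the numbers these definitions produce, the sketch below re-expresses the advantage formula and the per-token loss in plain Python (no `torch`). It mirrors the formulas in the table, with two assumptions of its own: the population standard deviation is used for `σ_R`, and the returns are toy values.

```python
import math
import statistics


def group_advantages(returns, eps=1e-6):
    """A_i = (R_i - mean) / (stdev + eps) within one K-rollout group."""
    mu = statistics.fmean(returns)
    sigma = statistics.pstdev(returns)   # population stdev (an assumption)
    return [(r - mu) / (sigma + eps) for r in returns]


def token_loss(plp, rlp, adv, clip=0.2, beta=0.04):
    """Per-token loss: clipped surrogate plus beta * (rlp - plp)."""
    ratio = math.exp(plp - rlp)
    clipped_ratio = min(max(ratio, 1 - clip), 1 + clip)
    pg_loss = -min(ratio * adv, clipped_ratio * adv)
    kl_loss = beta * (rlp - plp)
    return pg_loss + kl_loss


# K = 4 rollouts; each return is terminal_reward + r_cross (toy numbers).
returns = [0.2 + 0.0, 0.4 + 0.1, 0.6 + 0.2, 0.9 + 0.4]
advs = group_advantages(returns)

loss_at_reference = token_loss(plp=-1.0, rlp=-1.0, adv=advs[-1])  # ratio = 1
loss_clipped = token_loss(plp=0.0, rlp=-1.0, adv=1.0)             # ratio = e, clipped to 1.2
```

When policy and reference agree, the loss reduces to `-adv`. Note that the `β · (rlp − plp)` term is signed per token: it is negative whenever the policy gives the token more probability than the reference does.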

	Curriculum:

	| Stage | Pool | Mode | What gets trained |
	| --- | --- | --- | --- |
	| 2 | A | `p1_only` | Ops policy only |
	| 3 | B | `p2_only` | Code policy only (oracle handoff) |
	| 4 | C | `joint` | Full P1 → P2 with `r_cross` on |

	Two safety scaffolds in `training/variance_gate.py`:

	- Variance gate — Stage 4 doesn't open until ≥4 tasks show stable `r_code` variance (stdev ≤ 0.15 over 64 samples).
	- `r_cross` warmup — linear ramp 0 → 1 over the first 500 Stage-4 steps.
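Both scaffolds are simple enough to sketch. The thresholds (≥4 tasks, stdev ≤ 0.15, 64 samples, 500 warmup steps) come from the bullets above; the function names, the use of the most recent 64 scores, and the population standard deviation are assumptions, not the contents of `training/variance_gate.py`.

```python
import statistics


def gate_open(r_code_by_task, max_stdev=0.15, min_tasks=4, n_samples=64):
    """True once at least `min_tasks` tasks have stdev <= max_stdev
    over their most recent `n_samples` r_code scores."""
    stable = 0
    for scores in r_code_by_task.values():
        window = scores[-n_samples:]
        if len(window) == n_samples and statistics.pstdev(window) <= max_stdev:
            stable += 1
    return stable >= min_tasks


def r_cross_weight(step, warmup_steps=500):
    """Linear ramp from 0 to 1 over the first `warmup_steps` joint-stage steps."""
    return min(1.0, max(0.0, step / warmup_steps))


# Toy history: four steady tasks and one that flips between 0 and 1.
history = {f"task_{i}": [0.5] * 64 for i in range(4)}
history["noisy"] = [0.0, 1.0] * 32
opened = gate_open(history)
```

The ramp multiplies `r_cross` before it is added to the Phase-1 return, so the first joint steps behave almost like the pool-isolated stages.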

	| Setting | Value | What it controls |
	| --- | --- | --- |
	| LoRA | `r=16, α=32, dropout=0.05` on `{q,k,v,o}_proj` | Trainable adapter capacity (see Stage 2 box). |
	| Learning rate | `1e-5` | AdamW step size on LoRA params only. |
	| `β` (KL coeff) | `0.04` | Penalty pulling policy toward frozen reference; larger = more conservative. |
	| `clip` (PPO ε) | `0.2` | Width of the trust region in the clipped surrogate. |
	| Group size `K` | `4` | Rollouts per prompt used to compute within-group advantage. |
	| Episodes / task | `64` | Per stage; split across the K-rollout groups. |

	### Stage 5 · Merge

	The smallest file in the repo and the one that makes everything deployable:

	```python
	# merge.py
	import torch
	from peft import PeftModel
	from transformers import AutoModelForCausalLM

	base_model = "Qwen/Qwen2.5-7B-Instruct"
	lora_model = "daemongg/qwen2.5-7b-sre-grpo"
	output_repo = "Yaswanth-Bolla/qwen-merged"

	# Load the base weights, attach the LoRA adapter, then fold it into the base.
	model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16, device_map="auto")
	model = PeftModel.from_pretrained(model, lora_model)
	model = model.merge_and_unload()
	model.push_to_hub(output_repo)
	```

	The output is a vanilla causal LM that vLLM, TGI, or plain `transformers` can load with no idea it had adapters.

	---

	## 7 · Results

	### Figure 1 — Reward distribution (CDF)


	![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/N79LUO_eo8nExgK5xhArc.png)


	> Empirical CDF of cumulative reward — lower curve = better (more probability mass at high reward).

	- Baseline (green dashed, n=80): long left tail; ~40 % of rollouts under 0.75.
	- SFT (blue, n=64): consistent — fewer catastrophes, modest median.
	- Post-trained RL (red, n=100): dominates across nearly every quantile, with the steepest climb between 0.4 and 0.75 — that's where GRPO concentrated mass.
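For readers who want to reproduce this kind of plot from saved trajectories, here is a minimal empirical-CDF sketch with a pointwise dominance check. The reward lists are toy values, not the data behind Figure 1, and the helper names are hypothetical.

```python
from bisect import bisect_right


def ecdf(samples):
    """Return F where F(x) is the fraction of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def dominates(better, worse, grid):
    """True if `better`'s CDF is at or below `worse`'s at every grid point."""
    f_better, f_worse = ecdf(better), ecdf(worse)
    return all(f_better(x) <= f_worse(x) for x in grid)


# Toy cumulative rewards (hypothetical).
base_rewards = [0.1, 0.3, 0.5, 0.7]
rl_rewards = [0.6, 1.2, 1.6, 2.0]
grid = [i / 10 for i in range(0, 21)]

share_under_075 = ecdf(base_rewards)(0.75)
rl_dominates = dominates(rl_rewards, base_rewards, grid)
```

A lower CDF at every reward level means a smaller share of rollouts falls at or below that level, which is the sense in which a lower curve is better.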

	### Figure 2 — Efficiency curve (reward vs. steps)


	![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/zTQyShUBp5jZ76Z_6rA_-.png)

	| Model | Mean reward by ~30 steps | Steps to plateau | σ at plateau |
	| --- | --- | --- | --- |
	| Baseline | ~0.20 | never within 60 steps | wide |
	| SFT | ~0.95 | ~50 steps | medium |
	| Post-trained RL | ~1.59 | ~25 steps | tight |

	> ✦ *The operationally meaningful number isn't the +1.10 reward; it's that the post-trained model plateaus in roughly half the steps (~25 vs. ~50 for SFT, while the baseline never plateaus within 60).* Fewer pages, less time-to-resolution.

	### Component breakdown — Pool C (oracle-independent grader, n ≈ 100)

	| Metric | Base | RL | Δ |
	| --- | --- | --- | --- |
	| `mean_final` | 0.4495 | 0.4537 | ▲ 0.0042 |
	| `mean_p1_steps` | 16.62 | 15.75 | ▼ 0.87 |
	| `mean_p2_steps` | 5.62 | 6.50 | ▲ 0.88 |
	| `mean_r_cross` | 0.4412 | 0.4662 | ▲ 0.025 |

	> The oracle-independent grader's `mean_final` moves only marginally on Pool C. The visible win is in cumulative reward, CDF dominance, and `r_cross` (+0.025), which is the actual training signal we cared about. The +0.88 P2-steps shift is intentional: the RL model learned to use the code workspace before patching, instead of one-shotting a wrong diff.

	### Held-out — Pool D (n ≈ 16)

	| Metric | Base | RL | Δ |
	| --- | --- | --- | --- |
	| `mean_final` | 0.5565 | 0.5284 | ▼ 0.0281 |
	| Pearson r (P2 breadth) | +0.4951 | −0.3637 | ▼ 0.8588 |

	> ⚠ We're flagging this honestly. On the two compositional held-out scenarios, RL is slightly worse than baseline. The strong negative Pearson on P2 breadth tells us why: the RL model commits to a narrow code search early; on truly novel compounds, the base model's naïve breadth-first browsing is a better strategy. Fix path is in §9.
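The Pearson figure in the table can be recomputed from per-episode pairs of (Phase-2 breadth, final score). A minimal sketch follows; treating "P2 breadth" as the number of distinct files read in Phase 2 is an assumption, as are the toy values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy episodes (hypothetical): breadth = distinct files read in Phase 2.
breadth = [1, 2, 3, 4]
score_when_breadth_helps = [0.2, 0.4, 0.6, 0.8]
score_when_breadth_hurts = [0.8, 0.6, 0.4, 0.2]

r_helps = pearson_r(breadth, score_when_breadth_helps)
r_hurts = pearson_r(breadth, score_when_breadth_hurts)
```

With n ≈ 16 held-out episodes, a correlation of this kind is noisy, so the sign flip is better read as a direction to investigate than as a precise estimate.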

	---

	## 8 · Ablations

	### A · `r_cross` on vs. off — the most informative knob

	| Condition | Δ `mean_final` (FT − Base) | Δ `mean_r_cross` |
	| --- | --- | --- |
	| `r_cross_on` | ▲ 0.0256 | ▲ 0.169 |
	| `r_cross_off` | ▲ 0.0054 | 0 |

	> Without the counterfactual reward, the fine-tuning gain in `mean_final` shrinks by ~80 %. Phase 1 has no incentive to produce a useful belief unless it is rewarded for how much that belief helps Phase 2.
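The ~80 % figure follows from the two Δ `mean_final` values in the table, taking the `r_cross_on` gain as the reference:

```math
1 - \frac{0.0054}{0.0256} \approx 1 - 0.21 = 0.79
```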

	### B · Stopping behaviour shifts by allocation, not total

	The fine-tuned model transitions to Phase 2 0.87 steps earlier and spends 0.88 steps more inside Phase 2. Net step count is roughly flat — but the budget allocation improved. Less dashboard, more code.

	### C · Source-type contribution

	| Source removed | Δ `mean_final` (Pool C) |
	| --- | --- |
	| Logs only | ▼ 0.04 |
	| Metrics only | ▼ 0.07 |
	| Git log + diffs | ▼ 0.13 |
	| Mini-repo file tree | ▼ 0.18 |

	> Code attribution is the single biggest contributor. Take away the repo and the agent loses ~40 % of its lift.

	### D · Convergence proxy

	| Metric | Fine-tuned | Base |
	| --- | --- | --- |
	| Early-window mean_final | 0.7475 | 0.6425 |
	| Late-window mean_final | 0.4255 | 0.4620 |

	> The fine-tuned model starts hotter and then decays, which suggests it has memorised some training-distribution heuristics. This is consistent with the Pool D regression and is the clearest place to push next.

	---

	## 9 · Limitations & honest caveats

	- Pool D regression. RL underperforms base by 0.028 on held-out compounds. Fix: Pool-D-shaped curriculum data + entropy bonus.
	- Calibration regresses. ECE 0.58 → 0.81 — RL is more confident without being more correct. The `BeliefState` aux-loss in `training/belief_aux_loss.py` is the place to wire it back in.
	- Sample sizes are honest, not heroic. Baseline n=80, SFT n=64, RL n=100; held-out n=16. Take the held-out number as directional.
	- No code execution. Phase 2 is read-only. Adding a sandboxed `pytest` action would close the largest fraction of remaining capability gap.
	- Minimal system prompt. A more elaborate scratchpad/belief-state prompt likely closes the SFT→RL gap further. We'd consider that a positive signal for the environment.
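For reference, the ECE quoted in the calibration bullet is usually computed by binning predictions by confidence and comparing each bin's accuracy with its mean confidence. A minimal sketch with equal-width bins; the bin count and the toy inputs are assumptions, and the post does not say which variant was used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Toy data: always 90 % confident, right half the time.
overconfident = expected_calibration_error([0.9] * 4, [True, False, True, False])
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

In this toy case the gap is 0.4: confidence 0.9 against accuracy 0.5. A rise in ECE with flat accuracy, as reported above, indicates confidence growing faster than correctness.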

	---

	## 10 · Closing

	We set out to answer one question: can a small open model, trained against a faithful incident-response simulator, become competitively useful at SRE triage?

	On the training distribution: yes, clearly. On novel compounds: not yet, but the training signal we built (`r_cross`) and the curriculum that uses it are correctly oriented toward fixing that. And the most durable artefact from this submission isn't the score — it's the stack:

	| Artefact | Where |
	| --- | --- |
	| OpenEnv environment | `incident_env` (this repo) |
	| Hosted Space | `meta-hf-hackathon-updated-policy.hf.space` |
	| LoRA adapter | `daemongg/qwen2.5-7b-sre-grpo` |
	| Merged model | `Yaswanth-Bolla/qwen-merged` |
	| Trajectories | `sre_*_dataset.jsonl` (in repo) |
	| Training scripts | `sft.py`, `training/grpo_train.py`, `merge.py` |


	---