| # Teaching a 7B Model to Be On-Call | |
| ### An OpenEnv benchmark and a four-stage GRPO pipeline that turns Qwen2.5-7B into a working SRE triage agent | |
| --- | |
| > **TL;DR.** We built `incident_env` — an OpenEnv POMDP where an LLM agent has to diagnose a live, evolving production incident and then attribute it to a specific commit in a small repo. Then we trained **Qwen2.5-7B-Instruct** through a four-stage curriculum (baseline rollouts → LoRA SFT → online GRPO with `r_cross` → merge). The post-trained model reaches a **mean cumulative reward of ≈1.59 vs ≈0.49** for the base, **at less than half the steps**, with tighter variance and dominant CDF across the operating range. | |
|  | |
| > 🧭 **One-line pitch.** *Most agent benchmarks freeze a repo and ask the model to fix it. Our environment refuses to sit still — memory climbs, alerts cascade, and the obvious symptom is almost never the cause.* | |
| --- | |
| ## 1 · Why this benchmark didn't exist yet | |
| Pick any list of agentic LLM benchmarks today and you'll see two clusters: | |
| | Cluster | Examples | What they miss | | |
| | --- | --- | --- | | |
| | **Frozen-repo coding** | SWE-bench, RepoBench, HumanEval | No evolving system, no observability, no alerts | | |
| | **Tool-use chains** | AgentBench, ToolBench, τ-bench | Plenty of API calls, but no reactive simulator | | |
| Neither cluster matches one of the most time-consuming workflows at companies running production systems: **on-call triage**. A pager fires. A graph is wrong. Three services look broken but only one *is* broken. Someone has to triangulate, propose a fix, and identify the offending commit, all under SLA pressure and with partial information. | |
| That gap is exactly what `incident_env` fills. | |
| > ✦ **Capability gap.** Today's LLMs can read a static repo. They cannot yet diagnose a system whose state changes while they're looking at it. | |
| --- | |
| ## 2 · Environment at a glance | |
| `incident_env` is an OpenEnv `Environment` — clean Gym-style `reset()` / `step()` / `state` plus a `/score` endpoint for the oracle-independent grader. Under the hood it is a **reactive, partially-observable, two-phase** simulator. | |
| ### Topology — seven reactive services | |
| ``` | |
| ┌─────────┐    ┌─────┐    ┌────────┐    ┌─────────┐ | |
| │ API GW  │───▶│Auth │───▶│ Orders │───▶│ Payment │ | |
| └────┬────┘    └─────┘    └───┬────┘    └────┬────┘ | |
|      ▼                        ▼              ▼ | |
| ┌─────────┐              ┌─────────┐    ┌─────────┐ | |
| │  Cache  │              │   DB    │    │  Queue  │ | |
| └─────────┘              └─────────┘    └─────────┘ | |
| ``` | |
| Each service has live metric history (CPU, memory, p50/p95/p99 latency, error rate, RPS), structured logs, deploy history, and a `healthy | degraded | down` status. Faults propagate along this graph each `tick()`. Restarting a downstream service buys minutes; rolling back the wrong deploy makes things worse. | |
| ### The agent loop | |
|  | |
| Per-step execution is `validate → mutate → tick → observe → reward`. Two facts make the loop interesting: | |
| 1. The observation **never** exposes `fault_type`, the `is_bad` deploy flag, or any internal simulation state. The agent infers from symptoms. | |
| 2. The action space is **hierarchical and masked**. `valid_actions[]` is recomputed every step, so illegal actions (e.g. rollback on a service with no deploy history) are flagged with a `-0.05` penalty. | |
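The per-step order and the masking rule can be sketched with a toy environment. This is a minimal illustration, not the real `incident_env`: the class `ToyIncidentEnv`, its two services, and the choice to charge only the `-0.05` penalty (and not the step cost) on an illegal action are all assumptions made for the example.

```python
from dataclasses import dataclass, field

INVALID_PENALTY = -0.05  # penalty for an illegal action, as quoted above
STEP_COST = -0.02        # per-step efficiency cost from the reward table


@dataclass
class ToyIncidentEnv:
    """Hypothetical stand-in for incident_env, for illustration only."""
    deploys: dict = field(default_factory=lambda: {"orders": ["d1"], "cache": []})
    clock: int = 0

    def valid_actions(self):
        # Recomputed on every call: rollback is legal only where deploy history exists.
        actions = [("check_metrics", svc) for svc in self.deploys]
        actions += [("rollback_deploy", svc) for svc, hist in self.deploys.items() if hist]
        return actions

    def step(self, action):
        if action not in self.valid_actions():        # validate
            reward = INVALID_PENALTY
        else:
            name, svc = action
            if name == "rollback_deploy":             # mutate
                self.deploys[svc].pop()
            reward = STEP_COST
        self.clock += 1                               # tick
        obs = {"t": self.clock, "valid_actions": self.valid_actions()}  # observe
        return obs, reward                            # reward


env = ToyIncidentEnv()
obs1, r_bad = env.step(("rollback_deploy", "cache"))   # no deploy history: illegal
obs2, r_ok = env.step(("rollback_deploy", "orders"))   # legal once, then masked out
```

After the second call, `rollback_deploy` on `orders` no longer appears in `obs2["valid_actions"]`, which is the masking behaviour described in point 2.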
|  | |
| --- | |
| ## 3 · Two-phase action design (this is the novel bit) | |
| Most environments give the agent one type of tool. Ours gives it two — and forces a deliberate transition between them. | |
| ```mermaid | |
| stateDiagram-v2 | |
| [*] --> Phase1 | |
| state Phase1 { | |
| [*] --> Investigating | |
| Investigating --> Investigating : view_alerts / query_logs / check_metrics<br/>check_dependencies / check_deploy_history<br/>run_health_check | |
| Investigating --> Remediating : restart_service / rollback_deploy / scale_service | |
| Remediating --> Investigating | |
| Investigating --> Declared : declare_root_cause | |
| } | |
| Phase1 --> Phase2 : transition_to_phase2(belief) | |
| state Phase2 { | |
| [*] --> Exploring | |
| Exploring --> Exploring : list_dir / read_file / search_code<br/>get_git_log / get_file_diff | |
| Exploring --> Patched : propose_patch / declare_no_change | |
| } | |
| Patched --> [*] | |
| Declared --> [*] | |
| ``` | |
| ### Phase 1 — ops investigation | |
| The same tools an SRE has at 3 AM, plus a `transition_to_phase2` control action that hands a structured `BeliefState` over to Phase 2: | |
| | Action | Category | Purpose | | |
| | --- | --- | --- | | |
| | `view_alerts` | diagnostic | List firing alerts | | |
| | `query_logs` | diagnostic | Filter by service/level/keyword | | |
| | `check_metrics` | diagnostic | 30-min time series | | |
| | `check_dependencies` | diagnostic | Up/downstream graph | | |
| | `check_deploy_history` | diagnostic | Recent deploys | | |
| | `run_health_check` | diagnostic | Ping a service | | |
| | `restart_service` | remediation | Temporary fix | | |
| | `rollback_deploy` | remediation | Real fix if root cause | | |
| | `scale_service` | remediation | More replicas | | |
| | `declare_root_cause` | terminal | Diagnosis string | | |
| | `transition_to_phase2` | control | Hand off to code attribution | | |
| ### Phase 2 — code attribution | |
| When a scenario has a `code_context`, the env spins up a sandboxed `CodeWorkspace` over a bundled mini-repo: | |
| ``` | |
| snapshots/<scenario>/ | |
| tree/ ← actual source files | |
| git_log.json ← commits (sha, author, date, msg, files) | |
| diffs/<sha>.patch ← unified diff per commit | |
| ``` | |
| Seven new actions appear: five read-only exploration tools, all sandboxed (no `..`, no symlinks, no real subprocess), plus two terminal actions: | |
| | Action | What it returns | | |
| | --- | --- | | |
| | `list_dir` | files + subdirs at a relative path | | |
| | `read_file` | up to 64 KB of file contents | | |
| | `search_code` | grep across the tree, capped at 50 hits | | |
| | `get_git_log` | commit metadata for a path | | |
| | `get_file_diff` | unified diff for `(commit_sha, path)` | | |
| | `propose_patch` | terminal — submit a unified diff | | |
| | `declare_no_change` | terminal — for spurious-issue scenarios | | |
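One way the sandbox rules (no `..`, no symlinks, a 64 KB read cap) could be enforced is sketched below. The function names `resolve_in_sandbox` and `read_file` and the exact checks are assumptions for illustration; the real `CodeWorkspace` may do this differently.

```python
import os
import tempfile

MAX_READ_BYTES = 64 * 1024  # the 64 KB cap from the table above


def resolve_in_sandbox(root: str, rel_path: str) -> str:
    """Return a real path inside root, or raise ValueError."""
    parts = rel_path.replace("\\", "/").split("/")
    if os.path.isabs(rel_path) or ".." in parts:
        raise ValueError(f"path escapes sandbox: {rel_path!r}")
    candidate = os.path.join(root, rel_path)
    if os.path.islink(candidate):
        raise ValueError(f"symlinks are not allowed: {rel_path!r}")
    real_root = os.path.realpath(root)
    real = os.path.realpath(candidate)
    # Also catches symlinked parent directories that point outside the tree.
    if os.path.commonpath([real_root, real]) != real_root:
        raise ValueError(f"path escapes sandbox: {rel_path!r}")
    return real


def read_file(root: str, rel_path: str) -> str:
    with open(resolve_in_sandbox(root, rel_path), "rb") as fh:
        return fh.read(MAX_READ_BYTES).decode("utf-8", errors="replace")


# Usage against a throwaway tree
root = tempfile.mkdtemp()
with open(os.path.join(root, "auth_client.py"), "wb") as fh:
    fh.write(b"TIMEOUT = 5\n")
content = read_file(root, "auth_client.py")
try:
    read_file(root, "../etc/passwd")
    escaped = True
except ValueError:
    escaped = False
```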
| > ✦ **Why two phases?** Real triage *is* two phases. Mixing them in one action soup forces the agent to learn a strategy: gather enough Phase-1 evidence to make Phase-2 cheap, but don't dawdle. This single design decision is what gives `r_cross` (Section 4) something meaningful to reward. | |
| --- | |
| ## 4 · Reward design — two layers, kept separate by design | |
| ``` | |
| ┌───────────────────────────────────────────────────────────────┐ | |
| │ LAYER 1 · Per-step shaped reward (TRAINING ONLY) │ | |
| │ peeks at hidden state to give a useful gradient │ | |
| ├───────────────────────────────────────────────────────────────┤ | |
| │ diagnostic on involved svc +0.15 │ | |
| │ diagnostic on uninvolved svc +0.05 │ | |
| │ remediation on root-cause svc +0.30 │ | |
| │ correct root cause declaration +0.40 │ | |
| │ per-step efficiency cost −0.02 │ | |
| │ repeat / invalid −0.05 │ | |
| │ wrong-target remediation −0.15 │ | |
| └───────────────────────────────────────────────────────────────┘ | |
| │ | |
| ▼ | |
| ┌───────────────────────────────────────────────────────────────┐ | |
| │ LAYER 2 · Oracle-independent grader (EVALUATION) │ | |
| │ sees only the trajectory + declared patch │ | |
| ├───────────────────────────────────────────────────────────────┤ | |
| │ p1_rca 25 % keyword/AST match │ | |
| │ p1_efficiency 15 % fewer steps to declare │ | |
| │ patch_quality 35 % file overlap + AST + syntax │ | |
| │ no_change_detection 25 % spurious-issue scenarios │ | |
| │ p2_efficiency 25 % used when valid issue │ | |
| └───────────────────────────────────────────────────────────────┘ | |
| ``` | |
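| Patch quality has three tiers: file overlap (Jaccard), AST-level hunk similarity, and syntax validity. None of them reads hidden state, so saved trajectories can be re-graded months later from a JSONL file alone. The last two rows of the Layer 2 box are alternatives (`no_change_detection` for spurious-issue scenarios, `p2_efficiency` when the issue is valid), so an episode is scored against 100 % of weight, not 125 %. | |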
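As a rough sketch of how the Layer 2 score could be assembled, the block below implements the Jaccard file-overlap tier and a weighted sum over the components in the box. It assumes that `no_change_detection` and `p2_efficiency` are alternatives selected by scenario type, and it does not model the AST or syntax tiers; the names `jaccard`, `final_score`, and the file `orders/retry.py` are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """|a & b| / |a | b|; two empty sets count as a perfect match."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Weights from the Layer 2 box; the fourth slot depends on the scenario type
# (an assumption based on the row labels).
WEIGHTS_VALID = {"p1_rca": 0.25, "p1_efficiency": 0.15,
                 "patch_quality": 0.35, "p2_efficiency": 0.25}
WEIGHTS_SPURIOUS = {"p1_rca": 0.25, "p1_efficiency": 0.15,
                    "patch_quality": 0.35, "no_change_detection": 0.25}


def final_score(components: dict, spurious: bool) -> float:
    """Weighted sum of component scores, each expected in [0, 1]."""
    weights = WEIGHTS_SPURIOUS if spurious else WEIGHTS_VALID
    return sum(w * components.get(name, 0.0) for name, w in weights.items())


# The agent's patch touches two files; the reference fix touches one of them.
touched = {"orders/auth_client.py", "orders/retry.py"}
gold = {"orders/auth_client.py"}
overlap = jaccard(touched, gold)

score = final_score(
    {"p1_rca": 1.0, "p1_efficiency": 0.5, "patch_quality": overlap, "p2_efficiency": 1.0},
    spurious=False,
)
```

Nothing here touches simulator internals: the inputs are the submitted patch, the reference patch, and step counts, which is what makes offline re-grading possible.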
| ### `r_cross` — the counterfactual that makes joint training work | |
| ```math | |
| r_{\text{cross}}(\tau) = \max\bigl(0,\; r_{\text{code}}(\tau_2 \mid \text{context}(\tau_1)) - r_{\text{code}}(\tau_2 \mid \varnothing)\bigr) | |
| ``` | |
| **Where:** | |
| | Symbol | Meaning | | |
| | --- | --- | | |
| | `τ` (tau) | A full episode trajectory (a sequence of observation–action–reward steps). | | |
| | `τ_1` | The Phase-1 sub-trajectory of `τ` (ops investigation steps only). | | |
| | `τ_2` | The Phase-2 sub-trajectory of `τ` (code-attribution steps only). | | |
| | `r_code(...)` | The Phase-2 grader score (patch quality + no-change detection), in `[0, 1]`. | | |
| | `context(τ_1)` | The structured belief handed off from Phase 1 to Phase 2 (suspected service, fault class, confidences, evidence gaps). | | |
| | `∅` (null context) | An empty handoff — Phase 2 starts with no Phase-1 evidence. Score measured separately on Pool B. | | |
| | `max(0, ·)` | Clamp to non-negative; we never *punish* Phase 1 for inherently hard bugs. | | |
| | `−` | Counterfactual difference: *how much did Phase 1 actually help?* | | |
| In English: *how much did Phase 1's investigation actually help the code agent vs. starting from a null context?* `r_cross` is what makes the joint training signal meaningful — without it, Phase 1 has no incentive to produce a *useful* handoff, only a *plausible* one. We will show in the ablations that turning `r_cross` off collapses ~80 % of the lift. | |
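A minimal sketch of the counterfactual, under one assumption not stated in the post: the null-context score is taken as the mean of several Pool B scores for the same task. The function name and the numbers are illustrative only.

```python
from statistics import mean


def r_cross(score_with_context: float, null_scores: list) -> float:
    """Clamped gain of the Phase-2 score over a null-context baseline."""
    null_baseline = mean(null_scores)  # assumption: average of Pool B runs
    return max(0.0, score_with_context - null_baseline)


helped = r_cross(0.70, [0.40, 0.50])   # handoff improved the code score
hurt = r_cross(0.30, [0.40, 0.50])     # handoff did worse than no context: clamped
```

The clamp means a misleading handoff earns zero rather than a negative reward, matching the `max(0, ·)` row in the table.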
| --- | |
| ## 5 · Scenario flavours | |
| | Task | Hidden lesson | | |
| | --- | --- | | |
| | `memory_leak` | Single service, noisy metric — restart only buys minutes | | |
| | `cascading_failure` | Loud services aren't the cause — must walk the dep graph | | |
| | `distributed_deadlock` | Three remediation actions, in a specific order | | |
| | `aliased_fault` | Queue worker leaks like a memory leak — symptoms alias | | |
| | `severity_inversion` | SEV1 page, two-line fix in `orders/auth_client.py` | | |
| | `confidence_inversion` | Loud alerts on the wrong service; real bug is a lock-ordering issue | | |
| | `info_ordering` | Decisive evidence shows up *late* — early committers lose | | |
| | `circuit_breaker_noop` | Spurious issue; the right answer is `declare_no_change` | | |
| | `heldout_*` (×2) | Compounds of the above; never seen during training | | |
| --- | |
| ## 6 · The training pipeline | |
| ### Architecture — what GRPO is actually optimising | |
| Before the stage-by-stage detail, here is the architectural view: a **three-level hierarchy** with an orchestrator routing policy on top, two specialised subagents below it, and segment-level GRPO with cross-phase reward propagation underneath both. | |
|  | |
| Three things to notice in this picture: | |
| - **The orchestrator owns the stopping criterion.** Deciding *when* Phase 1 has gathered enough evidence to hand off is a learned policy, not a rule. The orchestrator emits a structured `BeliefState` (`suspected_service`, `fault_class`, confidences, `evidence_gaps`) at every transition decision — making the criterion auditable and supervisable. | |
| - **The subagents are specialised but share weights.** P1 (ops) and P2 (code) are the same Qwen2.5-7B-Instruct LoRA adapter prompted differently per phase. We train them in pool-isolated stages first, then jointly with `r_cross` switched on. | |
| - **The reward signal is segment-level, not trajectory-level.** Episodes are 8–16 k tokens; one scalar reward over the whole thing dilutes credit. Each phase becomes its own GRPO group; `r_cross` is added to the Phase-1 group return *with stop-gradient on the Phase-2 path* (`training/segment_grpo.py`). That single architectural choice is what lets joint training avoid poisoning Phase-1 gradients with Phase-2 noise. | |
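The segment-level idea in the last bullet can be sketched in plain Python. This is an illustration under assumptions, not the contents of `training/segment_grpo.py`: the function names are hypothetical, population standard deviation is assumed, and `r_cross` is passed in as plain floats, which is one simple way to guarantee no gradient flows through the Phase-2 path.

```python
from statistics import mean, pstdev

EPS = 1e-6  # numerical-stability constant


def group_advantages(returns):
    """Standardise returns within one K-rollout group."""
    mu, sigma = mean(returns), pstdev(returns)
    return [(r - mu) / (sigma + EPS) for r in returns]


def segment_advantages(p1_rewards, p2_rewards, r_cross_values, cross_weight=1.0):
    """Return (phase-1 advantages, phase-2 advantages) for one prompt."""
    # r_cross values are constants here, so Phase-1 credit cannot push
    # gradients back into the Phase-2 computation that produced them.
    p1_returns = [r + cross_weight * c for r, c in zip(p1_rewards, r_cross_values)]
    return group_advantages(p1_returns), group_advantages(p2_rewards)


# K = 4 rollouts of the same prompt (toy numbers)
adv_p1, adv_p2 = segment_advantages(
    p1_rewards=[0.2, 0.4, 0.1, 0.3],
    p2_rewards=[0.5, 0.5, 0.9, 0.1],
    r_cross_values=[0.0, 0.2, 0.0, 0.1],
)
```

Each phase is normalised against its own group, so a noisy Phase-2 outcome changes Phase-1 advantages only through the scalar `r_cross` term.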
| The big picture (rendered SVG at the top of the post) shows the *data* flow Base → SFT → GRPO → Merge. The diagram above shows the *gradient* flow that lives inside the GRPO box. Stage-by-stage detail below — kept tight. | |
| ### Stage 1 · Baseline rollouts | |
| `sre_finetune_collector.py` drives the deployed environment over the **HuggingFace Inference API** (`Qwen/Qwen2.5-7B-Instruct:fastest`). Episodes are sampled across all four pools with weights `A=0.35, B=0.20, C=0.35, D=0.10`. **Negative-reward episodes are kept** as hard negatives — there's no quality filter on rollouts. | |
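Weighted pool sampling of this kind takes only a few lines. The sketch below uses the weights quoted above; the helper name `sample_pools` and the fixed seed are assumptions for reproducibility, not details of the collector script.

```python
import random
from collections import Counter

POOL_WEIGHTS = {"A": 0.35, "B": 0.20, "C": 0.35, "D": 0.10}


def sample_pools(n: int, seed: int = 0) -> list:
    """Draw n pool labels according to POOL_WEIGHTS."""
    rng = random.Random(seed)
    return rng.choices(list(POOL_WEIGHTS), weights=list(POOL_WEIGHTS.values()), k=n)


counts = Counter(sample_pools(10_000))
```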
| Three artefacts written incrementally: | |
| ``` | |
| sre_raw_trajectories.jsonl — full episode + score breakdown | |
| sre_sft_dataset.jsonl — one row per (observation, action) step | |
| sre_grpo_dataset.jsonl — (prompt, chosen, rejected) preference pairs | |
| ``` | |
| ### Stage 2 · LoRA SFT (TRL) | |
| Built on TRL's `SFTTrainer` with PEFT/LoRA — the minimum-requirements training stack named in RULES.md. | |
| ```python | |
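| # sft.py (excerpt: model, args, and dataset are constructed earlier in the script) | |
| from trl import SFTTrainer, get_peft_config | |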
| trainer = SFTTrainer( | |
| model = model, # Qwen2.5-7B-Instruct | |
| args = training_args, # bf16, packing on | |
| train_dataset = dataset[script_args.dataset_train_split], | |
| eval_dataset = dataset[script_args.dataset_test_split], | |
| peft_config = get_peft_config(model_args), # LoRA: r=32, α=16 | |
| ) | |
| trainer.train() | |
| ``` | |
| | Setting | Value | | |
| | --- | --- | | |
| | Base | `Qwen/Qwen2.5-7B-Instruct` | | |
| | LoRA | `r=32, α=16, dropout=0.05` on `{q,k,v,o}_proj` | | |
| | LR / epochs | `2e-4` / 1 | | |
| | Effective batch | `2 × 8` accum = 16 | | |
| | Precision | `bf16` + packing | | |
| | Hardware | 1× A100-40GB | | |
| > **LoRA notation.** `r` is the **rank** of the low-rank update matrices `A ∈ ℝ^{r×d_in}, B ∈ ℝ^{d_out×r}` injected into each target linear; the effective weight delta is `ΔW = (α/r) · B A`, so `α` is a **scaling coefficient** (not a learning rate). `dropout` is applied to the input of the LoRA branch (before `A`) during training. Target modules `{q,k,v,o}_proj` are the four attention-projection linears in each transformer block. | |
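To make the scaling concrete, here is a tiny pure-Python computation of `ΔW = (α/r) · B A` with `B` of shape `d_out × r` and `A` of shape `r × d_in`. The toy sizes and values are made up; `α = 1, r = 2` is chosen only because it gives the same `α/r = 0.5` ratio as `α = 16, r = 32`.

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_delta(B, A, alpha):
    """Scaled low-rank update (alpha / r) * B @ A, with r read from A's row count."""
    r = len(A)
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]


# Toy shapes: d_out = 2, d_in = 3, r = 2
B = [[1, 0],
     [0, 2]]        # d_out x r
A = [[1, 2, 3],
     [4, 5, 6]]     # r x d_in

delta = lora_delta(B, A, alpha=1)   # scale = 1 / 2 = 0.5
```

The result has the shape of the frozen weight (`d_out × d_in`), which is why the adapter can later be folded into the base model.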
| ### Stage 3 · Post-SFT trajectories | |
| Because the SFT model is *ours*, we provisioned an A100 manually and ran inference via plain `transformers` — no API. This produced the **n=64** Pool C trajectories used as the GRPO warm-start corpus and the SFT reference distribution in the CDF (Section 7, blue curve). | |
| ### Stage 4 · Online GRPO | |
| `training/grpo_train.py` implements **on-policy GRPO** (Group Relative Policy Optimisation): K=4 rollouts per prompt with the current policy → within-group reward standardisation → clipped PPO-style loss with a KL penalty against a frozen reference model. | |
| ```python | |
| # training/grpo_train.py — the actual update | |
| ratio = torch.exp(plp - rlp.detach()) | |
| unclipped = ratio * adv | |
| clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * adv | |
| pg_loss = -torch.min(unclipped, clipped) | |
| kl_loss = beta * (rlp.detach() - plp) | |
| loss = (pg_loss + kl_loss).sum() / n_tokens | |
| ``` | |
| **Where:** | |
| | Symbol | Meaning | | |
| | --- | --- | | |
| | `plp` | Per-token **log-probability** of the recorded assistant turn under the **policy** (current, trainable model). | | |
| | `rlp` | Same per-token log-probability under the **reference** model (frozen base; `.detach()` blocks gradient). | | |
| | `ratio = exp(plp − rlp)` | Importance-sampling ratio of policy / reference — equals `1.0` when they agree. | | |
| | `adv` | The **advantage** for the segment, computed from the within-group return: `A_i = (R_i − μ_R) / (σ_R + ε)` where `R_i = terminal_reward + r_cross_i`, `μ_R, σ_R` are the mean/stdev of returns inside the K-rollout group, and `ε = 1e-6` for numerical stability. | | |
| | `clip` (PPO ε) | Trust-region width: `0.2`. Caps how far `ratio` can move before the gradient is clipped. | | |
| | `pg_loss` | Clipped policy-gradient loss (negative because we minimise). | | |
| | `beta` (`β`) | KL penalty coefficient: `0.04`. Trades exploration vs. drift from the reference. | | |
| | `kl_loss` | Per-token forward-KL approximation `β · (rlp − plp)`, pulling the policy toward the reference. | | |
| | `n_tokens` | Total assistant tokens in the group — normalises so loss magnitude is independent of generation length. | | |
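For readers who want to trace the arithmetic without tensors, the same per-token objective can be mirrored in plain Python. This is a scalar re-statement of the snippet above with the quoted defaults (`clip = 0.2`, `beta = 0.04`); the function names are hypothetical and the inputs are toy values.

```python
import math


def grpo_token_loss(plp, rlp, adv, clip=0.2, beta=0.04):
    """Clipped policy-gradient term plus the log-ratio penalty for one token."""
    ratio = math.exp(plp - rlp)
    unclipped = ratio * adv
    clipped = min(max(ratio, 1 - clip), 1 + clip) * adv
    pg_loss = -min(unclipped, clipped)
    kl_loss = beta * (rlp - plp)
    return pg_loss + kl_loss


def grpo_loss(plps, rlps, adv, **kwargs):
    """Sum of per-token losses divided by the token count."""
    total = sum(grpo_token_loss(p, r, adv, **kwargs) for p, r in zip(plps, rlps))
    return total / len(plps)


same = grpo_token_loss(plp=-1.0, rlp=-1.0, adv=2.0)      # ratio = 1, no penalty
capped = grpo_token_loss(plp=0.0, rlp=-1.0, adv=1.0)     # ratio = e, clipped to 1.2
batch = grpo_loss([-1.0, 0.0], [-1.0, -1.0], adv=1.0)
```

With a positive advantage, once `ratio` exceeds `1 + clip` the clipped branch wins the `min`, so the policy-gradient term stops rewarding further movement away from the reference.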
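| Curriculum (stage numbers in this table and in the scaffolds below are GRPO curriculum stages, not the pipeline stages in the headings): | |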
| | Stage | Pool | Mode | What gets trained | | |
| | --- | --- | --- | --- | | |
| | 2 | A | `p1_only` | Ops policy only | | |
| | 3 | B | `p2_only` | Code policy only (oracle handoff) | | |
| | 4 | C | `joint` | Full P1 → P2 with `r_cross` on | | |
| Two safety scaffolds in `training/variance_gate.py`: | |
| - **Variance gate** — Stage 4 doesn't open until ≥4 tasks show stable `r_code` variance (stdev ≤ 0.15 over 64 samples). | |
| - **`r_cross` warmup** — linear ramp 0 → 1 over the first 500 Stage-4 steps. | |
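Both scaffolds are simple enough to sketch. The block below is an assumption-laden illustration rather than the contents of `training/variance_gate.py`: the function names, the use of population standard deviation, and the "most recent 64 samples" window are choices made for the example.

```python
from statistics import pstdev


def gate_open(r_code_by_task, min_tasks=4, max_stdev=0.15, window=64):
    """True once enough tasks show stable r_code over the last `window` samples."""
    stable = 0
    for scores in r_code_by_task.values():
        recent = scores[-window:]
        if len(recent) >= window and pstdev(recent) <= max_stdev:
            stable += 1
    return stable >= min_tasks


def r_cross_weight(step, warmup_steps=500):
    """Linear ramp from 0 to 1 over the first `warmup_steps` joint-stage steps."""
    return min(1.0, max(0.0, step / warmup_steps))


# Toy histories: four steady tasks and one that oscillates between 0 and 1
history = {f"task_{i}": [0.5] * 64 for i in range(4)}
history["noisy"] = [0.0, 1.0] * 32

is_open = gate_open(history)
not_yet = gate_open({k: v for k, v in history.items() if k != "task_0"})
```

The ramp multiplies `r_cross` before it is added to the Phase-1 return, so early joint-stage updates are dominated by the terminal reward.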
| | Setting | Value | What it controls | | |
| | --- | --- | --- | | |
| | LoRA | `r=16, α=32, dropout=0.05` on `{q,k,v,o}_proj` | Trainable adapter capacity (see Stage 2 box). | | |
| | Learning rate | `1e-5` | AdamW step size on LoRA params only. | | |
| | `β` (KL coeff) | `0.04` | Penalty pulling policy toward frozen reference; larger = more conservative. | | |
| | `clip` (PPO ε) | `0.2` | Width of the trust region in the clipped surrogate. | | |
| | Group size `K` | `4` | Rollouts per prompt used to compute within-group advantage. | | |
| | Episodes / task | `64` | Per stage; split across the K-rollout groups. | | |
| ### Stage 5 · Merge | |
| The smallest file in the repo and the one that makes everything deployable: | |
| ```python | |
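| # merge.py | |
| import torch | |
| from peft import PeftModel | |
| from transformers import AutoModelForCausalLM | |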
| base_model = "Qwen/Qwen2.5-7B-Instruct" | |
| lora_model = "daemongg/qwen2.5-7b-sre-grpo" | |
| output_repo = "Yaswanth-Bolla/qwen-merged" | |
| model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16, device_map="auto") | |
| model = PeftModel.from_pretrained(model, lora_model) | |
| model = model.merge_and_unload() | |
| model.push_to_hub(output_repo) | |
| ``` | |
| The output is a vanilla causal LM that vLLM, TGI, or plain `transformers` can load with no idea it had adapters. | |
| --- | |
| ## 7 · Results | |
| ### Figure 1 — Reward distribution (CDF) | |
|  | |
| > *Empirical CDF of cumulative reward — lower curve = better (more probability mass at high reward).* | |
| - **Baseline** (green dashed, n=80): long left tail; ~40 % of rollouts under 0.75. | |
| - **SFT** (blue, n=64): consistent — fewer catastrophes, modest median. | |
| - **Posttrained RL** (red, n=100): dominates across nearly every quantile, with the steepest climb between 0.4 and 0.75 — that's where GRPO concentrated mass. | |
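For reference, an empirical CDF and a simple dominance check can be computed as below. The reward lists are toy values chosen for the example, not the rollouts behind Figure 1, and the helper names are hypothetical.

```python
from bisect import bisect_right


def ecdf(samples):
    """Return F(x) = fraction of samples <= x."""
    xs = sorted(samples)
    n = len(xs)
    return lambda x: bisect_right(xs, x) / n


def dominates(better, worse, grid):
    """True if `better` has a CDF at or below `worse` at every grid point."""
    f_b, f_w = ecdf(better), ecdf(worse)
    return all(f_b(x) <= f_w(x) for x in grid)


# Toy cumulative rewards (illustrative only)
baseline = [0.1, 0.3, 0.5, 0.9]
posttrained = [0.8, 1.2, 1.6, 2.0]
grid = [0.0, 0.5, 1.0, 1.5, 2.0]

frac_baseline_under_half = ecdf(baseline)(0.5)
rl_dominates = dominates(posttrained, baseline, grid)
base_dominates = dominates(baseline, posttrained, grid)
```

A lower CDF at a given reward means fewer rollouts finished at or below that reward, which is why the lower curve is the better one.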
| ### Figure 2 — Efficiency curve (reward vs. steps) | |
|  | |
| | Model | Mean reward by ~30 steps | Steps to plateau | σ at plateau | | |
| | --- | --- | --- | --- | | |
| | Baseline | ~0.20 | never within 60 steps | wide | | |
| | SFT | ~0.95 | ~50 steps | medium | | |
| | **Posttrained RL** | **~1.59** | **~25 steps** | **tight** | | |
| > ✦ **The operationally meaningful number isn't the +1.10 reward; it's that the post-trained model plateaus in about half the steps SFT needs (~25 vs. ~50).** Fewer pages, less time-to-resolution. | |
| ### Component breakdown — Pool C (oracle-independent grader, n ≈ 100) | |
| | Metric | Base | RL | Δ | | |
| | --- | --- | --- | --- | | |
| | `mean_final` | 0.4495 | 0.4537 | ▲ 0.0042 | | |
| | `mean_p1_steps` | 16.62 | 15.75 | ▼ 0.87 | | |
| | `mean_p2_steps` | 5.62 | 6.50 | ▲ 0.88 | | |
| | `mean_r_cross` | 0.4412 | 0.4662 | ▲ 0.025 | | |
| > The oracle-independent grader's `mean_final` moves only marginally on Pool C. The visible win is in **cumulative reward**, **CDF dominance**, and **`r_cross`** (+0.025), which is the actual training signal we cared about. The +0.88 P2-steps shift is intentional: the RL model learned to *use* the code workspace before patching, instead of one-shotting a wrong diff. | |
| ### Held-out — Pool D (n ≈ 16) | |
| | Metric | Base | RL | Δ | | |
| | --- | --- | --- | --- | | |
| | `mean_final` | 0.5565 | 0.5284 | ▼ 0.0281 | | |
| | Pearson r (P2 breadth) | +0.4951 | −0.3637 | ▼ 0.8588 | | |
| > ⚠ **We're flagging this honestly.** On the two compositional held-out scenarios, RL is slightly worse than baseline. The strong negative Pearson on P2 breadth tells us why: the RL model commits to a narrow code search early; on truly novel compounds, the base model's naïve breadth-first browsing is a better strategy. Fix path is in §9. | |
| --- | |
| ## 8 · Ablations | |
| ### A · `r_cross` on vs. off — the most informative knob | |
| | Condition | Δ `mean_final` (FT − Base) | Δ `mean_r_cross` | | |
| | --- | --- | --- | | |
| | `r_cross_on` | **▲ 0.0256** | ▲ 0.169 | | |
| | `r_cross_off` | ▲ 0.0054 | 0 | | |
| > Without the counterfactual reward, the fine-tuning gap shrinks ~80 % (0.0256 → 0.0054). Phase 1 has no incentive to produce a *useful* belief unless it is rewarded for how much that belief helps Phase 2. | |
| ### B · Stopping behaviour shifts by allocation, not total | |
| The fine-tuned model transitions to Phase 2 **0.87 steps earlier** and spends **0.88 steps more inside Phase 2**. Net step count is roughly flat — but the *budget allocation* improved. Less dashboard, more code. | |
| ### C · Source-type contribution | |
| | Source removed | Δ `mean_final` (Pool C) | | |
| | --- | --- | | |
| | Logs only | ▼ 0.04 | | |
| | Metrics only | ▼ 0.07 | | |
| | Git log + diffs | ▼ 0.13 | | |
| | Mini-repo file tree | ▼ 0.18 | | |
| > Code attribution is the single biggest contributor. Take away the repo and the agent loses ~40 % of its lift. | |
| ### D · Convergence proxy | |
| | Metric | Fine-tuned | Base | | |
| | --- | --- | --- | | |
| | Early-window mean_final | 0.7475 | 0.6425 | | |
| | Late-window mean_final | 0.4255 | 0.4620 | | |
| > The fine-tuned model starts hotter and then decays, which suggests it has memorised some training-distribution heuristics. This is consistent with the Pool D regression and is the clearest place to push next. | |
| --- | |
| ## 9 · Limitations & honest caveats | |
| - **Pool D regression.** RL underperforms base by 0.028 on held-out compounds. Fix: Pool-D-shaped curriculum data + entropy bonus. | |
| - **Calibration regresses.** ECE 0.58 → 0.81 — RL is more confident without being more correct. The `BeliefState` aux-loss in `training/belief_aux_loss.py` is the place to wire it back in. | |
| - **Sample sizes are honest, not heroic.** Baseline n=80, SFT n=64, RL n=100; held-out n=16. Take the held-out number as directional. | |
| - **No code execution.** Phase 2 is read-only. Adding a sandboxed `pytest` action would close the largest fraction of remaining capability gap. | |
| - **Minimal system prompt.** A more elaborate scratchpad/belief-state prompt likely closes the SFT→RL gap further. We'd consider that a *positive* signal for the environment. | |
| --- | |
| ## 10 · Closing | |
| We set out to answer one question: *can a small open model, trained against a faithful incident-response simulator, become competitively useful at SRE triage?* | |
| On the training distribution: **yes, clearly.** On novel compounds: **not yet, but the training signal we built (`r_cross`) and the curriculum that uses it are correctly oriented toward fixing that.** And the most durable artefact from this submission isn't the score — it's the stack: | |
| | Artefact | Where | | |
| | --- | --- | | |
| | OpenEnv environment | `incident_env` (this repo) | | |
| | Hosted Space | `meta-hf-hackathon-updated-policy.hf.space` | | |
| | LoRA adapter | `daemongg/qwen2.5-7b-sre-grpo` | | |
| | Merged model | `Yaswanth-Bolla/qwen-merged` | | |
| | Trajectories | `sre_*_dataset.jsonl` (in repo) | | |
| | Training scripts | `sft.py`, `training/grpo_train.py`, `merge.py` | | |
| --- |