title: SRE Incident Response Simulator
emoji: 🚨
colorFrom: red
colorTo: gray
sdk: docker
app_port: 8000
pinned: false
🚨 SRE Triage Bot: OpenEnv Incident Response Simulator
An OpenEnv environment plus a four-stage GRPO pipeline that turns Qwen2.5-7B-Instruct into a working SRE triage agent. The agent runs against a reactive, partially observable microservices simulation with two phases: ops investigation (logs, metrics, alerts, deploy history) and code attribution (sandboxed mini-repo with git log + diffs).
Important Links
| Resource | Link |
|---|---|
| Blog post (full write-up) | BLOG.md |
| 🛰️ Live environment (HF Space) | Meta-HF-hackathon/updated-policy |
| Merged model (deployable) | Yaswanth-Bolla/qwen-merged |
| 🧩 LoRA adapter (post-GRPO) | daemongg/qwen2.5-7b-sre-grpo |
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Logs + scripts | logger |
⚠️ Note on training infrastructure. We ran the full pipeline (SFT, GRPO, merge) on HuggingFace Jobs (A100-40GB) instead of a Colab notebook, because Colab's free and Pro tiers ran out of memory on the 7B base + reference model + GRPO group buffers. The complete training logs and the exact scripts we executed are committed under `./logger/` (`sft_finetune.log`, `grpo_finetune.log`, `merge.log`, `trajectory.log`, `ablation.log`, plus the `.py` scripts that produced them) so the run is reproducible end-to-end.
🎯 What this submission delivers
- A novel two-phase POMDP environment with hierarchical, masked actions (10 ops actions + 7 code actions).
- A two-layer reward: a dense, oracle-shaped per-step signal for training and an oracle-independent grader for evaluation.
- A counterfactual cross-phase reward (`r_cross`) that makes joint training meaningful.
- A four-pool curriculum (A → B → C, with held-out D) executed via on-policy GRPO with a variance gate and `r_cross` warmup.
- Real measured improvement: mean cumulative reward ≈1.59 (RL) vs ≈0.49 (base) at less than half the steps. See `BLOG.md` §7 and `ablation.md`.
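The README does not spell out what the variance gate does. One common reading in GRPO-style training is to drop a rollout group whose rewards are nearly identical, because the group-normalised advantage then carries no learning signal. The sketch below illustrates that reading only; `group_advantages` and the `min_std` threshold are hypothetical names and values, not taken from `training/grpo_train.py`.

```python
from statistics import mean, pstdev


def group_advantages(rewards, min_std=1e-3):
    """Group-relative advantages for one GRPO group, or None if gated out.

    `min_std` is an assumed threshold: groups whose reward spread falls
    below it are skipped instead of being normalised by a near-zero std.
    """
    spread = pstdev(rewards)
    if spread < min_std:
        return None  # gate closed: every rollout scored (almost) the same
    centre = mean(rewards)
    return [(r - centre) / spread for r in rewards]


# A group of 4 rollouts, matching --group_size 4 from the training command.
print(group_advantages([0.0, 1.0, 0.0, 1.0]))
print(group_advantages([1.0, 1.0, 1.0, 1.0]))
```

The first group yields signed advantages around the group mean; the second is gated out and would contribute no gradient under this interpretation.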
Environment at a glance
A Partially-Observable Markov Decision Process over a reactive microservices simulator. The agent never sees the root cause; it sees only symptoms: climbing memory, cascading errors, firing alerts. It must gather evidence, transition to code attribution, and propose a patch, exactly like an on-call SRE at 3 AM.
| Dimension | Detail |
|---|---|
| Observation | Alerts · metric timeseries · structured logs · dependency graphs · deploy history · sandboxed repo tree + git log |
| Action space | Phase 1: 10 ops actions × 7 services. Phase 2: 5 code-exploration + 2 terminal actions. |
| Difficulty | Easy (single-service leak) → Medium (cascade) → Hard (distributed deadlock) → 5 research tasks → 2 held-out compounds |
| Reward | Oracle-shaped per-step signal for training + oracle-independent grader for eval + counterfactual r_cross |
| Realism | Reactive simulation: memory climbs, cascades propagate, restarts don't fix root causes |
Topology
┌─────────┐   ┌─────┐    ┌────────┐   ┌─────────┐
│ API GW  │──►│Auth │──► │ Orders │──►│ Payment │
└────┬────┘   └─────┘    └───┬────┘   └────┬────┘
     ▼                       ▼             ▼
┌─────────┐            ┌─────────┐   ┌─────────┐
│  Cache  │            │   DB    │   │  Queue  │
└─────────┘            └─────────┘   └─────────┘
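The topology can also be written as an adjacency map, which makes "walk the dependency graph" concrete. This is a minimal sketch: the edges are read off the diagram with each arrow interpreted as "calls / depends on", and the lower-case service identifiers and the `downstream` helper are assumptions, not names confirmed by the environment.

```python
from collections import deque

# Edges as drawn in the topology diagram (assumed identifiers).
DEPENDS_ON = {
    "api_gw": ["auth", "cache"],
    "auth": ["orders"],
    "orders": ["payment", "db"],
    "payment": ["queue"],
    "cache": [],
    "db": [],
    "queue": [],
}


def downstream(service):
    """Breadth-first list of everything `service` transitively depends on."""
    seen = {service}
    order = []
    todo = deque([service])
    while todo:
        for dep in DEPENDS_ON[todo.popleft()]:
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                todo.append(dep)
    return order


print(downstream("orders"))  # ['payment', 'db', 'queue']
```

A failure in any listed dependency can surface as symptoms on the service itself, which is why the loudest service is not necessarily the root cause.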
Action space (hierarchical + masked)
Phase 1: ops investigation
| Action | Category | Description |
|---|---|---|
| `view_alerts` | diagnostic | List firing alerts |
| `query_logs` | diagnostic | Service logs (level/keyword filters) |
| `check_metrics` | diagnostic | 30-min metric time series |
| `check_dependencies` | diagnostic | Up/downstream dependency map |
| `check_deploy_history` | diagnostic | Recent deploys per service |
| `run_health_check` | diagnostic | Ping a service |
| `restart_service` | remediation | Temporary fix |
| `rollback_deploy` | remediation | Real fix if root cause |
| `scale_service` | remediation | More replicas |
| `declare_root_cause` | terminal | Diagnosis string |
| `transition_to_phase2` | control | Hand off to code attribution |
Phase 2: code attribution
| Action | What it returns |
|---|---|
| `list_dir` | Files + subdirs at relative path |
| `read_file` | Up to 64 KB of file contents |
| `search_code` | grep across the tree (≤50 hits) |
| `get_git_log` | Commit metadata for a path |
| `get_file_diff` | Unified diff for (commit_sha, path) |
| `propose_patch` | Terminal: submit a unified diff |
| `declare_no_change` | Terminal: for spurious-issue scenarios |
Action masking: every observation includes `valid_actions[]`. Illegal actions (e.g. rollback on a service with no deploy history) cost `-0.05` and are recorded for analysis.
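A client can avoid the illegal-action penalty by checking the mask before it acts. The sketch below assumes `valid_actions` arrives as a list of action-type strings; the `choose_valid` helper and its fallback policy are illustrative and not part of the environment.

```python
def choose_valid(proposed, valid_actions, fallback="view_alerts"):
    """Return `proposed` if the mask allows it, otherwise a legal substitute."""
    if proposed in valid_actions:
        return proposed
    # Use the fallback only when it is itself legal.
    if fallback in valid_actions:
        return fallback
    if not valid_actions:
        raise ValueError("observation contains no valid actions")
    # Last resort: the first legal action in the mask.
    return valid_actions[0]


mask = ["view_alerts", "query_logs", "check_metrics"]
print(choose_valid("rollback_deploy", mask))  # view_alerts
```

Whether a silent substitution is preferable to letting the model take the penalty is a design choice; during training the penalty itself is part of the learning signal.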
Observation space (POMDP)
The agent never sees: fault_type, is_bad deploy flag, internal simulation state.
It does see:
- Incident summary + severity (`SEV1`/`SEV2`/`SEV3`)
- Service statuses (`healthy`/`degraded`/`down`)
- Active alert count
- Action result (data from the most recent action)
- `valid_actions[]` (action mask)
- Time elapsed / budget (SLA pressure)
- Cumulative reward and step count
- `current_phase` ∈ {1, 2}
Tasks (10 scenarios, 4 pools)
| Task | Difficulty | Hidden lesson |
|---|---|---|
| `memory_leak` | easy | Single service, noisy metric; restart only buys minutes |
| `cascading_failure` | medium | Loud services aren't the cause; walk the dep graph |
| `distributed_deadlock` | hard | Three remediation actions in a specific order |
| `aliased_fault` | research | Symptoms alias across fault families |
| `severity_inversion` | research | SEV1 page, two-line code fix |
| `confidence_inversion` | research | Loud alerts on the wrong service |
| `info_ordering` | research | Decisive evidence shows up late |
| `circuit_breaker_noop` | research | Spurious issue; `declare_no_change` is correct |
| `heldout_aliased_severity` | held-out | Compound; never seen during training |
| `heldout_confidence_ordering` | held-out | Compound; never seen during training |
Pools: A (p1_only), B (p2_only with oracle handoff), C (joint with r_cross), D (held-out generalisation).
Reward design (two layers)
Layer 1: per-step shaped reward (training only)
| Action | Condition | Reward |
|---|---|---|
| Diagnostic | involved service | +0.15 |
| Diagnostic | uninvolved service | +0.05 |
| Any | repeat | −0.05 |
| Remediation | correct target (root cause svc) | +0.30 |
| Remediation | helpful (affected, not root) | +0.10 |
| Remediation | harmful (healthy svc) | −0.15 |
| Declaration | correct root cause | +0.40 |
| Declaration | wrong root cause | −0.20 |
| Any | per-step efficiency cost | −0.02 |
| Completion | all services healthy | +0.20 |
| Completion | budget exceeded | −0.10 |
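The table can be read as a lookup plus a flat per-step cost. The sketch below encodes that reading; it assumes the repeat penalty stacks on top of the base reward (the table does not say whether it stacks or replaces), and `step_reward` is a hypothetical helper, not the environment's own function. Completion bonuses are left out. Under this reading, a diagnostic on an involved service nets +0.13 and a correct remediation nets +0.28, which matches the per-step rewards shown in the README's example interaction.

```python
STEP_COST = -0.02       # per-step efficiency cost, applied to every action
REPEAT_PENALTY = -0.05  # assumed to stack with the base reward

# (category, condition) -> shaped reward, copied from the Layer 1 table
SHAPED = {
    ("diagnostic", "involved"): 0.15,
    ("diagnostic", "uninvolved"): 0.05,
    ("remediation", "correct"): 0.30,
    ("remediation", "helpful"): 0.10,
    ("remediation", "harmful"): -0.15,
    ("declaration", "correct"): 0.40,
    ("declaration", "wrong"): -0.20,
}


def step_reward(category, condition, repeated=False):
    """Shaped reward for one action, rounded to two decimals."""
    reward = SHAPED[(category, condition)] + STEP_COST
    if repeated:
        reward += REPEAT_PENALTY
    return round(reward, 2)


print(step_reward("diagnostic", "involved"))   # 0.13
print(step_reward("remediation", "correct"))   # 0.28
```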
Layer 2: oracle-independent grader (evaluation)
| Component | Weight | Measures |
|---|---|---|
| `p1_rca` | 25 % | Did the agent declare the correct root cause? |
| `p1_efficiency` | 15 % | Fewer steps to declare = better |
| `patch_quality` | 35 % | File overlap (Jaccard) + AST hunk similarity + syntax validity |
| `no_change_detection` | 25 % | Correct `declare_no_change` on spurious-issue scenarios |
| `p2_efficiency` | 25 % | Phase-2 step efficiency (takes the `no_change_detection` slot when the issue is valid) |
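Because `p2_efficiency` and `no_change_detection` share one 25 % slot, the weights sum to 100 % in either case. The sketch below shows that arithmetic plus the file-overlap (Jaccard) term of `patch_quality`. It assumes each component score lies in [0, 1] and that an empty-versus-empty file set counts as a perfect match; `final_grade` and `file_jaccard` are illustrative names, not the repo's grader API.

```python
BASE_WEIGHTS = {"p1_rca": 0.25, "p1_efficiency": 0.15, "patch_quality": 0.35}
SLOT_WEIGHT = 0.25  # shared by no_change_detection and p2_efficiency


def final_grade(scores, spurious_issue):
    """Weighted sum of component scores for one episode."""
    slot = "no_change_detection" if spurious_issue else "p2_efficiency"
    total = sum(weight * scores[name] for name, weight in BASE_WEIGHTS.items())
    return total + SLOT_WEIGHT * scores[slot]


def file_jaccard(predicted_files, reference_files):
    """Jaccard overlap between the files a patch touches and the reference set."""
    a, b = set(predicted_files), set(reference_files)
    union = a | b
    # Assumption: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


scores = {"p1_rca": 1.0, "p1_efficiency": 0.5, "patch_quality": 0.8, "p2_efficiency": 0.6}
print(final_grade(scores, spurious_issue=False))
print(file_jaccard(["orders/batch.py", "orders/db.py"], ["orders/batch.py"]))  # 0.5
```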
Plus the counterfactual cross-phase reward:
r_cross(τ) = max(0, r_code(τ_2 | context(τ_1)) − r_code(τ_2 | ∅))
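In words, the cross-phase reward is the extra code-phase reward the agent earns because it carried Phase-1 context into Phase 2, clipped at zero so that unhelpful context is never penalised. A minimal sketch of that formula follows; the argument names are assumptions, and how the two `r_code` values are obtained (for example by re-scoring Phase 2 with an empty context) is not shown.

```python
def r_cross(r_code_with_context, r_code_without_context):
    """Counterfactual gain in code reward from Phase-1 context, clipped at 0."""
    return max(0.0, r_code_with_context - r_code_without_context)


print(r_cross(0.75, 0.25))  # 0.5: the Phase-1 context helped
print(r_cross(0.25, 0.75))  # 0.0: no credit when the context did not help
```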
Headline result
| Model | Mean cumulative reward (≈30 steps) | Steps to plateau | σ at plateau |
|---|---|---|---|
| Base (Qwen2.5-7B-Instruct) | ~0.20 | never within 60 | wide |
| SFT (LoRA) | ~0.95 | ~50 | medium |
| Post-trained (GRPO + merge) | ~1.59 | ~25 | tight |
Full plots, ablations, and component breakdown in BLOG.md §7–8.
Quick start
Run the environment locally
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
curl http://localhost:8000/health
curl -X POST http://localhost:8000/reset \
-H "Content-Type: application/json" \
-d '{"task_name": "memory_leak"}'
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"action_type": "view_alerts"}'
Run the trained agent
export ENV_BASE_URL=http://localhost:8000
python inference.py --model Yaswanth-Bolla/qwen-merged
Run the agent against a real GitHub issue + repo
python inference_agent.py \
--model Yaswanth-Bolla/qwen-merged \
--repo /path/to/cloned/repo \
--issue https://github.com/owner/repo/issues/42
Docker
docker build -t incident-env .
docker run -p 8000:8000 incident-env
Reproducing the training run
We ran every stage on HuggingFace Jobs (A100-40GB); see ./logger/ for the exact scripts and their full stdout.
# Stage 1: collect baseline trajectories (HF Inference API)
python sre_finetune_collector.py # → sre_*_dataset.jsonl
# Stage 2: LoRA SFT via TRL
python sft.py \
--model_name_or_path Qwen/Qwen2.5-7B-Instruct \
--dataset_name <your-sft-dataset> \
--use_peft --lora_r 32 --lora_alpha 16 \
--learning_rate 2e-4 --num_train_epochs 1 \
--packing --eos_token '<|im_end|>' \
--output_dir Qwen2.5-7B-SRE-SFT --push_to_hub
# Stage 3+4: online GRPO (Pool A → B → C)
python training/grpo_train.py \
--model <your-sft-checkpoint> \
--stages 2 3 4 \
--group_size 4 --episodes_per_task 64 \
--use_lora --lora_r 16 --lora_alpha 32 \
--push_to_hub daemongg/qwen2.5-7b-sre-grpo
# Stage 5: merge LoRA into base
python merge.py
Logs from these exact runs:
| Stage | Log |
|---|---|
| Trajectory collection | logger/trajectory.log |
| SFT | logger/sft_finetune.log |
| GRPO | logger/grpo_finetune.log |
| Merge | logger/merge.log |
| Ablations | logger/ablation.log |
Repository layout
.
├── BLOG.md                    # Full write-up (start here)
├── README.md                  # This file
├── ablation.md                # Ablation results table
├── openenv.yaml               # OpenEnv spec
├── server/                    # FastAPI server + IncidentEnvironment + CodeWorkspace
├── scenarios/                 # 10 scenarios, code-context registry, P2 grader
├── simulation/                # Reactive infra: services, metrics, logs, alerts
├── snapshots/                 # 8 mini-repo snapshots for Phase 2 (tree + git log + diffs)
├── training/                  # GRPO trainer, curriculum, variance gate, segment-GRPO loss
├── sft.py                     # TRL SFTTrainer entry point
├── merge.py                   # peft.merge_and_unload + push_to_hub
├── inference.py               # Run any LLM against the env
├── inference_agent.py         # Run the trained agent against a real repo + GitHub issue
├── sre_finetune_collector.py  # Stage-1 trajectory collector
├── assets/                    # Diagrams + result figures (referenced from BLOG.md)
└── logger/                    # ✅ Full HF Jobs logs + the scripts that produced them
Example interaction
Agent: POST /reset {"task_name": "memory_leak"}
→ Incident triggered: "Orders service experiencing failures..."
→ Services: orders=degraded, rest=healthy
Agent: POST /step {"action_type": "view_alerts"}
→ 3 alerts: orders HighMemoryUsage (critical), HighErrorRate, HighLatencyP99
→ reward = +0.13
Agent: POST /step {"action_type": "check_metrics", "target_service": "orders"}
→ 30 data points: memory climbing 35 % → 78 % over 20 min
→ reward = +0.13
Agent: POST /step {"action_type": "check_deploy_history", "target_service": "orders"}
→ v2.3.1 (20 min ago, "batch order processing") · v1.2.0
→ reward = +0.13
Agent: POST /step {"action_type": "rollback_deploy", "target_service": "orders"}
→ "Rolled back orders v2.3.1 → v1.2.0, service recovering"
→ reward = +0.28
Agent: POST /step {"action_type": "declare_root_cause",
"parameters": {"root_cause": "memory leak in orders caused by bad deploy v2.3.1"}}
→ Episode done. Final grade: 0.97
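The same exchange can be driven from Python with only the standard library. This is a sketch under assumptions: the request bodies mirror the `/reset` and `/step` payloads shown above, but the helper names are hypothetical and the response is assumed to be a JSON object whose exact fields are not documented here.

```python
import json
from urllib import request


def build_step_payload(action_type, target_service=None, parameters=None):
    """Assemble the JSON body for POST /step, omitting unset optional fields."""
    payload = {"action_type": action_type}
    if target_service is not None:
        payload["target_service"] = target_service
    if parameters is not None:
        payload["parameters"] = parameters
    return payload


def post_json(base_url, path, payload, timeout=30):
    """POST `payload` as JSON and decode the (assumed JSON) response body."""
    req = request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server from the quick start running, `post_json("http://localhost:8000", "/reset", {"task_name": "memory_leak"})` starts an episode, and `post_json("http://localhost:8000", "/step", build_step_payload("check_metrics", target_service="orders"))` sends the metrics query from the transcript.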
License & credits
- Environment, training scripts, scenarios: this repo.
- Base model: `Qwen/Qwen2.5-7B-Instruct` (Apache-2.0).
- Trainer: HuggingFace TRL (`SFTTrainer`) and our on-policy GRPO loop in `training/grpo_train.py`.
- Built for the OpenEnv hackathon; see `RULES.md`.
For the full story, results, and ablations, read BLOG.md.