---
title: SRE Incident Response Simulator
emoji: 🚨
colorFrom: red
colorTo: gray
sdk: docker
app_port: 8000
pinned: false
---

# 🚨 SRE Triage Bot: OpenEnv Incident Response Simulator

> An OpenEnv environment + a four-stage GRPO pipeline that turns **Qwen2.5-7B-Instruct** into a working SRE triage agent. Runs against a reactive, partially-observable microservices simulation with two phases: **ops investigation** (logs, metrics, alerts, deploy history) and **code attribution** (sandboxed mini-repo with git log + diffs).

---

## 🔗 Important Links

| Resource | Link |
| --- | --- |
| 📝 **Blog post (full write-up)** | [`BLOG.md`](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/blob/main/BLOG.md) |
| 🛰️ **Live environment (HF Space)** | [Meta-HF-hackathon/updated-policy](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/) |
| 🧠 **Merged model (deployable)** | [`Yaswanth-Bolla/qwen-merged`](https://huggingface.co/Yaswanth-Bolla/qwen-merged) |
| 🧩 **LoRA adapter (post-GRPO)** | [`daemongg/qwen2.5-7b-sre-grpo`](https://huggingface.co/daemongg/qwen2.5-7b-sre-grpo) |
| 🏗️ **Base model** | [`Qwen/Qwen2.5-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
| 📒 **Logs + scripts** | [`logger`](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/tree/main/logger) |

> ⚠️ **Note on training infrastructure.** We ran the full pipeline (SFT, GRPO, merge) on **HuggingFace Jobs** (A100-40GB) instead of a Colab notebook: Colab's free + Pro tiers OOM'd on the 7B base + reference model + GRPO group buffers. The **complete training logs and the exact scripts we executed** are committed under [`./logger/`](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/tree/main/logger) (`sft_finetune.log`, `grpo_finetune.log`, `merge.log`, `trajectory.log`, `ablation.log`, plus the `.py` scripts that produced them) so the run is reproducible end-to-end.

---

## 🎯 What this submission delivers

- A novel **two-phase POMDP** environment with hierarchical, masked actions (10 ops actions + 7 code actions).
- A **two-layer reward**: dense oracle-shaped per-step signal for training, oracle-independent grader for evaluation.
- A **counterfactual cross-phase reward** (`r_cross`) that makes joint training meaningful.
- A four-pool **curriculum** (A → B → C, with held-out D) executed via on-policy **GRPO** with a variance gate and `r_cross` warmup.
- **Real measured improvement**: mean cumulative reward **≈1.59 (RL) vs ≈0.49 (base)** at less than half the steps. See `BLOG.md` §7 and `ablation.md`.

---

## 📐 Environment at a glance

A **Partially-Observable Markov Decision Process** over a reactive microservices simulator. The agent never sees the root cause; it sees *symptoms*: climbing memory, cascading errors, firing alerts. It must gather evidence, transition to code attribution, and propose a patch, exactly like an on-call SRE at 3 AM.

| Dimension | Detail |
|---|---|
| **Observation** | Alerts · metric timeseries · structured logs · dependency graphs · deploy history · sandboxed repo tree + git log |
| **Action space** | Phase 1: 10 ops actions × 7 services. Phase 2: 5 code-exploration + 2 terminal actions. |
| **Difficulty** | Easy (single-service leak) → Medium (cascade) → Hard (distributed deadlock) → 5 research tasks → 2 held-out compounds |
| **Reward** | Oracle-shaped per-step signal for training + oracle-independent grader for eval + counterfactual `r_cross` |
| **Realism** | Reactive simulation: memory climbs, cascades propagate, restarts don't fix root causes |

### Topology

```
┌─────────┐   ┌─────┐    ┌────────┐   ┌─────────┐
│ API GW  │──►│Auth │──► │ Orders │──►│ Payment │
└────┬────┘   └─────┘    └───┬────┘   └────┬────┘
     ▼                       ▼             ▼
┌─────────┐             ┌─────────┐   ┌─────────┐
│  Cache  │             │   DB    │   │  Queue  │
└─────────┘             └─────────┘   └─────────┘
```

---

## 🔧 Action space (hierarchical + masked)

### Phase 1: ops investigation

| Action | Category | Description |
|---|---|---|
| `view_alerts` | diagnostic | List firing alerts |
| `query_logs` | diagnostic | Service logs (level/keyword filters) |
| `check_metrics` | diagnostic | 30-min metric time series |
| `check_dependencies` | diagnostic | Up/downstream dependency map |
| `check_deploy_history` | diagnostic | Recent deploys per service |
| `run_health_check` | diagnostic | Ping a service |
| `restart_service` | remediation | Temporary fix |
| `rollback_deploy` | remediation | Real fix if root cause |
| `scale_service` | remediation | More replicas |
| `declare_root_cause` | terminal | Diagnosis string |
| `transition_to_phase2` | control | Hand off to code attribution |

### Phase 2: code attribution

| Action | What it returns |
|---|---|
| `list_dir` | Files + subdirs at relative path |
| `read_file` | Up to 64 KB of file contents |
| `search_code` | grep across the tree (≤50 hits) |
| `get_git_log` | Commit metadata for a path |
| `get_file_diff` | Unified diff for `(commit_sha, path)` |
| `propose_patch` | Terminal: submit a unified diff |
| `declare_no_change` | Terminal: for spurious-issue scenarios |

> **Action masking:** every observation includes `valid_actions[]`. Illegal actions (e.g. rollback on a service with no deploy history) cost `-0.05` and are recorded for analysis.

---

## 👁️ Observation space (POMDP)

The agent **never** sees: `fault_type`, `is_bad` deploy flag, internal simulation state.

It **does** see:

- Incident summary + severity (`SEV1` / `SEV2` / `SEV3`)
- Service statuses (`healthy` / `degraded` / `down`)
- Active alert count
- Action result (data from the most recent action)
- `valid_actions[]` (action mask)
- Time elapsed / budget (SLA pressure)
- Cumulative reward and step count
- `current_phase` ∈ {1, 2}

---

## 📋 Tasks (10 scenarios, 4 pools)

| Task | Difficulty | Hidden lesson |
|---|---|---|
| `memory_leak` | easy | Single service, noisy metric; restart only buys minutes |
| `cascading_failure` | medium | Loud services aren't the cause; walk the dep graph |
| `distributed_deadlock` | hard | Three remediation actions in a specific order |
| `aliased_fault` | research | Symptoms alias across fault families |
| `severity_inversion` | research | SEV1 page, two-line code fix |
| `confidence_inversion` | research | Loud alerts on the wrong service |
| `info_ordering` | research | Decisive evidence shows up *late* |
| `circuit_breaker_noop` | research | Spurious issue; `declare_no_change` is correct |
| `heldout_aliased_severity` | held-out | Compound; never seen during training |
| `heldout_confidence_ordering` | held-out | Compound; never seen during training |

Pools: **A** (`p1_only`), **B** (`p2_only` with oracle handoff), **C** (`joint` with `r_cross`), **D** (held-out generalisation).

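
As a concrete illustration of the action mask described under "Action space" above, here is a minimal client-side sketch. It assumes the observation arrives as a dict whose `valid_actions` field is a list of action-type strings; the helper `choose_action`, the constant name `ILLEGAL_ACTION_PENALTY`, and the sample observation are hypothetical and not taken from the repo, so the real server schema may differ.

```python
from typing import Any

# Penalty quoted in the action-masking note; the constant name is hypothetical.
ILLEGAL_ACTION_PENALTY = -0.05


def choose_action(observation: dict[str, Any], ranked_candidates: list[str]) -> str:
    """Return the highest-ranked candidate that the action mask allows."""
    valid = set(observation.get("valid_actions", []))
    for action_type in ranked_candidates:
        if action_type in valid:
            return action_type
    raise ValueError("no candidate action is currently valid")


# Hypothetical Phase-1 observation: rollback is masked out, so the
# agent falls through to its next preference.
obs = {
    "current_phase": 1,
    "valid_actions": ["view_alerts", "query_logs", "check_metrics"],
}
picked = choose_action(obs, ["rollback_deploy", "check_metrics", "view_alerts"])
print(picked)  # check_metrics
```

Filtering against the mask before calling `/step` is one way for an agent to avoid the `-0.05` illegal-action penalty; a trained policy may instead learn the mask implicitly from that penalty.
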

---

## 🎁 Reward design (two layers)

### Layer 1: per-step shaped reward (training only)

| Action | Condition | Reward |
|---|---|---|
| Diagnostic | involved service | +0.15 |
| Diagnostic | uninvolved service | +0.05 |
| Any | repeat | −0.05 |
| Remediation | correct target (root cause svc) | +0.30 |
| Remediation | helpful (affected, not root) | +0.10 |
| Remediation | harmful (healthy svc) | −0.15 |
| Declaration | correct root cause | +0.40 |
| Declaration | wrong root cause | −0.20 |
| Any | per-step efficiency cost | −0.02 |
| Completion | all services healthy | +0.20 |
| Completion | budget exceeded | −0.10 |

### Layer 2: oracle-independent grader (evaluation)

| Component | Weight | Measures |
|---|---|---|
| `p1_rca` | 25 % | Did the agent declare the correct root cause? |
| `p1_efficiency` | 15 % | Fewer steps to declare = better |
| `patch_quality` | 35 % | File overlap (Jaccard) + AST hunk similarity + syntax validity |
| `no_change_detection` | 25 % | Correct `declare_no_change` on spurious-issue scenarios |
| `p2_efficiency` | 25 % | Phase-2 step efficiency (replaces `no_change` slot when valid issue) |

Plus the counterfactual cross-phase reward:

```
r_cross(τ) = max(0, r_code(τ_2 | context(τ_1)) − r_code(τ_2 | ∅))
```

---

## 📈 Headline result

| Model | Mean cumulative reward (≈30 steps) | Steps to plateau | σ at plateau |
|---|---|---|---|
| Base (Qwen2.5-7B-Instruct) | ~0.20 | never within 60 | wide |
| SFT (LoRA) | ~0.95 | ~50 | medium |
| **Post-trained (GRPO + merge)** | **~1.59** | **~25** | **tight** |

Full plots, ablations, and component breakdown in [`BLOG.md`](./BLOG.md) §7–8.

---

## 🚀 Quick start

### Run the environment locally

```bash
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

```bash
curl http://localhost:8000/health

curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_name": "memory_leak"}'

curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "view_alerts"}'
```

### Run the trained agent

```bash
export ENV_BASE_URL=http://localhost:8000
python inference.py --model Yaswanth-Bolla/qwen-merged
```

### Run the agent against a real GitHub issue + repo

```bash
python inference_agent.py \
  --model Yaswanth-Bolla/qwen-merged \
  --repo /path/to/cloned/repo \
  --issue https://github.com/owner/repo/issues/42
```

### Docker

```bash
docker build -t incident-env .
docker run -p 8000:8000 incident-env
```

---

## 🏋️ Reproducing the training run

We ran every stage on **HuggingFace Jobs** (A100-40GB); see [`./logger/`](./logger/) for the exact scripts and their full stdout.

```bash
# Stage 1: collect baseline trajectories (HF Inference API)
python sre_finetune_collector.py   # → sre_*_dataset.jsonl

# Stage 2: LoRA SFT via TRL
python sft.py \
  --model_name_or_path Qwen/Qwen2.5-7B-Instruct \
  --dataset_name \
  --use_peft --lora_r 32 --lora_alpha 16 \
  --learning_rate 2e-4 --num_train_epochs 1 \
  --packing --eos_token '<|im_end|>' \
  --output_dir Qwen2.5-7B-SRE-SFT --push_to_hub

# Stage 3+4: online GRPO (Pool A → B → C)
python training/grpo_train.py \
  --model \
  --stages 2 3 4 \
  --group_size 4 --episodes_per_task 64 \
  --use_lora --lora_r 16 --lora_alpha 32 \
  --push_to_hub daemongg/qwen2.5-7b-sre-grpo

# Stage 5: merge LoRA into base
python merge.py
```

Logs from these exact runs:

| Stage | Log |
|---|---|
| Trajectory collection | [`logger/trajectory.log`](./logger/trajectory.log) |
| SFT | [`logger/sft_finetune.log`](./logger/sft_finetune.log) |
| GRPO | [`logger/grpo_finetune.log`](./logger/grpo_finetune.log) |
| Merge | [`logger/merge.log`](./logger/merge.log) |
| Ablations | [`logger/ablation.log`](./logger/ablation.log) |

---

## 🗂️ Repository layout

```
.
├── BLOG.md                    # Full write-up (start here)
├── README.md                  # This file
├── ablation.md                # Ablation results table
├── openenv.yaml               # OpenEnv spec
├── server/                    # FastAPI server + IncidentEnvironment + CodeWorkspace
├── scenarios/                 # 10 scenarios, code-context registry, P2 grader
├── simulation/                # Reactive infra: services, metrics, logs, alerts
├── snapshots/                 # 8 mini-repo snapshots for Phase 2 (tree + git log + diffs)
├── training/                  # GRPO trainer, curriculum, variance gate, segment-GRPO loss
├── sft.py                     # TRL SFTTrainer entry point
├── merge.py                   # peft.merge_and_unload + push_to_hub
├── inference.py               # Run any LLM against the env
├── inference_agent.py         # Run the trained agent against a real repo + GitHub issue
├── sre_finetune_collector.py  # Stage-1 trajectory collector
├── assets/                    # Diagrams + result figures (referenced from BLOG.md)
└── logger/                    # ★ Full HF Jobs logs + the scripts that produced them
```

---

## 💬 Example interaction

```
Agent: POST /reset {"task_name": "memory_leak"}
  → Incident triggered: "Orders service experiencing failures..."
  → Services: orders=degraded, rest=healthy

Agent: POST /step {"action_type": "view_alerts"}
  → 3 alerts: orders HighMemoryUsage (critical), HighErrorRate, HighLatencyP99
  → reward = +0.13

Agent: POST /step {"action_type": "check_metrics", "target_service": "orders"}
  → 30 data points: memory climbing 35 % → 78 % over 20 min
  → reward = +0.13

Agent: POST /step {"action_type": "check_deploy_history", "target_service": "orders"}
  → v2.3.1 (20 min ago, "batch order processing") · v1.2.0
  → reward = +0.13

Agent: POST /step {"action_type": "rollback_deploy", "target_service": "orders"}
  → "Rolled back orders v2.3.1 → v1.2.0 - service recovering"
  → reward = +0.28

Agent: POST /step {"action_type": "declare_root_cause",
                   "parameters": {"root_cause": "memory leak in orders caused by bad deploy v2.3.1"}}
  → Episode done. Final grade: 0.97
```

---

## 📜 License & credits

- Environment, training scripts, scenarios: this repo.
- Base model: `Qwen/Qwen2.5-7B-Instruct` (Apache-2.0).
- Trainer: HuggingFace TRL (`SFTTrainer`) and our on-policy GRPO loop in `training/grpo_train.py`.
- Built for the **OpenEnv hackathon**; see [`RULES.md`](./RULES.md).

For the full story, results, and ablations, read [`BLOG.md`](./BLOG.md).