| title: SRE Incident Response Simulator | |
| emoji: 🚨 | |
| colorFrom: red | |
| colorTo: gray | |
| sdk: docker | |
| app_port: 8000 | |
| pinned: false | |
| # 🚨 SRE Triage Bot: OpenEnv Incident Response Simulator | |
| > An OpenEnv environment + a four-stage GRPO pipeline that turns **Qwen2.5-7B-Instruct** into a working SRE triage agent. Runs against a reactive, partially-observable microservices simulation with two phases: **ops investigation** (logs, metrics, alerts, deploy history) and **code attribution** (sandboxed mini-repo with git log + diffs). | |
| --- | |
| ## Important Links | |
| | Resource | Link | | |
| | --- | --- | | |
| | **Blog post (full write-up)** | [`BLOG.md`](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/blob/main/BLOG.md) | |
| | 🛰️ **Live environment (HF Space)** | [Meta-HF-hackathon/updated-policy](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/) | |
| | **Merged model (deployable)** | [`Yaswanth-Bolla/qwen-merged`](https://huggingface.co/Yaswanth-Bolla/qwen-merged) | |
| | 🧩 **LoRA adapter (post-GRPO)** | [`daemongg/qwen2.5-7b-sre-grpo`](https://huggingface.co/daemongg/qwen2.5-7b-sre-grpo) | |
| | **Base model** | [`Qwen/Qwen2.5-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | |
| | **Logs + scripts** | [`logger`](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/tree/main/logger) | |
| > ⚠️ **Note on training infrastructure.** We ran the full pipeline (SFT, GRPO, merge) on **HuggingFace Jobs** (A100-40GB) instead of a Colab notebook, because Colab's free and Pro tiers ran out of memory on the 7B base + reference model + GRPO group buffers. The **complete training logs and the exact scripts we executed** are committed under [`./logger/`](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/tree/main/logger) (`sft_finetune.log`, `grpo_finetune.log`, `merge.log`, `trajectory.log`, `ablation.log`, plus the `.py` scripts that produced them) so the run is reproducible end-to-end. | |
| --- | |
| ## 🎯 What this submission delivers | |
| - A novel **two-phase POMDP** environment with hierarchical, masked actions (10 ops actions + 7 code actions). | |
| - A **two-layer reward**: dense oracle-shaped per-step signal for training, oracle-independent grader for evaluation. | |
| - A **counterfactual cross-phase reward** (`r_cross`) that makes joint training meaningful. | |
| - A four-pool **curriculum** (A → B → C, with held-out D) executed via on-policy **GRPO** with a variance gate and `r_cross` warmup. | |
| - **Real measured improvement**: mean cumulative reward **≈1.59 (RL) vs ≈0.49 (base)** at less than half the steps. See `BLOG.md` §7 and `ablation.md`. | |
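To make the GRPO bullet above concrete, here is a minimal sketch of a group-relative advantage computation with a variance gate. It assumes the common GRPO normalisation (reward minus group mean, divided by group standard deviation) and treats the gate as "skip groups whose rewards barely differ". The function name `group_advantages`, the `min_std` threshold, and the gating rule itself are illustrative assumptions, not taken from `training/grpo_train.py`.

```python
from statistics import mean, pstdev
from typing import List, Optional


def group_advantages(rewards: List[float], min_std: float = 1e-3) -> Optional[List[float]]:
    """Group-relative advantages for one group of rollouts on the same task.

    Returns None when the group is gated out (hypothetical variance gate):
    if every rollout earned nearly the same reward, the normalised
    advantages carry no useful learning signal.
    """
    if len(rewards) < 2:
        return None
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    if sigma < min_std:      # assumed gating rule
        return None
    return [(r - mu) / sigma for r in rewards]


# A group of four rollouts (matching --group_size 4) with identical rewards is skipped.
identical = group_advantages([0.13, 0.13, 0.13, 0.13])
# A group with spread yields zero-mean advantages.
spread = group_advantages([0.1, 0.4, -0.2, 0.3])
```

Under these assumptions, a gated group contributes no gradient for that update, which avoids dividing by a near-zero standard deviation.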
| --- | |
| ## Environment at a glance | |
| A **Partially-Observable Markov Decision Process** over a reactive microservices simulator. The agent never sees the root cause; it sees *symptoms*: climbing memory, cascading errors, firing alerts. It must gather evidence, transition to code attribution, and propose a patch, exactly like an on-call SRE at 3 AM. | |
| | Dimension | Detail | | |
| |---|---| | |
| | **Observation** | Alerts · metric timeseries · structured logs · dependency graphs · deploy history · sandboxed repo tree + git log | |
| | **Action space** | Phase 1: 10 ops actions × 7 services. Phase 2: 5 code-exploration + 2 terminal actions. | |
| | **Difficulty** | Easy (single-service leak) → Medium (cascade) → Hard (distributed deadlock) → 5 research tasks → 2 held-out compounds | |
| | **Reward** | Oracle-shaped per-step signal for training + oracle-independent grader for eval + counterfactual `r_cross` | |
| | **Realism** | Reactive simulation: memory climbs, cascades propagate, restarts don't fix root causes | |
| ### Topology | |
| ``` | |
| ┌─────────┐   ┌─────┐    ┌────────┐   ┌─────────┐ | |
| │ API GW  │──►│Auth │──► │ Orders │──►│ Payment │ | |
| └────┬────┘   └─────┘    └───┬────┘   └────┬────┘ | |
|      ▼                       ▼             ▼ | |
| ┌─────────┐             ┌─────────┐   ┌─────────┐ | |
| │  Cache  │             │   DB    │   │  Queue  │ | |
| └─────────┘             └─────────┘   └─────────┘ | |
| ``` | |
| --- | |
| ## Action space (hierarchical + masked) | |
| ### Phase 1: ops investigation | |
| | Action | Category | Description | | |
| |---|---|---| | |
| | `view_alerts` | diagnostic | List firing alerts | | |
| | `query_logs` | diagnostic | Service logs (level/keyword filters) | | |
| | `check_metrics` | diagnostic | 30-min metric time series | | |
| | `check_dependencies` | diagnostic | Up/downstream dependency map | | |
| | `check_deploy_history` | diagnostic | Recent deploys per service | | |
| | `run_health_check` | diagnostic | Ping a service | | |
| | `restart_service` | remediation | Temporary fix | | |
| | `rollback_deploy` | remediation | Real fix if root cause | | |
| | `scale_service` | remediation | More replicas | | |
| | `declare_root_cause` | terminal | Diagnosis string | | |
| | `transition_to_phase2` | control | Hand off to code attribution | | |
| ### Phase 2: code attribution | |
| | Action | What it returns | | |
| |---|---| | |
| | `list_dir` | Files + subdirs at relative path | | |
| | `read_file` | Up to 64 KB of file contents | | |
| | `search_code` | grep across the tree (≤50 hits) | |
| | `get_git_log` | Commit metadata for a path | |
| | `get_file_diff` | Unified diff for `(commit_sha, path)` | |
| | `propose_patch` | Terminal: submit a unified diff | |
| | `declare_no_change` | Terminal: for spurious-issue scenarios | |
| > **Action masking:** every observation includes `valid_actions[]`. Illegal actions (e.g. rollback on a service with no deploy history) cost `-0.05` and are recorded for analysis. | |
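A client can avoid the illegal-action penalty by checking the mask before it acts. The sketch below assumes `valid_actions` is a flat list of action-type strings inside the observation dictionary; the real payload may carry more structure (for example per-service masks). The helper names `is_legal` and `choose_action` are hypothetical.

```python
from typing import Any, Dict, Iterable, Optional

ILLEGAL_ACTION_PENALTY = -0.05  # cost quoted in the note above


def is_legal(observation: Dict[str, Any], action_type: str) -> bool:
    """True if action_type appears in the observation's valid_actions mask."""
    return action_type in observation.get("valid_actions", [])


def choose_action(observation: Dict[str, Any], ranked_candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate the mask allows, or None if none is legal."""
    for action_type in ranked_candidates:
        if is_legal(observation, action_type):
            return action_type
    return None


# Example: the policy prefers a rollback, but the mask only allows diagnostics.
obs = {"valid_actions": ["view_alerts", "query_logs", "check_metrics"]}
picked = choose_action(obs, ["rollback_deploy", "check_metrics", "view_alerts"])
```

Falling back to the next-ranked legal action keeps the episode moving without paying `ILLEGAL_ACTION_PENALTY`.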
| --- | |
| ## Observation space (POMDP) | |
| The agent **never** sees: `fault_type`, `is_bad` deploy flag, internal simulation state. | |
| It **does** see: | |
| - Incident summary + severity (`SEV1` / `SEV2` / `SEV3`) | |
| - Service statuses (`healthy` / `degraded` / `down`) | |
| - Active alert count | |
| - Action result (data from the most recent action) | |
| - `valid_actions[]` (action mask) | |
| - Time elapsed / budget (SLA pressure) | |
| - Cumulative reward and step count | |
| - `current_phase` ∈ {1, 2} | |
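The visible fields listed above could be modelled on the client side roughly as follows. Only `valid_actions` and `current_phase` are names taken from this README; every other field name is a hypothetical placeholder, and the actual JSON schema returned by the server may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

SEVERITIES = {"SEV1", "SEV2", "SEV3"}
STATUSES = {"healthy", "degraded", "down"}


@dataclass
class Observation:
    incident_summary: str            # hypothetical field name
    severity: str                    # "SEV1" | "SEV2" | "SEV3"
    service_status: Dict[str, str]   # service -> "healthy" | "degraded" | "down"
    active_alert_count: int
    action_result: Any               # data from the most recent action
    valid_actions: List[str]         # action mask
    time_elapsed: float
    time_budget: float
    cumulative_reward: float
    step_count: int
    current_phase: int               # 1 (ops) or 2 (code attribution)

    def __post_init__(self) -> None:
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        unknown = set(self.service_status.values()) - STATUSES
        if unknown:
            raise ValueError(f"unknown service status: {sorted(unknown)}")
        if self.current_phase not in (1, 2):
            raise ValueError("current_phase must be 1 or 2")

    def unhealthy_services(self) -> List[str]:
        """Services whose status is anything other than healthy."""
        return sorted(s for s, st in self.service_status.items() if st != "healthy")
```

Note that nothing here exposes `fault_type` or the `is_bad` deploy flag, which keeps the client model aligned with the partial observability described above.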
| --- | |
| ## Tasks (10 scenarios, 4 pools) | |
| | Task | Difficulty | Hidden lesson | | |
| |---|---|---| | |
| | `memory_leak` | easy | Single service, noisy metric; restart only buys minutes | |
| | `cascading_failure` | medium | Loud services aren't the cause; walk the dep graph | |
| | `distributed_deadlock` | hard | Three remediation actions in a specific order | | |
| | `aliased_fault` | research | Symptoms alias across fault families | | |
| | `severity_inversion` | research | SEV1 page, two-line code fix | | |
| | `confidence_inversion` | research | Loud alerts on the wrong service | | |
| | `info_ordering` | research | Decisive evidence shows up *late* | | |
| | `circuit_breaker_noop` | research | Spurious issue: `declare_no_change` is correct | |
| | `heldout_aliased_severity` | held-out | Compound; never seen during training | | |
| | `heldout_confidence_ordering` | held-out | Compound; never seen during training | | |
| Pools: **A** (`p1_only`), **B** (`p2_only` with oracle handoff), **C** (`joint` with `r_cross`), **D** (held-out generalisation). | |
| --- | |
| ## Reward design (two layers) | |
| ### Layer 1: per-step shaped reward (training only) | |
| | Action | Condition | Reward | | |
| |---|---|---| | |
| | Diagnostic | involved service | +0.15 | | |
| | Diagnostic | uninvolved service | +0.05 | | |
| | Any | repeat | −0.05 | |
| | Remediation | correct target (root cause svc) | +0.30 | |
| | Remediation | helpful (affected, not root) | +0.10 | |
| | Remediation | harmful (healthy svc) | −0.15 | |
| | Declaration | correct root cause | +0.40 | |
| | Declaration | wrong root cause | −0.20 | |
| | Any | per-step efficiency cost | −0.02 | |
| | Completion | all services healthy | +0.20 | |
| | Completion | budget exceeded | −0.10 | |
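As a worked illustration of the Layer 1 table, the sketch below adds the per-step efficiency cost to the base reward for an action. That additive reading matches the step rewards in this README's example interaction (`+0.13` for a diagnostic on an involved service, `+0.28` for a correct rollback). How the repeat penalty combines with the other terms is not stated here, so the sketch assumes a repeated action earns the repeat penalty in place of its base reward; `step_reward` and the table keys are hypothetical names.

```python
STEP_COST = -0.02       # per-step efficiency cost
REPEAT_PENALTY = -0.05  # applied when an action is repeated

BASE_REWARD = {
    ("diagnostic", "involved"): 0.15,
    ("diagnostic", "uninvolved"): 0.05,
    ("remediation", "correct"): 0.30,
    ("remediation", "helpful"): 0.10,
    ("remediation", "harmful"): -0.15,
    ("declaration", "correct"): 0.40,
    ("declaration", "wrong"): -0.20,
}


def step_reward(category: str, condition: str, repeated: bool = False) -> float:
    """Shaped reward for one step under the additive assumption described above."""
    # Assumption: a repeat replaces the base reward rather than stacking on it.
    base = REPEAT_PENALTY if repeated else BASE_REWARD[(category, condition)]
    return round(base + STEP_COST, 2)


diagnostic = step_reward("diagnostic", "involved")   # 0.15 - 0.02
rollback = step_reward("remediation", "correct")     # 0.30 - 0.02
```

The completion terms (`+0.20` when all services are healthy, `−0.10` when the budget is exceeded) apply once at episode end and are left out of this per-step helper.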
| ### Layer 2: oracle-independent grader (evaluation) | |
| | Component | Weight | Measures | | |
| |---|---|---| | |
| | `p1_rca` | 25 % | Did the agent declare the correct root cause? | | |
| | `p1_efficiency` | 15 % | Fewer steps to declare = better | | |
| | `patch_quality` | 35 % | File overlap (Jaccard) + AST hunk similarity + syntax validity | | |
| | `no_change_detection` | 25 % | Correct `declare_no_change` on spurious-issue scenarios | | |
| | `p2_efficiency` | 25 % | Phase-2 step efficiency (replaces `no_change` slot when valid issue) | | |
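One way to read the weight table is as a weighted sum in which the fourth 25 % slot is filled by `no_change_detection` on spurious-issue scenarios and by `p2_efficiency` otherwise. The sketch below follows that reading and also shows the Jaccard file-overlap term named under `patch_quality`. It assumes every component score lies in [0, 1] and that the combination is linear; `final_grade` and `file_jaccard` are hypothetical helper names, not the grader's actual API.

```python
from typing import Dict, Set

FIXED_WEIGHTS = {"p1_rca": 0.25, "p1_efficiency": 0.15, "patch_quality": 0.35}
FOURTH_SLOT_WEIGHT = 0.25


def file_jaccard(proposed: Set[str], reference: Set[str]) -> float:
    """Jaccard overlap between the files a patch touches and the reference files."""
    if not proposed and not reference:
        return 1.0  # assumption: two empty sets count as a perfect match
    return len(proposed & reference) / len(proposed | reference)


def final_grade(scores: Dict[str, float], spurious_issue: bool) -> float:
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    fourth = "no_change_detection" if spurious_issue else "p2_efficiency"
    total = sum(weight * scores[name] for name, weight in FIXED_WEIGHTS.items())
    total += FOURTH_SLOT_WEIGHT * scores[fourth]
    return round(total, 4)


example = final_grade(
    {"p1_rca": 1.0, "p1_efficiency": 0.5, "patch_quality": 0.0, "p2_efficiency": 1.0},
    spurious_issue=False,
)
```

With the slot swap, the four active weights always sum to 100 %, so a perfect episode grades at 1.0 under either scenario type.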
| Plus the counterfactual cross-phase reward: | |
| ``` | |
| r_cross(τ) = max(0, r_code(τ_2 | context(τ_1)) − r_code(τ_2 | ∅)) | |
| ``` | |
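In code, the formula above is a clipped difference between two scores for the same Phase-2 trajectory: one computed with the Phase-1 context and one computed without it. The sketch below assumes both `r_code` values are already available as floats; the function and parameter names are illustrative.

```python
def r_cross(r_code_with_context: float, r_code_without_context: float) -> float:
    """Counterfactual cross-phase reward: credit only the gain due to Phase-1 context."""
    return max(0.0, r_code_with_context - r_code_without_context)


helped = r_cross(0.8, 0.5)      # context improved the Phase-2 score
not_helped = r_cross(0.2, 0.6)  # context did not help, so the reward is clipped to 0
```

The clipping at zero means a Phase-1 investigation is never punished through this term; it simply earns nothing when its context adds no value to Phase 2.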
| --- | |
| ## Headline result | |
| | Model | Mean cumulative reward (≈30 steps) | Steps to plateau | σ at plateau | |
| |---|---|---|---| | |
| | Base (Qwen2.5-7B-Instruct) | ~0.20 | never within 60 | wide | | |
| | SFT (LoRA) | ~0.95 | ~50 | medium | | |
| | **Post-trained (GRPO + merge)** | **~1.59** | **~25** | **tight** | | |
| Full plots, ablations, and component breakdown in [`BLOG.md`](./BLOG.md) §7–8. | |
| --- | |
| ## Quick start | |
| ### Run the environment locally | |
| ```bash | |
| pip install -e . | |
| uvicorn server.app:app --host 0.0.0.0 --port 8000 | |
| ``` | |
| ```bash | |
| curl http://localhost:8000/health | |
| curl -X POST http://localhost:8000/reset \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"task_name": "memory_leak"}' | |
| curl -X POST http://localhost:8000/step \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"action_type": "view_alerts"}' | |
| ``` | |
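The same calls can be made from Python with only the standard library. This is a minimal sketch: the endpoints and payloads mirror the `curl` commands above, while the assumption that both endpoints return JSON bodies, and the helper names `build_request`, `post_json`, and `run_first_step`, are not confirmed by this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_request(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a JSON POST request for the environment server."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict, base_url: str = BASE_URL, timeout: float = 30.0) -> dict:
    """Send the request and decode the response (assumed to be JSON)."""
    with urllib.request.urlopen(build_request(path, payload, base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_first_step() -> tuple:
    """Reset to the memory_leak task and take one step; needs the server running."""
    observation = post_json("/reset", {"task_name": "memory_leak"})
    result = post_json("/step", {"action_type": "view_alerts"})
    return observation, result
```

Call `run_first_step()` once the `uvicorn` server from the previous block is listening on port 8000.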
| ### Run the trained agent | |
| ```bash | |
| export ENV_BASE_URL=http://localhost:8000 | |
| python inference.py --model Yaswanth-Bolla/qwen-merged | |
| ``` | |
| ### Run the agent against a real GitHub issue + repo | |
| ```bash | |
| python inference_agent.py \ | |
| --model Yaswanth-Bolla/qwen-merged \ | |
| --repo /path/to/cloned/repo \ | |
| --issue https://github.com/owner/repo/issues/42 | |
| ``` | |
| ### Docker | |
| ```bash | |
| docker build -t incident-env . | |
| docker run -p 8000:8000 incident-env | |
| ``` | |
| --- | |
| ## Reproducing the training run | |
| We ran every stage on **HuggingFace Jobs** (A100-40GB); see [`./logger/`](./logger/) for the exact scripts and their full stdout. | |
| ```bash | |
| # Stage 1: collect baseline trajectories (HF Inference API) | |
| python sre_finetune_collector.py # → sre_*_dataset.jsonl | |
| # Stage 2: LoRA SFT via TRL | |
| python sft.py \ | |
|   --model_name_or_path Qwen/Qwen2.5-7B-Instruct \ | |
|   --dataset_name <your-sft-dataset> \ | |
|   --use_peft --lora_r 32 --lora_alpha 16 \ | |
|   --learning_rate 2e-4 --num_train_epochs 1 \ | |
|   --packing --eos_token '<|im_end|>' \ | |
|   --output_dir Qwen2.5-7B-SRE-SFT --push_to_hub | |
| # Stage 3+4: online GRPO (Pool A → B → C) | |
| python training/grpo_train.py \ | |
|   --model <your-sft-checkpoint> \ | |
|   --stages 2 3 4 \ | |
|   --group_size 4 --episodes_per_task 64 \ | |
|   --use_lora --lora_r 16 --lora_alpha 32 \ | |
|   --push_to_hub daemongg/qwen2.5-7b-sre-grpo | |
| # Stage 5: merge LoRA into base | |
| python merge.py | |
| ``` | |
| Logs from these exact runs: | |
| | Stage | Log | | |
| |---|---| | |
| | Trajectory collection | [`logger/trajectory.log`](./logger/trajectory.log) | | |
| | SFT | [`logger/sft_finetune.log`](./logger/sft_finetune.log) | | |
| | GRPO | [`logger/grpo_finetune.log`](./logger/grpo_finetune.log) | | |
| | Merge | [`logger/merge.log`](./logger/merge.log) | | |
| | Ablations | [`logger/ablation.log`](./logger/ablation.log) | | |
| --- | |
| ## Repository layout | |
| ``` | |
| . | |
| ├── BLOG.md                    # Full write-up (start here) | |
| ├── README.md                  # This file | |
| ├── ablation.md                # Ablation results table | |
| ├── openenv.yaml               # OpenEnv spec | |
| ├── server/                    # FastAPI server + IncidentEnvironment + CodeWorkspace | |
| ├── scenarios/                 # 10 scenarios, code-context registry, P2 grader | |
| ├── simulation/                # Reactive infra: services, metrics, logs, alerts | |
| ├── snapshots/                 # 8 mini-repo snapshots for Phase 2 (tree + git log + diffs) | |
| ├── training/                  # GRPO trainer, curriculum, variance gate, segment-GRPO loss | |
| ├── sft.py                     # TRL SFTTrainer entry point | |
| ├── merge.py                   # peft.merge_and_unload + push_to_hub | |
| ├── inference.py               # Run any LLM against the env | |
| ├── inference_agent.py         # Run the trained agent against a real repo + GitHub issue | |
| ├── sre_finetune_collector.py  # Stage-1 trajectory collector | |
| ├── assets/                    # Diagrams + result figures (referenced from BLOG.md) | |
| └── logger/                    # Full HF Jobs logs + the scripts that produced them | |
| ``` | |
| --- | |
| ## 💬 Example interaction | |
| ``` | |
| Agent: POST /reset {"task_name": "memory_leak"} | |
|   → Incident triggered: "Orders service experiencing failures..." | |
|   → Services: orders=degraded, rest=healthy | |
| Agent: POST /step {"action_type": "view_alerts"} | |
|   → 3 alerts: orders HighMemoryUsage (critical), HighErrorRate, HighLatencyP99 | |
|   → reward = +0.13 | |
| Agent: POST /step {"action_type": "check_metrics", "target_service": "orders"} | |
|   → 30 data points: memory climbing 35 % → 78 % over 20 min | |
|   → reward = +0.13 | |
| Agent: POST /step {"action_type": "check_deploy_history", "target_service": "orders"} | |
|   → v2.3.1 (20 min ago, "batch order processing") · v1.2.0 | |
|   → reward = +0.13 | |
| Agent: POST /step {"action_type": "rollback_deploy", "target_service": "orders"} | |
|   → "Rolled back orders v2.3.1 → v1.2.0 - service recovering" | |
|   → reward = +0.28 | |
| Agent: POST /step {"action_type": "declare_root_cause", | |
|   "parameters": {"root_cause": "memory leak in orders caused by bad deploy v2.3.1"}} | |
|   → Episode done. Final grade: 0.97 | |
| ``` | |
| --- | |
| ## License & credits | |
| - Environment, training scripts, scenarios: this repo. | |
| - Base model: `Qwen/Qwen2.5-7B-Instruct` (Apache-2.0). | |
| - Trainer: HuggingFace TRL (`SFTTrainer`) and our on-policy GRPO loop in `training/grpo_train.py`. | |
| - Built for the **OpenEnv hackathon**; see [`RULES.md`](./RULES.md). | |
| For the full story, results, and ablations, read [`BLOG.md`](./BLOG.md). | |