---
title: SRE Incident Response Simulator
emoji: 🚨
colorFrom: red
colorTo: gray
sdk: docker
app_port: 8000
pinned: false
---
# 🚨 SRE Triage Bot: OpenEnv Incident Response Simulator
> An OpenEnv environment + a four-stage GRPO pipeline that turns **Qwen2.5-7B-Instruct** into a working SRE triage agent. Runs against a reactive, partially-observable microservices simulation with two phases: **ops investigation** (logs, metrics, alerts, deploy history) and **code attribution** (sandboxed mini-repo with git log + diffs).
---
## 🔗 Important Links
| Resource | Link |
| --- | --- |
| 📝 **Blog post (full write-up)** | [`BLOG.md`](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/blob/main/BLOG.md) |
| 🛰️ **Live environment (HF Space)** | [Meta-HF-hackathon/updated-policy](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/) |
| 🧠 **Merged model (deployable)** | [`Yaswanth-Bolla/qwen-merged`](https://huggingface.co/Yaswanth-Bolla/qwen-merged) |
| 🧩 **LoRA adapter (post-GRPO)** | [`daemongg/qwen2.5-7b-sre-grpo`](https://huggingface.co/daemongg/qwen2.5-7b-sre-grpo) |
| 🏗️ **Base model** | [`Qwen/Qwen2.5-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
| 📒 **Logs + scripts** | [`logger`](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/tree/main/logger) |
> ⚠️ **Note on training infrastructure.** We ran the full pipeline (SFT, GRPO, merge) on **HuggingFace Jobs** (A100-40GB) instead of a Colab notebook, because Colab's free and Pro tiers ran out of memory on the 7B base + reference model + GRPO group buffers. The **complete training logs and the exact scripts we executed** are committed under [`./logger/`](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/tree/main/logger) (`sft_finetune.log`, `grpo_finetune.log`, `merge.log`, `trajectory.log`, `ablation.log`, plus the `.py` scripts that produced them) so the run is reproducible end-to-end.
---
## 🎯 What this submission delivers
- A novel **two-phase POMDP** environment with hierarchical, masked actions (10 ops actions + 7 code actions).
- A **two-layer reward**: a dense, oracle-shaped per-step signal for training and an oracle-independent grader for evaluation.
- A **counterfactual cross-phase reward** (`r_cross`) that makes joint training meaningful.
- A four-pool **curriculum** (A → B → C, with held-out D) executed via on-policy **GRPO** with a variance gate and `r_cross` warmup.
- **Real measured improvement**: mean cumulative reward **≈1.59 (RL) vs ≈0.49 (base)** at less than half the steps. See `BLOG.md` §7 and `ablation.md`.
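
The GRPO variance gate mentioned above can be sketched as follows. This is a minimal illustration under assumptions, not the code in `training/grpo_train.py`: the function name `group_advantages` and the threshold `min_std` are hypothetical, and the gate is taken to mean "skip a rollout group whose rewards barely differ", which is one common reading of the term.

```python
from statistics import mean, pstdev
from typing import List, Optional


def group_advantages(
    rewards: List[float],
    min_std: float = 1e-3,  # hypothetical gate threshold, not taken from the repo
    eps: float = 1e-8,
) -> Optional[List[float]]:
    """Return group-relative advantages, or None when the gate rejects the group."""
    if len(rewards) < 2:
        return None  # a single rollout has no group baseline
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the rollout group
    if sigma < min_std:
        return None  # all rollouts scored alike, so there is no learning signal
    # Standard GRPO-style normalisation: centre on the group mean, scale by std.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task, mirroring --group_size 4 from the training command.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])
flat = group_advantages([0.5, 0.5, 0.5, 0.5])
print(mixed)  # roughly [1.0, -1.0, 1.0, -1.0]
print(flat)   # None
```

Dropping flat groups avoids spending gradient steps on tasks the policy already solves (or fails) uniformly, which is presumably why the gate is paired with a curriculum.
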
---
## 📐 Environment at a glance
A **Partially-Observable Markov Decision Process** over a reactive microservices simulator. The agent never sees the root cause; it sees only *symptoms*: climbing memory, cascading errors, firing alerts. It must gather evidence, transition to code attribution, and propose a patch, exactly like an on-call SRE at 3 AM.
| Dimension | Detail |
|---|---|
| **Observation** | Alerts · metric timeseries · structured logs · dependency graphs · deploy history · sandboxed repo tree + git log |
| **Action space** | Phase 1: 10 ops actions × 7 services. Phase 2: 5 code-exploration + 2 terminal actions. |
| **Difficulty** | Easy (single-service leak) → Medium (cascade) → Hard (distributed deadlock) → 5 research tasks → 2 held-out compounds |
| **Reward** | Oracle-shaped per-step signal for training + oracle-independent grader for eval + counterfactual `r_cross` |
| **Realism** | Reactive simulation: memory climbs, cascades propagate, restarts don't fix root causes |
### Topology
```
┌─────────┐   ┌─────┐    ┌────────┐   ┌─────────┐
│ API GW  │──►│Auth │──► │ Orders │──►│ Payment │
└────┬────┘   └─────┘    └───┬────┘   └────┬────┘
     ▼                       ▼             ▼
┌─────────┐             ┌─────────┐   ┌─────────┐
│  Cache  │             │   DB    │   │  Queue  │
└─────────┘             └─────────┘   └─────────┘
```
---
## 🔧 Action space (hierarchical + masked)
### Phase 1: ops investigation
| Action | Category | Description |
|---|---|---|
| `view_alerts` | diagnostic | List firing alerts |
| `query_logs` | diagnostic | Service logs (level/keyword filters) |
| `check_metrics` | diagnostic | 30-min metric time series |
| `check_dependencies` | diagnostic | Up/downstream dependency map |
| `check_deploy_history` | diagnostic | Recent deploys per service |
| `run_health_check` | diagnostic | Ping a service |
| `restart_service` | remediation | Temporary fix |
| `rollback_deploy` | remediation | Real fix if root cause |
| `scale_service` | remediation | More replicas |
| `declare_root_cause` | terminal | Diagnosis string |
| `transition_to_phase2` | control | Hand off to code attribution |
### Phase 2: code attribution
| Action | What it returns |
|---|---|
| `list_dir` | Files + subdirs at relative path |
| `read_file` | Up to 64 KB of file contents |
| `search_code` | grep across the tree (≤50 hits) |
| `get_git_log` | Commit metadata for a path |
| `get_file_diff` | Unified diff for `(commit_sha, path)` |
| `propose_patch` | Terminal: submit a unified diff |
| `declare_no_change` | Terminal: for spurious-issue scenarios |
> **Action masking:** every observation includes `valid_actions[]`. Illegal actions (e.g. rollback on a service with no deploy history) cost `-0.05` and are recorded for analysis.
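
A client can avoid the `-0.05` penalty by checking a proposed action against the mask before sending it. The sketch below assumes `valid_actions` is a flat list of action-type strings; the real payload may be richer (for example, per-service validity). The helper `apply_mask` and its `fallback` default are hypothetical names introduced for illustration.

```python
from typing import Dict, List, Tuple


def apply_mask(
    proposed: Dict[str, str],
    valid_actions: List[str],
    fallback: str = "view_alerts",  # hypothetical safe default
) -> Tuple[Dict[str, str], bool]:
    """Return (action to send, whether the proposal was replaced)."""
    if not valid_actions:
        raise ValueError("observation carried an empty action mask")
    if proposed.get("action_type") in valid_actions:
        return proposed, False
    # Prefer the cheap diagnostic fallback; otherwise take the first legal action.
    chosen = fallback if fallback in valid_actions else valid_actions[0]
    return {"action_type": chosen}, True


mask = ["view_alerts", "query_logs", "check_metrics"]
action, replaced = apply_mask(
    {"action_type": "rollback_deploy", "target_service": "cache"}, mask
)
print(action, replaced)  # {'action_type': 'view_alerts'} True
```

Whether to substitute a fallback or let the policy pay the penalty is a design choice: during training, letting the penalty through is what teaches the policy to respect the mask.
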
---
## 👁️ Observation space (POMDP)
The agent **never** sees: `fault_type`, `is_bad` deploy flag, internal simulation state.
It **does** see:
- Incident summary + severity (`SEV1` / `SEV2` / `SEV3`)
- Service statuses (`healthy` / `degraded` / `down`)
- Active alert count
- Action result (data from the most recent action)
- `valid_actions[]` (action mask)
- Time elapsed / budget (SLA pressure)
- Cumulative reward and step count
- `current_phase` ∈ {1, 2}
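
As a rough picture of what a client holds after each step, the visible fields can be modelled as a small dataclass. Only `valid_actions` and `current_phase` are names taken from this README; every other field name, and all sample values, are assumptions made for illustration and may not match the server's JSON keys.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Observation:
    # Field names are illustrative except valid_actions and current_phase.
    incident_summary: str
    severity: str                     # "SEV1" | "SEV2" | "SEV3"
    service_statuses: Dict[str, str]  # service -> "healthy" | "degraded" | "down"
    active_alert_count: int
    action_result: Any                # data from the most recent action
    valid_actions: List[str]          # action mask
    time_elapsed: float
    time_budget: float
    cumulative_reward: float
    step_count: int
    current_phase: int                # 1 = ops investigation, 2 = code attribution

    def __post_init__(self) -> None:
        if self.severity not in {"SEV1", "SEV2", "SEV3"}:
            raise ValueError(f"unexpected severity: {self.severity}")
        if self.current_phase not in (1, 2):
            raise ValueError(f"unexpected phase: {self.current_phase}")

    def unhealthy_services(self) -> List[str]:
        """Services worth investigating first."""
        return sorted(s for s, st in self.service_statuses.items() if st != "healthy")

    def budget_remaining(self) -> float:
        return max(0.0, self.time_budget - self.time_elapsed)


# Made-up sample values, loosely modelled on the memory_leak task.
obs = Observation(
    incident_summary="Orders service experiencing failures",
    severity="SEV2",
    service_statuses={"orders": "degraded", "db": "healthy", "cache": "healthy"},
    active_alert_count=3,
    action_result=None,
    valid_actions=["view_alerts", "query_logs"],
    time_elapsed=5.0,
    time_budget=60.0,
    cumulative_reward=0.0,
    step_count=0,
    current_phase=1,
)
print(obs.unhealthy_services(), obs.budget_remaining())  # ['orders'] 55.0
```
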
---
## 📋 Tasks (10 scenarios, 4 pools)
| Task | Difficulty | Hidden lesson |
|---|---|---|
| `memory_leak` | easy | Single service, noisy metric; a restart only buys minutes |
| `cascading_failure` | medium | Loud services aren't the cause; walk the dependency graph |
| `distributed_deadlock` | hard | Three remediation actions in a specific order |
| `aliased_fault` | research | Symptoms alias across fault families |
| `severity_inversion` | research | SEV1 page, two-line code fix |
| `confidence_inversion` | research | Loud alerts on the wrong service |
| `info_ordering` | research | Decisive evidence shows up *late* |
| `circuit_breaker_noop` | research | Spurious issue β€” `declare_no_change` is correct |
| `heldout_aliased_severity` | held-out | Compound; never seen during training |
| `heldout_confidence_ordering` | held-out | Compound; never seen during training |
Pools: **A** (`p1_only`), **B** (`p2_only` with oracle handoff), **C** (`joint` with `r_cross`), **D** (held-out generalisation).
---
## 🎁 Reward design (two layers)
### Layer 1: per-step shaped reward (training only)
| Action | Condition | Reward |
|---|---|---|
| Diagnostic | involved service | +0.15 |
| Diagnostic | uninvolved service | +0.05 |
| Any | repeat | −0.05 |
| Remediation | correct target (root cause svc) | +0.30 |
| Remediation | helpful (affected, not root) | +0.10 |
| Remediation | harmful (healthy svc) | −0.15 |
| Declaration | correct root cause | +0.40 |
| Declaration | wrong root cause | −0.20 |
| Any | per-step efficiency cost | −0.02 |
| Completion | all services healthy | +0.20 |
| Completion | budget exceeded | −0.10 |
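
The table composes additively, which the following sketch makes concrete. It assumes the per-step cost is charged on every step (consistent with the `+0.13` and `+0.28` rewards in the example interaction) and that a repeated action earns the repeat penalty instead of its base reward; the second assumption is not confirmed by the table. The names `step_reward`, `BASE_REWARD`, and the condition keys are hypothetical.

```python
from typing import Dict, Optional, Tuple

STEP_COST = -0.02       # "Any / per-step efficiency cost" row
REPEAT_PENALTY = -0.05  # "Any / repeat" row

BASE_REWARD: Dict[Tuple[str, str], float] = {
    ("diagnostic", "involved"): 0.15,
    ("diagnostic", "uninvolved"): 0.05,
    ("remediation", "correct_target"): 0.30,
    ("remediation", "helpful"): 0.10,
    ("remediation", "harmful"): -0.15,
    ("declaration", "correct"): 0.40,
    ("declaration", "wrong"): -0.20,
}
COMPLETION: Dict[str, float] = {"all_healthy": 0.20, "budget_exceeded": -0.10}


def step_reward(
    category: str,
    condition: str,
    repeat: bool = False,
    completion: Optional[str] = None,
) -> float:
    # Assumption: a repeat replaces the base reward rather than stacking on it.
    base = REPEAT_PENALTY if repeat else BASE_REWARD[(category, condition)]
    bonus = COMPLETION[completion] if completion else 0.0
    return round(base + STEP_COST + bonus, 2)


print(step_reward("diagnostic", "involved"))         # 0.13
print(step_reward("remediation", "correct_target"))  # 0.28
```
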
### Layer 2: oracle-independent grader (evaluation)
| Component | Weight | Measures |
|---|---|---|
| `p1_rca` | 25 % | Did the agent declare the correct root cause? |
| `p1_efficiency` | 15 % | Fewer steps to declare = better |
| `patch_quality` | 35 % | File overlap (Jaccard) + AST hunk similarity + syntax validity |
| `no_change_detection` | 25 % | Correct `declare_no_change` on spurious-issue scenarios |
| `p2_efficiency` | 25 % | Phase-2 step efficiency (replaces the `no_change_detection` slot when the issue is valid) |
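
Read as a weighted sum, the grader table always totals 100 %: the final 25 % slot is filled by `no_change_detection` on spurious-issue scenarios and by `p2_efficiency` otherwise. The sketch below assumes each component is already a score in [0, 1]; the function name `final_grade` and the sample scores are hypothetical, and how the real grader computes each component is not shown here.

```python
from typing import Dict


def final_grade(scores: Dict[str, float], spurious_issue: bool) -> float:
    """Combine component scores using the weights from the table above."""
    # The last 25 % slot depends on whether the scenario is a spurious issue.
    fourth = "no_change_detection" if spurious_issue else "p2_efficiency"
    weights = {
        "p1_rca": 0.25,
        "p1_efficiency": 0.15,
        "patch_quality": 0.35,
        fourth: 0.25,
    }
    total = 0.0
    for name, weight in weights.items():
        score = scores[name]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {score}")
        total += weight * score
    return round(total, 4)


# Made-up component scores for a valid-issue episode.
example = final_grade(
    {"p1_rca": 1.0, "p1_efficiency": 0.8, "patch_quality": 0.6, "p2_efficiency": 0.5},
    spurious_issue=False,
)
print(example)  # about 0.705
```
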
Plus the counterfactual cross-phase reward:
```
r_cross(τ) = max(0, r_code(τ_2 | context(τ_1)) − r_code(τ_2 | ∅))
```
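
In code, the formula is a clipped difference between two Phase-2 code rewards: one obtained with the Phase-1 context, one without. The sketch below takes both rewards as plain floats; how the environment actually produces the no-context counterfactual is not specified here, and the argument names are hypothetical.

```python
def r_cross(r_code_with_context: float, r_code_without_context: float) -> float:
    """Counterfactual cross-phase reward: credit only the gain Phase 1 provided."""
    return max(0.0, r_code_with_context - r_code_without_context)


# Phase-1 evidence helped the patch: positive credit.
print(r_cross(0.8, 0.3))  # about 0.5
# Phase-1 context did not help (or hurt): clipped to zero, never negative.
print(r_cross(0.2, 0.6))  # 0.0
```

The clip at zero means a misleading investigation is not punished through `r_cross` itself; it only forfeits the bonus.
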
---
## 📈 Headline result
| Model | Mean cumulative reward (≈30 steps) | Steps to plateau | σ at plateau |
|---|---|---|---|
| Base (Qwen2.5-7B-Instruct) | ~0.20 | never within 60 | wide |
| SFT (LoRA) | ~0.95 | ~50 | medium |
| **Post-trained (GRPO + merge)** | **~1.59** | **~25** | **tight** |
Full plots, ablations, and component breakdown in [`BLOG.md`](./BLOG.md) §7–8.
---
## 🚀 Quick start
### Run the environment locally
```bash
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```
```bash
# Liveness check
curl http://localhost:8000/health
# Start an episode on the easy task
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_name": "memory_leak"}'
# Take one diagnostic action
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "view_alerts"}'
```
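
The same calls can be made from Python with only the standard library. This is a minimal sketch that assumes the server replies with JSON bodies; the helpers `build_request`, `post`, and `first_two_steps` are hypothetical names, while the paths and payload keys are the ones used in the `curl` commands and the example interaction.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:8000"  # matches the uvicorn command above


def build_request(
    path: str, payload: Dict[str, Any], base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        url=f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post(path: str, payload: Dict[str, Any], timeout: float = 30.0) -> Dict[str, Any]:
    """Send the request and decode the reply (assumed to be JSON)."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def first_two_steps() -> Dict[str, Any]:
    """Reset onto memory_leak, list alerts, then check metrics for one service."""
    post("/reset", {"task_name": "memory_leak"})
    post("/step", {"action_type": "view_alerts"})
    return post("/step", {"action_type": "check_metrics", "target_service": "orders"})


# With the server running, call first_two_steps() to drive a short episode.
```
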
### Run the trained agent
```bash
export ENV_BASE_URL=http://localhost:8000
python inference.py --model Yaswanth-Bolla/qwen-merged
```
### Run the agent against a real GitHub issue + repo
```bash
python inference_agent.py \
  --model Yaswanth-Bolla/qwen-merged \
  --repo /path/to/cloned/repo \
  --issue https://github.com/owner/repo/issues/42
```
### Docker
```bash
docker build -t incident-env .
docker run -p 8000:8000 incident-env
```
---
## 🏋️ Reproducing the training run
We ran every stage on **HuggingFace Jobs** (A100-40GB); see [`./logger/`](./logger/) for the exact scripts and their full stdout.
```bash
# Stage 1: collect baseline trajectories (HF Inference API)
python sre_finetune_collector.py          # → sre_*_dataset.jsonl
# Stage 2: LoRA SFT via TRL
python sft.py \
  --model_name_or_path Qwen/Qwen2.5-7B-Instruct \
  --dataset_name <your-sft-dataset> \
  --use_peft --lora_r 32 --lora_alpha 16 \
  --learning_rate 2e-4 --num_train_epochs 1 \
  --packing --eos_token '<|im_end|>' \
  --output_dir Qwen2.5-7B-SRE-SFT --push_to_hub
# Stage 3+4: online GRPO (Pool A → B → C)
python training/grpo_train.py \
  --model <your-sft-checkpoint> \
  --stages 2 3 4 \
  --group_size 4 --episodes_per_task 64 \
  --use_lora --lora_r 16 --lora_alpha 32 \
  --push_to_hub daemongg/qwen2.5-7b-sre-grpo
# Stage 5: merge LoRA into base
python merge.py
```
Logs from these exact runs:
| Stage | Log |
|---|---|
| Trajectory collection | [`logger/trajectory.log`](./logger/trajectory.log) |
| SFT | [`logger/sft_finetune.log`](./logger/sft_finetune.log) |
| GRPO | [`logger/grpo_finetune.log`](./logger/grpo_finetune.log) |
| Merge | [`logger/merge.log`](./logger/merge.log) |
| Ablations | [`logger/ablation.log`](./logger/ablation.log) |
---
## 🗂️ Repository layout
```
.
├── BLOG.md                    # Full write-up (start here)
├── README.md                  # This file
├── ablation.md                # Ablation results table
├── openenv.yaml               # OpenEnv spec
├── server/                    # FastAPI server + IncidentEnvironment + CodeWorkspace
├── scenarios/                 # 10 scenarios, code-context registry, P2 grader
├── simulation/                # Reactive infra: services, metrics, logs, alerts
├── snapshots/                 # 8 mini-repo snapshots for Phase 2 (tree + git log + diffs)
├── training/                  # GRPO trainer, curriculum, variance gate, segment-GRPO loss
├── sft.py                     # TRL SFTTrainer entry point
├── merge.py                   # peft.merge_and_unload + push_to_hub
├── inference.py               # Run any LLM against the env
├── inference_agent.py         # Run the trained agent against a real repo + GitHub issue
├── sre_finetune_collector.py  # Stage-1 trajectory collector
├── assets/                    # Diagrams + result figures (referenced from BLOG.md)
└── logger/                    # ★ Full HF Jobs logs + the scripts that produced them
```
---
## 💬 Example interaction
```
Agent: POST /reset {"task_name": "memory_leak"}
       → Incident triggered: "Orders service experiencing failures..."
       → Services: orders=degraded, rest=healthy
Agent: POST /step {"action_type": "view_alerts"}
       → 3 alerts: orders HighMemoryUsage (critical), HighErrorRate, HighLatencyP99
       → reward = +0.13
Agent: POST /step {"action_type": "check_metrics", "target_service": "orders"}
       → 30 data points: memory climbing 35 % → 78 % over 20 min
       → reward = +0.13
Agent: POST /step {"action_type": "check_deploy_history", "target_service": "orders"}
       → v2.3.1 (20 min ago, "batch order processing") · v1.2.0
       → reward = +0.13
Agent: POST /step {"action_type": "rollback_deploy", "target_service": "orders"}
       → "Rolled back orders v2.3.1 → v1.2.0 - service recovering"
       → reward = +0.28
Agent: POST /step {"action_type": "declare_root_cause",
                   "parameters": {"root_cause": "memory leak in orders caused by bad deploy v2.3.1"}}
       → Episode done. Final grade: 0.97
```
---
## 📜 License & credits
- Environment, training scripts, scenarios: this repo.
- Base model: `Qwen/Qwen2.5-7B-Instruct` (Apache-2.0).
- Trainer: HuggingFace TRL (`SFTTrainer`) and our on-policy GRPO loop in `training/grpo_train.py`.
- Built for the **OpenEnv hackathon** (see [`RULES.md`](./RULES.md)).
For the full story, results, and ablations, read [`BLOG.md`](./BLOG.md).