---
title: SRE Incident Response Simulator
emoji: 🚨
colorFrom: red
colorTo: gray
sdk: docker
app_port: 8000
pinned: false
---

# 🚨 SRE Triage Bot — OpenEnv Incident Response Simulator

> An OpenEnv environment + a four-stage GRPO pipeline that turns **Qwen2.5-7B-Instruct** into a working SRE triage agent. The agent runs against a reactive, partially observable microservices simulation with two phases: **ops investigation** (logs, metrics, alerts, deploy history) and **code attribution** (sandboxed mini-repo with git log + diffs).

---

## 🔗 Important Links

| Resource | Link |
| --- | --- |
| 📝 **Blog post (full write-up)** | [`BLOG.md`](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/blob/main/BLOG.md) |
| 🛰️ **Live environment (HF Space)** | [Meta-HF-hackathon/updated-policy](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/) |
| 🧠 **Merged model (deployable)** | [`Yaswanth-Bolla/qwen-merged`](https://huggingface.co/Yaswanth-Bolla/qwen-merged) |
| 🧩 **LoRA adapter (post-GRPO)** | [`daemongg/qwen2.5-7b-sre-grpo`](https://huggingface.co/daemongg/qwen2.5-7b-sre-grpo) |
| 🏗️ **Base model** | [`Qwen/Qwen2.5-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
| 📒 **Logs + scripts** | [`logger`](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/tree/main/logger) |

> ⚠️ **Note on training infrastructure.** We ran the full pipeline (SFT, GRPO, merge) on **HuggingFace Jobs** (A100-40GB) instead of a Colab notebook — Colab's free + Pro tiers OOM'd on the 7B base + reference model + GRPO group buffers. The **complete training logs and the exact scripts we executed** are committed under [`./logger/`](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/tree/main/logger) (`sft_finetune.log`, `grpo_finetune.log`, `merge.log`, `trajectory.log`, `ablation.log`, plus the `.py` scripts that produced them) so the run is reproducible end-to-end.

---

## 🎯 What this submission delivers

- A novel **two-phase POMDP** environment with hierarchical, masked actions (10 ops actions plus a phase-transition control action, then 7 code actions).
- A **two-layer reward** — dense oracle-shaped per-step signal for training, oracle-independent grader for evaluation.
- A **counterfactual cross-phase reward** (`r_cross`) that makes joint training meaningful.
- A four-pool **curriculum** (A → B → C, with held-out D) executed via on-policy **GRPO** with a variance gate and `r_cross` warmup.
- **Real measured improvement**: mean cumulative reward **≈1.59 (RL) vs ≈0.49 (base)** at less than half the steps. See `BLOG.md` §7 and `ablation.md`.
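
This README does not spell out how the variance gate works. As a minimal sketch, assuming it means "skip a GRPO group whose rewards are too similar to carry a learning signal", group-relative advantages could be computed as below. The function name, the `min_std` threshold, and the skip behaviour are illustrative assumptions, not the code in `training/grpo_train.py`.

```python
from statistics import mean, pstdev
from typing import Optional


def group_advantages(rewards: list[float], min_std: float = 1e-3) -> Optional[list[float]]:
    """Return group-normalised advantages, or None if the group is gated out.

    Assumption: the gate drops groups whose reward spread is below `min_std`.
    """
    if len(rewards) < 2:
        return None  # a single rollout has no group baseline
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some trainers use the sample std
    if sigma < min_std:
        return None  # all rollouts scored (almost) the same: no signal
    return [(r - mu) / sigma for r in rewards]
```

With a group of four rollouts, a group where every episode earns the same reward is skipped, while a mixed group yields advantages that sum to zero.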

---

## 📐 Environment at a glance

A **Partially-Observable Markov Decision Process** over a reactive microservices simulator. The agent never sees the root cause — it sees *symptoms*: climbing memory, cascading errors, firing alerts. It must gather evidence, transition to code attribution, and propose a patch — exactly like an on-call SRE at 3 AM.

| Dimension | Detail |
|---|---|
| **Observation** | Alerts · metric timeseries · structured logs · dependency graphs · deploy history · sandboxed repo tree + git log |
| **Action space** | Phase 1: 10 ops actions × 7 services. Phase 2: 5 code-exploration + 2 terminal actions. |
| **Difficulty** | Easy (single-service leak) → Medium (cascade) → Hard (distributed deadlock) → 5 research tasks → 2 held-out compounds |
| **Reward** | Oracle-shaped per-step signal for training + oracle-independent grader for eval + counterfactual `r_cross` |
| **Realism** | Reactive simulation — memory climbs, cascades propagate, restarts don't fix root causes |

### Topology

```
   ┌─────────┐   ┌─────┐    ┌────────┐   ┌─────────┐
   │ API GW  │──►│Auth │──► │ Orders │──►│ Payment │
   └────┬────┘   └─────┘    └───┬────┘   └────┬────┘
        ▼                       ▼             ▼
   ┌─────────┐            ┌─────────┐   ┌─────────┐
   │  Cache  │            │   DB    │   │  Queue  │
   └─────────┘            └─────────┘   └─────────┘
```

---

## 🔧 Action space (hierarchical + masked)

### Phase 1 — ops investigation

| Action | Category | Description |
|---|---|---|
| `view_alerts` | diagnostic | List firing alerts |
| `query_logs` | diagnostic | Service logs (level/keyword filters) |
| `check_metrics` | diagnostic | 30-min metric time series |
| `check_dependencies` | diagnostic | Up/downstream dependency map |
| `check_deploy_history` | diagnostic | Recent deploys per service |
| `run_health_check` | diagnostic | Ping a service |
| `restart_service` | remediation | Temporary fix |
| `rollback_deploy` | remediation | Real fix if the deploy is the root cause |
| `scale_service` | remediation | More replicas |
| `declare_root_cause` | terminal | Diagnosis string |
| `transition_to_phase2` | control | Hand off to code attribution |

### Phase 2 — code attribution

| Action | What it returns |
|---|---|
| `list_dir` | Files + subdirs at relative path |
| `read_file` | Up to 64 KB of file contents |
| `search_code` | grep across the tree (≤50 hits) |
| `get_git_log` | Commit metadata for a path |
| `get_file_diff` | Unified diff for `(commit_sha, path)` |
| `propose_patch` | Terminal — submit a unified diff |
| `declare_no_change` | Terminal — for spurious-issue scenarios |

> **Action masking:** every observation includes `valid_actions[]`. Illegal actions (e.g. rollback on a service with no deploy history) cost `-0.05` and are recorded for analysis.
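
A client can avoid the illegal-action penalty by filtering its candidate actions against the mask before calling `/step`. The sketch below assumes `valid_actions` is a flat list of action-type strings. The helper name and the fallback rule are hypothetical.

```python
def choose_action(preferred: list[str], valid_actions: list[str]) -> str:
    """Return the first preferred action that the mask allows.

    Falls back to the first valid action when none of the preferred ones
    is currently legal (an assumed policy, chosen only for illustration).
    """
    if not valid_actions:
        raise ValueError("observation contained no valid actions")
    allowed = set(valid_actions)
    for action in preferred:
        if action in allowed:
            return action
    return valid_actions[0]
```

For example, if the agent would like to roll back but the mask only offers diagnostics, `choose_action(["rollback_deploy", "check_metrics"], ["view_alerts", "check_metrics"])` returns `"check_metrics"`.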

---

## 👁️ Observation space (POMDP)

The agent **never** sees `fault_type`, the `is_bad` deploy flag, or internal simulation state.

It **does** see:

- Incident summary + severity (`SEV1` / `SEV2` / `SEV3`)
- Service statuses (`healthy` / `degraded` / `down`)
- Active alert count
- Action result (data from the most recent action)
- `valid_actions[]` (action mask)
- Time elapsed / budget (SLA pressure)
- Cumulative reward and step count
- `current_phase` ∈ {1, 2}
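
The visible fields above could be modelled client-side roughly as follows. Only `valid_actions` and `current_phase` are names taken from this README. Every other field name, and the choice of types, is a guess for illustration, so check the actual response payload before relying on it.

```python
from dataclasses import dataclass
from typing import Any

SEVERITIES = {"SEV1", "SEV2", "SEV3"}
STATUSES = {"healthy", "degraded", "down"}


@dataclass
class Observation:
    incident_summary: str             # hypothetical field name
    severity: str                     # one of SEVERITIES
    service_statuses: dict[str, str]  # service name -> one of STATUSES
    active_alert_count: int           # hypothetical field name
    action_result: Any                # data returned by the most recent action
    valid_actions: list[str]          # action mask (name from the README)
    time_elapsed: float               # units are not specified in the README
    time_budget: float                # hypothetical field name
    cumulative_reward: float
    step_count: int
    current_phase: int                # 1 or 2 (name from the README)

    def __post_init__(self) -> None:
        # Reject values outside the ranges documented in the list above.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        unknown = set(self.service_statuses.values()) - STATUSES
        if unknown:
            raise ValueError(f"unknown service status: {sorted(unknown)}")
        if self.current_phase not in (1, 2):
            raise ValueError(f"current_phase must be 1 or 2, got {self.current_phase}")
```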

---

## 📋 Tasks (10 scenarios, 4 pools)

| Task | Difficulty | Hidden lesson |
|---|---|---|
| `memory_leak` | easy | Single service, noisy metric — restart only buys minutes |
| `cascading_failure` | medium | Loud services aren't the cause — walk the dep graph |
| `distributed_deadlock` | hard | Three remediation actions in a specific order |
| `aliased_fault` | research | Symptoms alias across fault families |
| `severity_inversion` | research | SEV1 page, two-line code fix |
| `confidence_inversion` | research | Loud alerts on the wrong service |
| `info_ordering` | research | Decisive evidence shows up *late* |
| `circuit_breaker_noop` | research | Spurious issue — `declare_no_change` is correct |
| `heldout_aliased_severity` | held-out | Compound; never seen during training |
| `heldout_confidence_ordering` | held-out | Compound; never seen during training |

Pools: **A** (`p1_only`), **B** (`p2_only` with oracle handoff), **C** (`joint` with `r_cross`), **D** (held-out generalisation).

---

## 🎁 Reward design (two layers)

### Layer 1 — per-step shaped reward (training only)

| Action | Condition | Reward |
|---|---|---|
| Diagnostic | involved service | +0.15 |
| Diagnostic | uninvolved service | +0.05 |
| Any | repeat | −0.05 |
| Remediation | correct target (root cause svc) | +0.30 |
| Remediation | helpful (affected, not root) | +0.10 |
| Remediation | harmful (healthy svc) | −0.15 |
| Declaration | correct root cause | +0.40 |
| Declaration | wrong root cause | −0.20 |
| Any | per-step efficiency cost | −0.02 |
| Completion | all services healthy | +0.20 |
| Completion | budget exceeded | −0.10 |
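
The table can be read as a lookup plus a flat per-step cost. The sketch below is one way to encode it, not the environment's actual reward code. It assumes the efficiency cost is added to every step and that a repeat penalty replaces the base reward. The completion bonuses are left out for brevity.

```python
STEP_COST = -0.02       # per-step efficiency cost
REPEAT_PENALTY = -0.05  # assumed to replace the base reward on a repeated action

# (category, condition) -> base reward, copied from the table above
SHAPED_REWARDS = {
    ("diagnostic", "involved"): 0.15,
    ("diagnostic", "uninvolved"): 0.05,
    ("remediation", "correct"): 0.30,
    ("remediation", "helpful"): 0.10,
    ("remediation", "harmful"): -0.15,
    ("declaration", "correct"): 0.40,
    ("declaration", "wrong"): -0.20,
}


def step_reward(category: str, condition: str, repeated: bool = False) -> float:
    """Shaped reward for one step: base (or repeat penalty) plus the step cost."""
    base = REPEAT_PENALTY if repeated else SHAPED_REWARDS[(category, condition)]
    return round(base + STEP_COST, 2)
```

Under these assumptions, a diagnostic on an involved service nets `0.15 - 0.02 = 0.13` and a correct remediation nets `0.30 - 0.02 = 0.28`, which matches the per-step rewards shown in the example interaction at the end of this README.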

### Layer 2 — oracle-independent grader (evaluation)

| Component | Weight | Measures |
|---|---|---|
| `p1_rca` | 25 % | Did the agent declare the correct root cause? |
| `p1_efficiency` | 15 % | Fewer steps to declare = better |
| `patch_quality` | 35 % | File overlap (Jaccard) + AST hunk similarity + syntax validity |
| `no_change_detection` | 25 % | Correct `declare_no_change` on spurious-issue scenarios |
| `p2_efficiency` | 25 % | Phase-2 step efficiency (takes the `no_change_detection` slot when the issue is real) |
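
Because `no_change_detection` and `p2_efficiency` share one 25 % slot, the weights sum to 100 % in either case. A minimal sketch of the combination follows, assuming each component is already scored in `[0, 1]` and that the grade is a plain weighted sum. The function names are hypothetical, and the Jaccard helper covers only the file-overlap part of `patch_quality`.

```python
FIXED_WEIGHTS = {
    "p1_rca": 0.25,
    "p1_efficiency": 0.15,
    "patch_quality": 0.35,
}
SLOT_WEIGHT = 0.25  # shared by no_change_detection and p2_efficiency


def jaccard(a: set[str], b: set[str]) -> float:
    """File-overlap score |a ∩ b| / |a ∪ b|; two empty sets count as a match (assumed)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def final_grade(components: dict[str, float], spurious_issue: bool) -> float:
    """Weighted sum of grader components, picking the slot that applies."""
    slot = "no_change_detection" if spurious_issue else "p2_efficiency"
    score = sum(weight * components[name] for name, weight in FIXED_WEIGHTS.items())
    score += SLOT_WEIGHT * components[slot]
    return round(score, 4)
```

For instance, a patch touching `{"a.py", "b.py"}` against a reference touching `{"b.py", "c.py"}` shares one of three files, so its file-overlap score is 1/3.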

Plus the counterfactual cross-phase reward:

```
r_cross(τ) = max(0, r_code(τ_2 | context(τ_1)) − r_code(τ_2 | ∅))
```
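
In words: `r_cross` is the part of the Phase-2 code reward that can be credited to the Phase-1 context, clipped at zero so that unhelpful context is never punished through this term. A direct transcription follows. How the two `r_code` values are obtained is not described here, so the function simply takes them as inputs, and the parameter names are illustrative.

```python
def r_cross(r_code_with_context: float, r_code_without_context: float) -> float:
    """Counterfactual cross-phase reward: max(0, r_code(τ_2 | context) - r_code(τ_2 | ∅))."""
    return max(0.0, r_code_with_context - r_code_without_context)
```

If Phase 2 scores 0.8 with the Phase-1 context and 0.5 without it, `r_cross` is about 0.3. If the context makes things worse (0.4 versus 0.6), `r_cross` is 0.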

---

## 📈 Headline result

| Model | Mean cumulative reward (≈30 steps) | Steps to plateau | σ at plateau |
|---|---|---|---|
| Base (Qwen2.5-7B-Instruct) | ~0.20 | not reached within 60 steps | wide |
| SFT (LoRA) | ~0.95 | ~50 | medium |
| **Post-trained (GRPO + merge)** | **~1.59** | **~25** | **tight** |

Full plots, ablations, and component breakdown in [`BLOG.md`](./BLOG.md) §7–8.

---

## 🚀 Quick start

### Run the environment locally

```bash
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

```bash
curl http://localhost:8000/health
curl -X POST http://localhost:8000/reset \
     -H "Content-Type: application/json" \
     -d '{"task_name": "memory_leak"}'
curl -X POST http://localhost:8000/step \
     -H "Content-Type: application/json" \
     -d '{"action_type": "view_alerts"}'
```
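
The same calls can be made from Python with only the standard library. This is a sketch that mirrors the curl commands above. It assumes every endpoint returns JSON, and the helper names are not part of the repo.

```python
import json
import urllib.request
from typing import Any, Optional


def build_request(base_url: str, endpoint: str,
                  payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a GET (no payload) or JSON POST request for the environment server."""
    url = base_url.rstrip("/") + endpoint
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(base_url: str, endpoint: str,
         payload: Optional[dict] = None, timeout: float = 10.0) -> Any:
    """Send the request and decode the response body as JSON (assumed format)."""
    request = build_request(base_url, endpoint, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `call("http://localhost:8000", "/reset", {"task_name": "memory_leak"})` followed by `call("http://localhost:8000", "/step", {"action_type": "view_alerts"})` reproduces the curl session.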

### Run the trained agent

```bash
export ENV_BASE_URL=http://localhost:8000
python inference.py --model Yaswanth-Bolla/qwen-merged
```

### Run the agent against a real GitHub issue + repo

```bash
python inference_agent.py \
    --model  Yaswanth-Bolla/qwen-merged \
    --repo   /path/to/cloned/repo \
    --issue  https://github.com/owner/repo/issues/42
```

### Docker

```bash
docker build -t incident-env .
docker run -p 8000:8000 incident-env
```

---

## 🏋️ Reproducing the training run

We ran every stage on **HuggingFace Jobs** (A100-40GB) — see [`./logger/`](./logger/) for the exact scripts and their full stdout.

```bash
# Stage 1 — collect baseline trajectories (HF Inference API)
python sre_finetune_collector.py            # → sre_*_dataset.jsonl

# Stage 2 — LoRA SFT via TRL
python sft.py \
    --model_name_or_path Qwen/Qwen2.5-7B-Instruct \
    --dataset_name <your-sft-dataset> \
    --use_peft --lora_r 32 --lora_alpha 16 \
    --learning_rate 2e-4 --num_train_epochs 1 \
    --packing --eos_token '<|im_end|>' \
    --output_dir Qwen2.5-7B-SRE-SFT --push_to_hub

# Stage 3+4 — online GRPO (Pool A → B → C)
python training/grpo_train.py \
    --model     <your-sft-checkpoint> \
    --stages    2 3 4 \
    --group_size 4 --episodes_per_task 64 \
    --use_lora --lora_r 16 --lora_alpha 32 \
    --push_to_hub daemongg/qwen2.5-7b-sre-grpo

# Stage 5 — merge LoRA into base
python merge.py
```

Logs from these exact runs:

| Stage | Log |
|---|---|
| Trajectory collection | [`logger/trajectory.log`](./logger/trajectory.log) |
| SFT | [`logger/sft_finetune.log`](./logger/sft_finetune.log) |
| GRPO | [`logger/grpo_finetune.log`](./logger/grpo_finetune.log) |
| Merge | [`logger/merge.log`](./logger/merge.log) |
| Ablations | [`logger/ablation.log`](./logger/ablation.log) |

---

## 🗂️ Repository layout

```
.
├── BLOG.md                    # Full write-up (start here)
├── README.md                  # This file
├── ablation.md                # Ablation results table
├── openenv.yaml               # OpenEnv spec
├── server/                    # FastAPI server + IncidentEnvironment + CodeWorkspace
├── scenarios/                 # 10 scenarios, code-context registry, P2 grader
├── simulation/                # Reactive infra: services, metrics, logs, alerts
├── snapshots/                 # 8 mini-repo snapshots for Phase 2 (tree + git log + diffs)
├── training/                  # GRPO trainer, curriculum, variance gate, segment-GRPO loss
├── sft.py                     # TRL SFTTrainer entry point
├── merge.py                   # peft.merge_and_unload + push_to_hub
├── inference.py               # Run any LLM against the env
├── inference_agent.py         # Run the trained agent against a real repo + GitHub issue
├── sre_finetune_collector.py  # Stage-1 trajectory collector
├── assets/                    # Diagrams + result figures (referenced from BLOG.md)
└── logger/                    # ★ Full HF Jobs logs + the scripts that produced them
```

---

## 💬 Example interaction

```
Agent: POST /reset {"task_name": "memory_leak"}
  → Incident triggered: "Orders service experiencing failures..."
  → Services: orders=degraded, rest=healthy

Agent: POST /step {"action_type": "view_alerts"}
  → 3 alerts: orders HighMemoryUsage (critical), HighErrorRate, HighLatencyP99
  → reward = +0.13

Agent: POST /step {"action_type": "check_metrics", "target_service": "orders"}
  → 30 data points: memory climbing 35 % → 78 % over 20 min
  → reward = +0.13

Agent: POST /step {"action_type": "check_deploy_history", "target_service": "orders"}
  → v2.3.1 (20 min ago, "batch order processing") · v1.2.0
  → reward = +0.13

Agent: POST /step {"action_type": "rollback_deploy", "target_service": "orders"}
  → "Rolled back orders v2.3.1 → v1.2.0 — service recovering"
  → reward = +0.28

Agent: POST /step {"action_type": "declare_root_cause",
                    "parameters": {"root_cause": "memory leak in orders caused by bad deploy v2.3.1"}}
  → Episode done. Final grade: 0.97
```

---

## 📜 License & credits

- Environment, training scripts, scenarios: this repo.
- Base model: `Qwen/Qwen2.5-7B-Instruct` (Apache-2.0).
- Trainer: HuggingFace TRL (`SFTTrainer`) and our on-policy GRPO loop in `training/grpo_train.py`.
- Built for the **OpenEnv hackathon** — see [`RULES.md`](./RULES.md).

For the full story, results, and ablations, read [`BLOG.md`](./BLOG.md).