Spaces:

Meta-HF-hackathon
/

updated-policy

Sleeping

App Files Files Community

updated-policy / README.md

srinjoyd

Update README.md

f2a72e6 verified 25 days ago

preview code

raw

history blame contribute delete

14 kB

	---
	title: SRE Incident Response Simulator
	emoji: 🚨
	colorFrom: red
	colorTo: gray
	sdk: docker
	app_port: 8000
	pinned: false
	---

	# 🚨 SRE Triage Bot — OpenEnv Incident Response Simulator

	> An OpenEnv environment + a four-stage GRPO pipeline that turns Qwen2.5-7B-Instruct into a working SRE triage agent. Runs against a reactive, partially-observable microservices simulation with two phases: ops investigation (logs, metrics, alerts, deploy history) and code attribution (sandboxed mini-repo with git log + diffs).

	---

	## 🔗 Important Links

	\| Resource \| Link \|
	\| --- \| --- \|
	\| 📝 Blog post (full write-up) \| [`BLOG.md`](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/blob/main/BLOG.md) \|
	\| 🛰️ Live environment (HF Space) \| [Meta-HF-hackathon/updated-policy](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/) \|
	\| 🧠 Merged model (deployable) \| [`Yaswanth-Bolla/qwen-merged`](https://huggingface.co/Yaswanth-Bolla/qwen-merged) \|
	\| 🧩 LoRA adapter (post-GRPO) \| [`daemongg/qwen2.5-7b-sre-grpo`](https://huggingface.co/daemongg/qwen2.5-7b-sre-grpo) \|
	\| 🏗️ Base model \| [`Qwen/Qwen2.5-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) \|
	\| 📒 Logs + scripts \| [`logger`](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/tree/main/logger) \|

	> ⚠️ Note on training infrastructure. We ran the full pipeline (SFT, GRPO, merge) on HuggingFace Jobs (A100-40GB) instead of a Colab notebook — Colab's free + Pro tiers OOM'd on the 7B base + reference model + GRPO group buffers. The complete training logs and the exact scripts we executed are committed under [`./logger/`](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/tree/main/logger) (`sft_finetune.log`, `grpo_finetune.log`, `merge.log`, `trajectory.log`, `ablation.log`, plus the `.py` scripts that produced them) so the run is reproducible end-to-end.

	---

	## 🎯 What this submission delivers

	- A novel two-phase POMDP environment with hierarchical, masked actions (10 ops actions + 7 code actions).
	- A two-layer reward — dense oracle-shaped per-step signal for training, oracle-independent grader for evaluation.
	- A counterfactual cross-phase reward (`r_cross`) that makes joint training meaningful.
	- A four-pool curriculum (A → B → C, with held-out D) executed via on-policy GRPO with a variance gate and `r_cross` warmup.
	- Real measured improvement: mean cumulative reward ≈1.59 (RL) vs ≈0.49 (base) at less than half the steps. See `BLOG.md` §7 and `ablation.md`.

	---

	## 📐 Environment at a glance

	A Partially-Observable Markov Decision Process over a reactive microservices simulator. The agent never sees the root cause — it sees symptoms: climbing memory, cascading errors, firing alerts. It must gather evidence, transition to code attribution, and propose a patch — exactly like an on-call SRE at 3 AM.

	\| Dimension \| Detail \|
	\|---\|---\|
	\| Observation \| Alerts · metric timeseries · structured logs · dependency graphs · deploy history · sandboxed repo tree + git log \|
	\| Action space \| Phase 1: 10 ops actions × 7 services. Phase 2: 5 code-exploration + 2 terminal actions. \|
	\| Difficulty \| Easy (single-service leak) → Medium (cascade) → Hard (distributed deadlock) → 5 research tasks → 2 held-out compounds \|
	\| Reward \| Oracle-shaped per-step signal for training + oracle-independent grader for eval + counterfactual `r_cross` \|
	\| Realism \| Reactive simulation — memory climbs, cascades propagate, restarts don't fix root causes \|

	### Topology

	```
	┌─────────┐ ┌─────┐ ┌────────┐ ┌─────────┐
	│ API GW │──►│Auth │──► │ Orders │──►│ Payment │
	└────┬────┘ └─────┘ └───┬────┘ └────┬────┘
	▼ ▼ ▼
	┌─────────┐ ┌─────────┐ ┌─────────┐
	│ Cache │ │ DB │ │ Queue │
	└─────────┘ └─────────┘ └─────────┘
	```

	---

	## 🔧 Action space (hierarchical + masked)

	### Phase 1 — ops investigation

	\| Action \| Category \| Description \|
	\|---\|---\|---\|
	\| `view_alerts` \| diagnostic \| List firing alerts \|
	\| `query_logs` \| diagnostic \| Service logs (level/keyword filters) \|
	\| `check_metrics` \| diagnostic \| 30-min metric time series \|
	\| `check_dependencies` \| diagnostic \| Up/downstream dependency map \|
	\| `check_deploy_history` \| diagnostic \| Recent deploys per service \|
	\| `run_health_check` \| diagnostic \| Ping a service \|
	\| `restart_service` \| remediation \| Temporary fix \|
	\| `rollback_deploy` \| remediation \| Real fix if root cause \|
	\| `scale_service` \| remediation \| More replicas \|
	\| `declare_root_cause` \| terminal \| Diagnosis string \|
	\| `transition_to_phase2` \| control \| Hand off to code attribution \|

	### Phase 2 — code attribution

	\| Action \| What it returns \|
	\|---\|---\|
	\| `list_dir` \| Files + subdirs at relative path \|
	\| `read_file` \| Up to 64 KB of file contents \|
	\| `search_code` \| grep across the tree (≤50 hits) \|
	\| `get_git_log` \| Commit metadata for a path \|
	\| `get_file_diff` \| Unified diff for `(commit_sha, path)` \|
	\| `propose_patch` \| Terminal — submit a unified diff \|
	\| `declare_no_change` \| Terminal — for spurious-issue scenarios \|

	> Action masking: every observation includes `valid_actions[]`. Illegal actions (e.g. rollback on a service with no deploy history) cost `-0.05` and are recorded for analysis.

	---

	## 👁️ Observation space (POMDP)

	The agent never sees: `fault_type`, `is_bad` deploy flag, internal simulation state.

	It does see:

	- Incident summary + severity (`SEV1` / `SEV2` / `SEV3`)
	- Service statuses (`healthy` / `degraded` / `down`)
	- Active alert count
	- Action result (data from the most recent action)
	- `valid_actions[]` (action mask)
	- Time elapsed / budget (SLA pressure)
	- Cumulative reward and step count
	- `current_phase` ∈ {1, 2}

	---

	## 📋 Tasks (10 scenarios, 4 pools)

	\| Task \| Difficulty \| Hidden lesson \|
	\|---\|---\|---\|
	\| `memory_leak` \| easy \| Single service, noisy metric — restart only buys minutes \|
	\| `cascading_failure` \| medium \| Loud services aren't the cause — walk the dep graph \|
	\| `distributed_deadlock` \| hard \| Three remediation actions in a specific order \|
	\| `aliased_fault` \| research \| Symptoms alias across fault families \|
	\| `severity_inversion` \| research \| SEV1 page, two-line code fix \|
	\| `confidence_inversion` \| research \| Loud alerts on the wrong service \|
	\| `info_ordering` \| research \| Decisive evidence shows up late \|
	\| `circuit_breaker_noop` \| research \| Spurious issue — `declare_no_change` is correct \|
	\| `heldout_aliased_severity` \| held-out \| Compound; never seen during training \|
	\| `heldout_confidence_ordering` \| held-out \| Compound; never seen during training \|

	Pools: A (`p1_only`), B (`p2_only` with oracle handoff), C (`joint` with `r_cross`), D (held-out generalisation).

	---

	## 🎁 Reward design (two layers)

	### Layer 1 — per-step shaped reward (training only)

	\| Action \| Condition \| Reward \|
	\|---\|---\|---\|
	\| Diagnostic \| involved service \| +0.15 \|
	\| Diagnostic \| uninvolved service \| +0.05 \|
	\| Any \| repeat \| −0.05 \|
	\| Remediation \| correct target (root cause svc) \| +0.30 \|
	\| Remediation \| helpful (affected, not root) \| +0.10 \|
	\| Remediation \| harmful (healthy svc) \| −0.15 \|
	\| Declaration \| correct root cause \| +0.40 \|
	\| Declaration \| wrong root cause \| −0.20 \|
	\| Any \| per-step efficiency cost \| −0.02 \|
	\| Completion \| all services healthy \| +0.20 \|
	\| Completion \| budget exceeded \| −0.10 \|

	### Layer 2 — oracle-independent grader (evaluation)

	\| Component \| Weight \| Measures \|
	\|---\|---\|---\|
	\| `p1_rca` \| 25 % \| Did the agent declare the correct root cause? \|
	\| `p1_efficiency` \| 15 % \| Fewer steps to declare = better \|
	\| `patch_quality` \| 35 % \| File overlap (Jaccard) + AST hunk similarity + syntax validity \|
	\| `no_change_detection` \| 25 % \| Correct `declare_no_change` on spurious-issue scenarios \|
	\| `p2_efficiency` \| 25 % \| Phase-2 step efficiency (replaces `no_change` slot when valid issue) \|

	Plus the counterfactual cross-phase reward:

	```
	r_cross(τ) = max(0, r_code(τ_2 \| context(τ_1)) − r_code(τ_2 \| ∅))
	```

	---

	## 📈 Headline result

	\| Model \| Mean cumulative reward (≈30 steps) \| Steps to plateau \| σ at plateau \|
	\|---\|---\|---\|---\|
	\| Base (Qwen2.5-7B-Instruct) \| ~0.20 \| never within 60 \| wide \|
	\| SFT (LoRA) \| ~0.95 \| ~50 \| medium \|
	\| Post-trained (GRPO + merge) \| ~1.59 \| ~25 \| tight \|

	Full plots, ablations, and component breakdown in [`BLOG.md`](./BLOG.md) §7–8.

	---

	## 🚀 Quick start

	### Run the environment locally

	```bash
	pip install -e .
	uvicorn server.app:app --host 0.0.0.0 --port 8000
	```

	```bash
	curl http://localhost:8000/health
	curl -X POST http://localhost:8000/reset \
	-H "Content-Type: application/json" \
	-d '{"task_name": "memory_leak"}'
	curl -X POST http://localhost:8000/step \
	-H "Content-Type: application/json" \
	-d '{"action_type": "view_alerts"}'
	```

	### Run the trained agent

	```bash
	export ENV_BASE_URL=http://localhost:8000
	python inference.py --model Yaswanth-Bolla/qwen-merged
	```

	### Run the agent against a real GitHub issue + repo

	```bash
	python inference_agent.py \
	--model Yaswanth-Bolla/qwen-merged \
	--repo /path/to/cloned/repo \
	--issue https://github.com/owner/repo/issues/42
	```

	### Docker

	```bash
	docker build -t incident-env .
	docker run -p 8000:8000 incident-env
	```

	---

	## 🏋️ Reproducing the training run

	We ran every stage on HuggingFace Jobs (A100-40GB) — see [`./logger/`](./logger/) for the exact scripts and their full stdout.

	```bash
	# Stage 1 — collect baseline trajectories (HF Inference API)
	python sre_finetune_collector.py # → sre_*_dataset.jsonl

	# Stage 2 — LoRA SFT via TRL
	python sft.py \
	--model_name_or_path Qwen/Qwen2.5-7B-Instruct \
	--dataset_name <your-sft-dataset> \
	--use_peft --lora_r 32 --lora_alpha 16 \
	--learning_rate 2e-4 --num_train_epochs 1 \
	--packing --eos_token '<\|im_end\|>' \
	--output_dir Qwen2.5-7B-SRE-SFT --push_to_hub

	# Stage 3+4 — online GRPO (Pool A → B → C)
	python training/grpo_train.py \
	--model <your-sft-checkpoint> \
	--stages 2 3 4 \
	--group_size 4 --episodes_per_task 64 \
	--use_lora --lora_r 16 --lora_alpha 32 \
	--push_to_hub daemongg/qwen2.5-7b-sre-grpo

	# Stage 5 — merge LoRA into base
	python merge.py
	```

	Logs from these exact runs:

	\| Stage \| Log \|
	\|---\|---\|
	\| Trajectory collection \| [`logger/trajectory.log`](./logger/trajectory.log) \|
	\| SFT \| [`logger/sft_finetune.log`](./logger/sft_finetune.log) \|
	\| GRPO \| [`logger/grpo_finetune.log`](./logger/grpo_finetune.log) \|
	\| Merge \| [`logger/merge.log`](./logger/merge.log) \|
	\| Ablations \| [`logger/ablation.log`](./logger/ablation.log) \|

	---

	## 🗂️ Repository layout

	```
	.
	├── BLOG.md # Full write-up (start here)
	├── README.md # This file
	├── ablation.md # Ablation results table
	├── openenv.yaml # OpenEnv spec
	├── server/ # FastAPI server + IncidentEnvironment + CodeWorkspace
	├── scenarios/ # 10 scenarios, code-context registry, P2 grader
	├── simulation/ # Reactive infra: services, metrics, logs, alerts
	├── snapshots/ # 8 mini-repo snapshots for Phase 2 (tree + git log + diffs)
	├── training/ # GRPO trainer, curriculum, variance gate, segment-GRPO loss
	├── sft.py # TRL SFTTrainer entry point
	├── merge.py # peft.merge_and_unload + push_to_hub
	├── inference.py # Run any LLM against the env
	├── inference_agent.py # Run the trained agent against a real repo + GitHub issue
	├── sre_finetune_collector.py # Stage-1 trajectory collector
	├── assets/ # Diagrams + result figures (referenced from BLOG.md)
	└── logger/ # ★ Full HF Jobs logs + the scripts that produced them
	```

	---

	## 💬 Example interaction

	```
	Agent: POST /reset {"task_name": "memory_leak"}
	→ Incident triggered: "Orders service experiencing failures..."
	→ Services: orders=degraded, rest=healthy

	Agent: POST /step {"action_type": "view_alerts"}
	→ 3 alerts: orders HighMemoryUsage (critical), HighErrorRate, HighLatencyP99
	→ reward = +0.13

	Agent: POST /step {"action_type": "check_metrics", "target_service": "orders"}
	→ 30 data points: memory climbing 35 % → 78 % over 20 min
	→ reward = +0.13

	Agent: POST /step {"action_type": "check_deploy_history", "target_service": "orders"}
	→ v2.3.1 (20 min ago, "batch order processing") · v1.2.0
	→ reward = +0.13

	Agent: POST /step {"action_type": "rollback_deploy", "target_service": "orders"}
	→ "Rolled back orders v2.3.1 → v1.2.0 — service recovering"
	→ reward = +0.28

	Agent: POST /step {"action_type": "declare_root_cause",
	"parameters": {"root_cause": "memory leak in orders caused by bad deploy v2.3.1"}}
	→ Episode done. Final grade: 0.97
	```

	---

	## 📜 License & credits

	- Environment, training scripts, scenarios: this repo.
	- Base model: `Qwen/Qwen2.5-7B-Instruct` (Apache-2.0).
	- Trainer: HuggingFace TRL (`SFTTrainer`) and our on-policy GRPO loop in `training/grpo_train.py`.
	- Built for the OpenEnv hackathon — see [`RULES.md`](./RULES.md).

	For the full story, results, and ablations, read [`BLOG.md`](./BLOG.md).