srinjoyd

Update README.md

f2a72e6 verified 24 days ago

title: SRE Incident Response Simulator
emoji: 🚨
colorFrom: red
colorTo: gray
sdk: docker
app_port: 8000
pinned: false

🚨 SRE Triage Bot — OpenEnv Incident Response Simulator

An OpenEnv environment plus a four-stage GRPO pipeline that turns Qwen2.5-7B-Instruct into a working SRE triage agent. The agent runs against a reactive, partially observable microservices simulation with two phases: ops investigation (logs, metrics, alerts, deploy history) and code attribution (a sandboxed mini-repo with git log and diffs).

🔗 Important Links

| Resource | Link |
| --- | --- |
| 📝 Blog post (full write-up) | `BLOG.md` |
| 🛰️ Live environment (HF Space) | Meta-HF-hackathon/updated-policy |
| 🧠 Merged model (deployable) | `Yaswanth-Bolla/qwen-merged` |
| 🧩 LoRA adapter (post-GRPO) | `daemongg/qwen2.5-7b-sre-grpo` |
| 🏗️ Base model | `Qwen/Qwen2.5-7B-Instruct` |
| 📒 Logs + scripts | `logger` |

⚠️ Note on training infrastructure. We ran the full pipeline (SFT, GRPO, merge) on HuggingFace Jobs (A100-40GB) instead of a Colab notebook — Colab's free + Pro tiers OOM'd on the 7B base + reference model + GRPO group buffers. The complete training logs and the exact scripts we executed are committed under ./logger/ (sft_finetune.log, grpo_finetune.log, merge.log, trajectory.log, ablation.log, plus the .py scripts that produced them) so the run is reproducible end-to-end.

🎯 What this submission delivers

A novel two-phase POMDP environment with hierarchical, masked actions (10 ops actions + 7 code actions).
A two-layer reward — dense oracle-shaped per-step signal for training, oracle-independent grader for evaluation.
A counterfactual cross-phase reward (r_cross) that makes joint training meaningful.
A four-pool curriculum (A → B → C, with held-out D) executed via on-policy GRPO with a variance gate and r_cross warmup.
Real measured improvement: mean cumulative reward ≈1.59 (RL) vs ≈0.49 (base) at less than half the steps. See BLOG.md §7 and ablation.md.
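
The GRPO item above mentions a variance gate. One plausible reading, sketched below, is that each group of rollouts is scored, the rewards are normalised within the group, and groups whose rewards barely differ are skipped because they carry no learning signal. The function name and the `min_std` threshold are illustrative assumptions, not values taken from `training/`; the real gate may be defined differently.

```python
from statistics import mean, pstdev


def group_advantages(rewards, min_std=1e-3):
    """Group-relative advantages for one GRPO group.

    Returns None when the variance gate rejects the group.
    `min_std` is a hypothetical threshold, not a value from the repository.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma < min_std:
        # Every rollout scored (almost) the same, so there is nothing to rank.
        return None
    return [(r - mu) / sigma for r in rewards]


# A group of 4 rollouts, matching --group_size 4 in the training command.
advantages = group_advantages([0.13, 0.41, 0.67, 0.41])

# A flat group is filtered out by the gate.
flat = group_advantages([0.5, 0.5, 0.5, 0.5])
```

Normalising inside the group means a rollout is only rewarded for beating its siblings on the same task, which is why a zero-variance group contributes nothing and can be dropped.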

📐 Environment at a glance

A Partially-Observable Markov Decision Process over a reactive microservices simulator. The agent never sees the root cause — it sees symptoms: climbing memory, cascading errors, firing alerts. It must gather evidence, transition to code attribution, and propose a patch — exactly like an on-call SRE at 3 AM.

| Dimension | Detail |
| --- | --- |
| Observation | Alerts · metric timeseries · structured logs · dependency graphs · deploy history · sandboxed repo tree + git log |
| Action space | Phase 1: 10 ops actions × 7 services. Phase 2: 5 code-exploration + 2 terminal actions. |
| Difficulty | Easy (single-service leak) → Medium (cascade) → Hard (distributed deadlock) → 5 research tasks → 2 held-out compounds |
| Reward | Oracle-shaped per-step signal for training + oracle-independent grader for eval + counterfactual `r_cross` |
| Realism | Reactive simulation: memory climbs, cascades propagate, restarts don't fix root causes |

Topology

   ┌─────────┐   ┌─────┐    ┌────────┐   ┌─────────┐
   │ API GW  │──►│Auth │──► │ Orders │──►│ Payment │
   └────┬────┘   └─────┘    └───┬────┘   └────┬────┘
        ▼                       ▼             ▼
   ┌─────────┐            ┌─────────┐   ┌─────────┐
   │  Cache  │            │   DB    │   │  Queue  │
   └─────────┘            └─────────┘   └─────────┘

🔧 Action space (hierarchical + masked)

Phase 1 — ops investigation

| Action | Category | Description |
| --- | --- | --- |
| `view_alerts` | diagnostic | List firing alerts |
| `query_logs` | diagnostic | Service logs (level/keyword filters) |
| `check_metrics` | diagnostic | 30-min metric time series |
| `check_dependencies` | diagnostic | Up/downstream dependency map |
| `check_deploy_history` | diagnostic | Recent deploys per service |
| `run_health_check` | diagnostic | Ping a service |
| `restart_service` | remediation | Temporary fix |
| `rollback_deploy` | remediation | Real fix if root cause |
| `scale_service` | remediation | More replicas |
| `declare_root_cause` | terminal | Diagnosis string |
| `transition_to_phase2` | control | Hand off to code attribution |

Phase 2 — code attribution

| Action | What it returns |
| --- | --- |
| `list_dir` | Files + subdirs at relative path |
| `read_file` | Up to 64 KB of file contents |
| `search_code` | grep across the tree (≤50 hits) |
| `get_git_log` | Commit metadata for a path |
| `get_file_diff` | Unified diff for `(commit_sha, path)` |
| `propose_patch` | Terminal: submit a unified diff |
| `declare_no_change` | Terminal: for spurious-issue scenarios |
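
Since `propose_patch` expects a unified diff, a client needs some way to produce one from a before/after pair. The sketch below uses Python's standard `difflib` for that. The file path `orders/batch.py`, the file contents, and the `"patch"` parameter key are all hypothetical; this README does not document the exact payload schema, so check the server code before relying on it.

```python
import difflib


def make_unified_diff(path, before, after):
    """Return a unified diff string turning `before` into `after`."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# Hypothetical file contents, for illustration only.
before = "cache = []\ncache.append(order)\n"
after = "cache = []\ncache.append(order)\ncache = cache[-100:]\n"

patch = make_unified_diff("orders/batch.py", before, after)

# Assumed payload shape: the "patch" key is a guess, not a documented field.
action = {"action_type": "propose_patch", "parameters": {"patch": patch}}
```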

Action masking: every observation includes `valid_actions[]`. Illegal actions (e.g. a rollback on a service with no deploy history) cost −0.05 and are recorded for analysis.
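
A client can avoid the illegal-action penalty by consulting the mask before acting. The sketch below assumes that `valid_actions` is a flat list of action-type strings inside the observation; the README names the field but not its element shape, so treat that as an assumption. The helper name is illustrative.

```python
def pick_valid_action(observation, preferred):
    """Return the first action in `preferred` that the mask allows, else None.

    Assumes observation["valid_actions"] is a list of action-type strings.
    """
    valid = set(observation.get("valid_actions", []))
    for action in preferred:
        if action in valid:
            return action
    return None


# Hypothetical observation: rollback is masked out (no deploy history).
obs = {"valid_actions": ["view_alerts", "check_metrics", "restart_service"]}

choice = pick_valid_action(obs, ["rollback_deploy", "restart_service"])
```

Here the agent prefers a rollback but falls back to `restart_service` because the rollback is not in the mask.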

👁️ Observation space (POMDP)

The agent never sees `fault_type`, the `is_bad` deploy flag, or the internal simulation state.

It does see:

Incident summary + severity (SEV1 / SEV2 / SEV3)
Service statuses (healthy / degraded / down)
Active alert count
Action result (data from the most recent action)
valid_actions[] (action mask)
Time elapsed / budget (SLA pressure)
Cumulative reward and step count
current_phase ∈ {1, 2}

📋 Tasks (10 scenarios, 4 pools)

| Task | Difficulty | Hidden lesson |
| --- | --- | --- |
| `memory_leak` | easy | Single service, noisy metric; restart only buys minutes |
| `cascading_failure` | medium | Loud services aren't the cause; walk the dep graph |
| `distributed_deadlock` | hard | Three remediation actions in a specific order |
| `aliased_fault` | research | Symptoms alias across fault families |
| `severity_inversion` | research | SEV1 page, two-line code fix |
| `confidence_inversion` | research | Loud alerts on the wrong service |
| `info_ordering` | research | Decisive evidence shows up late |
| `circuit_breaker_noop` | research | Spurious issue; `declare_no_change` is correct |
| `heldout_aliased_severity` | held-out | Compound; never seen during training |
| `heldout_confidence_ordering` | held-out | Compound; never seen during training |

Pools: A (p1_only), B (p2_only with oracle handoff), C (joint with r_cross), D (held-out generalisation).

🎁 Reward design (two layers)

Layer 1 — per-step shaped reward (training only)

| Action | Condition | Reward |
| --- | --- | --- |
| Diagnostic | involved service | +0.15 |
| Diagnostic | uninvolved service | +0.05 |
| Any | repeat | −0.05 |
| Remediation | correct target (root cause svc) | +0.30 |
| Remediation | helpful (affected, not root) | +0.10 |
| Remediation | harmful (healthy svc) | −0.15 |
| Declaration | correct root cause | +0.40 |
| Declaration | wrong root cause | −0.20 |
| Any | per-step efficiency cost | −0.02 |
| Completion | all services healthy | +0.20 |
| Completion | budget exceeded | −0.10 |
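
To make the table concrete, the sketch below computes the reward for a single step as the action's table value plus the per-step efficiency cost. That additive combination is an assumption, but it agrees with the example interaction in this README, where an involved-service diagnostic yields +0.13 and a correct rollback yields +0.28. The dictionary keys and function name are illustrative; repeat penalties and completion bonuses are left out because the README does not say how they combine with the other terms.

```python
STEP_COST = -0.02  # per-step efficiency cost from the table

# (category, condition) -> shaped reward, copied from the Layer 1 table.
SHAPED_REWARD = {
    ("diagnostic", "involved"): 0.15,
    ("diagnostic", "uninvolved"): 0.05,
    ("remediation", "root_cause"): 0.30,
    ("remediation", "affected"): 0.10,
    ("remediation", "healthy"): -0.15,
    ("declaration", "correct"): 0.40,
    ("declaration", "wrong"): -0.20,
}


def step_reward(category, condition):
    """Shaped reward for one step, assuming the step cost is simply added."""
    return round(SHAPED_REWARD[(category, condition)] + STEP_COST, 2)


diagnostic = step_reward("diagnostic", "involved")    # 0.13
rollback = step_reward("remediation", "root_cause")   # 0.28
```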

Layer 2 — oracle-independent grader (evaluation)

| Component | Weight | Measures |
| --- | --- | --- |
| `p1_rca` | 25 % | Did the agent declare the correct root cause? |
| `p1_efficiency` | 15 % | Fewer steps to declare = better |
| `patch_quality` | 35 % | File overlap (Jaccard) + AST hunk similarity + syntax validity |
| `no_change_detection` | 25 % | Correct `declare_no_change` on spurious-issue scenarios |
| `p2_efficiency` | 25 % | Phase-2 step efficiency (takes the `no_change_detection` slot when the issue is genuine) |
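
Read as a weighted sum, the table gives two weight sets that each total 100 %: one for genuine issues (with `p2_efficiency`) and one for spurious issues (with `no_change_detection`). The sketch below assumes every component score lies in [0, 1] and that the grade is a plain linear combination; the actual grader in `scenarios/` may aggregate differently. The function name and the sample scores are illustrative.

```python
SHARED = {"p1_rca": 0.25, "p1_efficiency": 0.15, "patch_quality": 0.35}

# The last 25 % slot depends on whether the scenario is a spurious issue.
WEIGHTS_GENUINE = {**SHARED, "p2_efficiency": 0.25}
WEIGHTS_SPURIOUS = {**SHARED, "no_change_detection": 0.25}


def final_grade(components, spurious=False):
    """Weighted sum of component scores; missing components count as 0."""
    weights = WEIGHTS_SPURIOUS if spurious else WEIGHTS_GENUINE
    return sum(w * components.get(name, 0.0) for name, w in weights.items())


# Made-up component scores for a genuine-issue episode.
grade = final_grade(
    {"p1_rca": 1.0, "p1_efficiency": 0.8, "patch_quality": 0.5, "p2_efficiency": 0.6}
)
```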

Plus the counterfactual cross-phase reward:

r_cross(τ) = max(0, r_code(τ_2 | context(τ_1)) − r_code(τ_2 | ∅))
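
In code, the formula is a clipped difference: score the Phase-2 trajectory once with the Phase-1 context and once without it, and keep only the positive part. The sketch below takes the two `r_code` values as plain numbers; how they are actually computed is not shown here, and the argument names are illustrative.

```python
def r_cross(r_code_with_context, r_code_without_context):
    """Counterfactual cross-phase reward: credit only what Phase 1 added."""
    return max(0.0, r_code_with_context - r_code_without_context)


# Phase-1 context helped: the patch scores higher with it than without.
helped = r_cross(0.8, 0.5)

# Phase-1 context did not help: the difference is clipped to zero.
no_help = r_cross(0.4, 0.6)
```

The clipping means a misleading Phase-1 investigation is never penalised through this term; it simply earns no cross-phase credit.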

📈 Headline result

| Model | Mean cumulative reward (≈30 steps) | Steps to plateau | σ at plateau |
| --- | --- | --- | --- |
| Base (Qwen2.5-7B-Instruct) | ~0.20 | never within 60 | wide |
| SFT (LoRA) | ~0.95 | ~50 | medium |
| Post-trained (GRPO + merge) | ~1.59 | ~25 | tight |

Full plots, ablations, and component breakdown in BLOG.md §7–8.

🚀 Quick start

Run the environment locally

pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000

curl http://localhost:8000/health
curl -X POST http://localhost:8000/reset \
     -H "Content-Type: application/json" \
     -d '{"task_name": "memory_leak"}'
curl -X POST http://localhost:8000/step \
     -H "Content-Type: application/json" \
     -d '{"action_type": "view_alerts"}'
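
The same calls can be made from Python with only the standard library. The sketch below mirrors the `curl` commands above: the `/reset` and `/step` paths and their JSON bodies come from this README, while the helper names and the assumption that responses are JSON are mine. The requests are built but not sent, so the snippet runs without a server.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"


def build_post(path, payload, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for the env server."""
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req, timeout=30):
    """Send a prepared request and decode the response, assumed to be JSON."""
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


reset_req = build_post("/reset", {"task_name": "memory_leak"})
step_req = build_post("/step", {"action_type": "view_alerts"})

# With the server running locally:
#   observation = send(reset_req)
#   result = send(step_req)
```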

Run the trained agent

export ENV_BASE_URL=http://localhost:8000
python inference.py --model Yaswanth-Bolla/qwen-merged

Run the agent against a real GitHub issue + repo

python inference_agent.py \
    --model  Yaswanth-Bolla/qwen-merged \
    --repo   /path/to/cloned/repo \
    --issue  https://github.com/owner/repo/issues/42

Docker

docker build -t incident-env .
docker run -p 8000:8000 incident-env

🏋️ Reproducing the training run

We ran every stage on HuggingFace Jobs (A100-40GB) — see ./logger/ for the exact scripts and their full stdout.

# Stage 1 — collect baseline trajectories (HF Inference API)
python sre_finetune_collector.py            # → sre_*_dataset.jsonl

# Stage 2 — LoRA SFT via TRL
python sft.py \
    --model_name_or_path Qwen/Qwen2.5-7B-Instruct \
    --dataset_name <your-sft-dataset> \
    --use_peft --lora_r 32 --lora_alpha 16 \
    --learning_rate 2e-4 --num_train_epochs 1 \
    --packing --eos_token '<|im_end|>' \
    --output_dir Qwen2.5-7B-SRE-SFT --push_to_hub

# Stage 3+4 — online GRPO (Pool A → B → C)
python training/grpo_train.py \
    --model     <your-sft-checkpoint> \
    --stages    2 3 4 \
    --group_size 4 --episodes_per_task 64 \
    --use_lora --lora_r 16 --lora_alpha 32 \
    --push_to_hub daemongg/qwen2.5-7b-sre-grpo

# Stage 5 — merge LoRA into base
python merge.py

Logs from these exact runs:

| Stage | Log |
| --- | --- |
| Trajectory collection | `logger/trajectory.log` |
| SFT | `logger/sft_finetune.log` |
| GRPO | `logger/grpo_finetune.log` |
| Merge | `logger/merge.log` |
| Ablations | `logger/ablation.log` |

🗂️ Repository layout

.
├── BLOG.md                    # Full write-up (start here)
├── README.md                  # This file
├── ablation.md                # Ablation results table
├── openenv.yaml               # OpenEnv spec
├── server/                    # FastAPI server + IncidentEnvironment + CodeWorkspace
├── scenarios/                 # 10 scenarios, code-context registry, P2 grader
├── simulation/                # Reactive infra: services, metrics, logs, alerts
├── snapshots/                 # 8 mini-repo snapshots for Phase 2 (tree + git log + diffs)
├── training/                  # GRPO trainer, curriculum, variance gate, segment-GRPO loss
├── sft.py                     # TRL SFTTrainer entry point
├── merge.py                   # peft.merge_and_unload + push_to_hub
├── inference.py               # Run any LLM against the env
├── inference_agent.py         # Run the trained agent against a real repo + GitHub issue
├── sre_finetune_collector.py  # Stage-1 trajectory collector
├── assets/                    # Diagrams + result figures (referenced from BLOG.md)
└── logger/                    # ★ Full HF Jobs logs + the scripts that produced them

💬 Example interaction

Agent: POST /reset {"task_name": "memory_leak"}
  → Incident triggered: "Orders service experiencing failures..."
  → Services: orders=degraded, rest=healthy

Agent: POST /step {"action_type": "view_alerts"}
  → 3 alerts: orders HighMemoryUsage (critical), HighErrorRate, HighLatencyP99
  → reward = +0.13

Agent: POST /step {"action_type": "check_metrics", "target_service": "orders"}
  → 30 data points: memory climbing 35 % → 78 % over 20 min
  → reward = +0.13

Agent: POST /step {"action_type": "check_deploy_history", "target_service": "orders"}
  → v2.3.1 (20 min ago, "batch order processing") · v1.2.0
  → reward = +0.13

Agent: POST /step {"action_type": "rollback_deploy", "target_service": "orders"}
  → "Rolled back orders v2.3.1 → v1.2.0 — service recovering"
  → reward = +0.28

Agent: POST /step {"action_type": "declare_root_cause",
                    "parameters": {"root_cause": "memory leak in orders caused by bad deploy v2.3.1"}}
  → Episode done. Final grade: 0.97

📜 License & credits

Environment, training scripts, scenarios: this repo.
Base model: Qwen/Qwen2.5-7B-Instruct (Apache-2.0).
Trainer: HuggingFace TRL (SFTTrainer) and our on-policy GRPO loop in training/grpo_train.py.
Built for the OpenEnv hackathon — see RULES.md.

For the full story, results, and ablations, read BLOG.md.