title: SRE Incident Response Simulator
emoji: 🚨
colorFrom: red
colorTo: gray
sdk: docker
app_port: 8000
pinned: false

🚨 SRE Triage Bot: OpenEnv Incident Response Simulator

An OpenEnv environment plus a four-stage GRPO pipeline that turns Qwen2.5-7B-Instruct into a working SRE triage agent. The agent runs against a reactive, partially observable microservices simulation with two phases: ops investigation (logs, metrics, alerts, deploy history) and code attribution (a sandboxed mini-repo with git log and diffs).


🔗 Important Links

| Resource | Link |
| --- | --- |
| 📝 Blog post (full write-up) | BLOG.md |
| 🛰️ Live environment (HF Space) | Meta-HF-hackathon/updated-policy |
| 🧠 Merged model (deployable) | Yaswanth-Bolla/qwen-merged |
| 🧩 LoRA adapter (post-GRPO) | daemongg/qwen2.5-7b-sre-grpo |
| 🏗️ Base model | Qwen/Qwen2.5-7B-Instruct |
| 📒 Logs + scripts | logger |

⚠️ Note on training infrastructure. We ran the full pipeline (SFT, GRPO, merge) on HuggingFace Jobs (A100-40GB) instead of a Colab notebook: Colab's free + Pro tiers OOM'd on the 7B base + reference model + GRPO group buffers. The complete training logs and the exact scripts we executed are committed under ./logger/ (sft_finetune.log, grpo_finetune.log, merge.log, trajectory.log, ablation.log, plus the .py scripts that produced them) so the run is reproducible end-to-end.


🎯 What this submission delivers

  • A novel two-phase POMDP environment with hierarchical, masked actions (10 ops actions + 7 code actions).
  • A two-layer reward: dense oracle-shaped per-step signal for training, oracle-independent grader for evaluation.
  • A counterfactual cross-phase reward (r_cross) that makes joint training meaningful.
  • A four-pool curriculum (A → B → C, with held-out D) executed via on-policy GRPO with a variance gate and r_cross warmup.
  • Real measured improvement: mean cumulative reward ≈1.59 (RL) vs ≈0.49 (base) at less than half the steps. See BLOG.md §7 and ablation.md.
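
The variance gate mentioned in the list above can be pictured with a small sketch. This is not the code in training/grpo_train.py; it only illustrates the general idea that a GRPO group whose rollouts all score the same yields zero group-relative advantage and can be skipped. The function name, the `min_std` threshold, and the `eps` constant are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, min_std=1e-3, eps=1e-8):
    """Return GRPO-style advantages for one group of rollouts, or None if gated.

    rewards: cumulative episode rewards for rollouts of the same task.
    min_std: hypothetical gate threshold; groups with a smaller reward
             spread carry almost no learning signal and are skipped.
    """
    if len(rewards) < 2:
        return None
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma < min_std:  # variance gate: every rollout scored about the same
        return None
    # Group-relative advantage: centre on the group mean, scale by the spread.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With `--group_size 4`, a group scoring `[1.0, 1.0, 1.0, 1.0]` would be dropped by this gate, while `[0.0, 1.0, 2.0, 3.0]` would produce advantages that sum to roughly zero.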

📐 Environment at a glance

A Partially-Observable Markov Decision Process over a reactive microservices simulator. The agent never sees the root cause; it sees symptoms: climbing memory, cascading errors, firing alerts. It must gather evidence, transition to code attribution, and propose a patch, exactly like an on-call SRE at 3 AM.

| Dimension | Detail |
| --- | --- |
| Observation | Alerts · metric timeseries · structured logs · dependency graphs · deploy history · sandboxed repo tree + git log |
| Action space | Phase 1: 10 ops actions × 7 services. Phase 2: 5 code-exploration + 2 terminal actions. |
| Difficulty | Easy (single-service leak) → Medium (cascade) → Hard (distributed deadlock) → 5 research tasks → 2 held-out compounds |
| Reward | Oracle-shaped per-step signal for training + oracle-independent grader for eval + counterfactual r_cross |
| Realism | Reactive simulation: memory climbs, cascades propagate, restarts don't fix root causes |

Topology

   ┌─────────┐   ┌─────┐    ┌────────┐   ┌─────────┐
   │ API GW  │──►│Auth │──► │ Orders │──►│ Payment │
   └────┬────┘   └─────┘    └───┬────┘   └────┬────┘
        ▼                       ▼             ▼
   ┌─────────┐            ┌─────────┐   ┌─────────┐
   │  Cache  │            │   DB    │   │  Queue  │
   └─────────┘            └─────────┘   └─────────┘

🔧 Action space (hierarchical + masked)

Phase 1: ops investigation

| Action | Category | Description |
| --- | --- | --- |
| view_alerts | diagnostic | List firing alerts |
| query_logs | diagnostic | Service logs (level/keyword filters) |
| check_metrics | diagnostic | 30-min metric time series |
| check_dependencies | diagnostic | Up/downstream dependency map |
| check_deploy_history | diagnostic | Recent deploys per service |
| run_health_check | diagnostic | Ping a service |
| restart_service | remediation | Temporary fix |
| rollback_deploy | remediation | Real fix if root cause |
| scale_service | remediation | More replicas |
| declare_root_cause | terminal | Diagnosis string |
| transition_to_phase2 | control | Hand off to code attribution |

Phase 2: code attribution

| Action | What it returns |
| --- | --- |
| list_dir | Files + subdirs at relative path |
| read_file | Up to 64 KB of file contents |
| search_code | grep across the tree (≤50 hits) |
| get_git_log | Commit metadata for a path |
| get_file_diff | Unified diff for (commit_sha, path) |
| propose_patch | Terminal: submit a unified diff |
| declare_no_change | Terminal: for spurious-issue scenarios |

Action masking: every observation includes valid_actions[]. Illegal actions (e.g. rollback on a service with no deploy history) cost -0.05 and are recorded for analysis.
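
A client can check the mask before sending an action. The sketch below assumes `valid_actions` is a flat list of action-type strings in the observation dict; the real environment may encode the mask differently (for example per service). Only the -0.05 penalty comes from this README; the function and constant names are illustrative.

```python
ILLEGAL_ACTION_PENALTY = -0.05  # value stated in this README


def check_action(observation, action):
    """Return (is_legal, penalty) for an action dict against the current mask.

    Assumes observation["valid_actions"] is a list of action-type strings;
    this layout is a guess for illustration, not the server's schema.
    """
    valid = set(observation.get("valid_actions", []))
    if action.get("action_type") in valid:
        return True, 0.0
    return False, ILLEGAL_ACTION_PENALTY
```

An agent loop could use this to resample an action locally instead of paying the penalty on the server.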


👁️ Observation space (POMDP)

The agent never sees: fault_type, is_bad deploy flag, internal simulation state.

It does see:

  • Incident summary + severity (SEV1 / SEV2 / SEV3)
  • Service statuses (healthy / degraded / down)
  • Active alert count
  • Action result (data from the most recent action)
  • valid_actions[] (action mask)
  • Time elapsed / budget (SLA pressure)
  • Cumulative reward and step count
  • current_phase ∈ {1, 2}
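
One way to picture the hidden/visible split above is a projection that strips hidden keys from the simulator state before it reaches the agent. This is a minimal sketch, not the server's implementation: the key names in `HIDDEN_FIELDS` (in particular `internal_state`) and the dict layout are assumptions.

```python
# Key names are assumptions based on the "never sees" list above.
HIDDEN_FIELDS = {"fault_type", "is_bad", "internal_state"}


def to_observation(full_state):
    """Project a full simulator state onto what the agent is allowed to see."""
    def scrub(value):
        if isinstance(value, dict):
            # Drop hidden keys at every nesting level.
            return {k: scrub(v) for k, v in value.items() if k not in HIDDEN_FIELDS}
        if isinstance(value, list):
            return [scrub(v) for v in value]
        return value

    return scrub(full_state)
```

Scrubbing recursively matters because a flag such as `is_bad` would naturally sit inside each deploy record rather than at the top level.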

📋 Tasks (10 scenarios, 4 pools)

| Task | Difficulty | Hidden lesson |
| --- | --- | --- |
| memory_leak | easy | Single service, noisy metric: restart only buys minutes |
| cascading_failure | medium | Loud services aren't the cause: walk the dep graph |
| distributed_deadlock | hard | Three remediation actions in a specific order |
| aliased_fault | research | Symptoms alias across fault families |
| severity_inversion | research | SEV1 page, two-line code fix |
| confidence_inversion | research | Loud alerts on the wrong service |
| info_ordering | research | Decisive evidence shows up late |
| circuit_breaker_noop | research | Spurious issue: declare_no_change is correct |
| heldout_aliased_severity | held-out | Compound; never seen during training |
| heldout_confidence_ordering | held-out | Compound; never seen during training |

Pools: A (p1_only), B (p2_only with oracle handoff), C (joint with r_cross), D (held-out generalisation).


🎁 Reward design (two layers)

Layer 1: per-step shaped reward (training only)

| Action | Condition | Reward |
| --- | --- | --- |
| Diagnostic | involved service | +0.15 |
| Diagnostic | uninvolved service | +0.05 |
| Any | repeat | −0.05 |
| Remediation | correct target (root cause svc) | +0.30 |
| Remediation | helpful (affected, not root) | +0.10 |
| Remediation | harmful (healthy svc) | −0.15 |
| Declaration | correct root cause | +0.40 |
| Declaration | wrong root cause | −0.20 |
| Any | per-step efficiency cost | −0.02 |
| Completion | all services healthy | +0.20 |
| Completion | budget exceeded | −0.10 |
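
The Layer 1 values combine as a lookup plus the per-step cost. The sketch below uses only the numbers listed above; the category/condition labels are illustrative, and treating the repeat penalty as additive on top of the base reward is an assumption. Completion bonuses are left out for brevity.

```python
STEP_COST = -0.02       # per-step efficiency cost
REPEAT_PENALTY = -0.05  # assumed to stack on top of the base reward

# (category, condition) -> base reward; the key labels are illustrative.
SHAPED = {
    ("diagnostic", "involved"): 0.15,
    ("diagnostic", "uninvolved"): 0.05,
    ("remediation", "root_cause"): 0.30,
    ("remediation", "affected"): 0.10,
    ("remediation", "healthy"): -0.15,
    ("declaration", "correct"): 0.40,
    ("declaration", "wrong"): -0.20,
}


def step_reward(category, condition, repeated=False):
    """Shaped reward for one step: base value plus step cost, plus repeat penalty."""
    reward = SHAPED[(category, condition)] + STEP_COST
    if repeated:
        reward += REPEAT_PENALTY
    return round(reward, 2)
```

Under these assumptions a diagnostic on an involved service nets +0.13 and a rollback on the root-cause service nets +0.28, which matches the rewards in the example interaction in this README.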

Layer 2: oracle-independent grader (evaluation)

| Component | Weight | Measures |
| --- | --- | --- |
| p1_rca | 25% | Did the agent declare the correct root cause? |
| p1_efficiency | 15% | Fewer steps to declare = better |
| patch_quality | 35% | File overlap (Jaccard) + AST hunk similarity + syntax validity |
| no_change_detection | 25% | Correct declare_no_change on spurious-issue scenarios |
| p2_efficiency | 25% | Phase-2 step efficiency (replaces no_change slot when valid issue) |
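
The weights above add up to 100% once the fourth slot is resolved to either no_change_detection or p2_efficiency. A minimal sketch of that combination follows; it assumes each component score is already normalised to [0, 1] and is not the grader shipped in scenarios/.

```python
WEIGHTS = {"p1_rca": 0.25, "p1_efficiency": 0.15, "patch_quality": 0.35}
FOURTH_SLOT_WEIGHT = 0.25  # shared by no_change_detection and p2_efficiency


def final_grade(scores, spurious_issue):
    """Weighted sum of component scores, each assumed to lie in [0, 1].

    spurious_issue selects which component fills the fourth 25% slot.
    Missing components count as 0.0 (an assumption for this sketch).
    """
    slot = "no_change_detection" if spurious_issue else "p2_efficiency"
    total = sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())
    return total + FOURTH_SLOT_WEIGHT * scores.get(slot, 0.0)
```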

Plus the counterfactual cross-phase reward:

r_cross(τ) = max(0, r_code(τ_2 | context(τ_1)) − r_code(τ_2 | ∅))
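
Read literally, the formula scores the Phase 2 trajectory twice, once with the Phase 1 context and once without, and keeps only the positive difference. The sketch below follows that reading; `r_code` is passed in as a callable because its real signature is not given here, and whether the no-context score comes from re-scoring or from re-rolling Phase 2 is left open.

```python
def r_cross(r_code, phase2_trajectory, phase1_context):
    """Counterfactual cross-phase reward, clipped at zero.

    r_code: callable (trajectory, context_or_None) -> float; a stand-in
            for the code-phase scorer, whose real interface is not shown here.
    """
    with_context = r_code(phase2_trajectory, phase1_context)
    without_context = r_code(phase2_trajectory, None)
    # Only reward Phase 1 context that actually improved the Phase 2 outcome.
    return max(0.0, with_context - without_context)
```

The clip at zero means unhelpful or misleading Phase 1 context is not punished through this term; it simply earns nothing.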

📈 Headline result

| Model | Mean cumulative reward (≈30 steps) | Steps to plateau | σ at plateau |
| --- | --- | --- | --- |
| Base (Qwen2.5-7B-Instruct) | ~0.20 | never within 60 | wide |
| SFT (LoRA) | ~0.95 | ~50 | medium |
| Post-trained (GRPO + merge) | ~1.59 | ~25 | tight |

Full plots, ablations, and component breakdown in BLOG.md §7–8.


🚀 Quick start

Run the environment locally

pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
curl http://localhost:8000/health
curl -X POST http://localhost:8000/reset \
     -H "Content-Type: application/json" \
     -d '{"task_name": "memory_leak"}'
curl -X POST http://localhost:8000/step \
     -H "Content-Type: application/json" \
     -d '{"action_type": "view_alerts"}'
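
The same calls can be made from Python using only the standard library. This is a sketch that mirrors the curl commands above; it assumes the server answers with JSON bodies, and the helper names are illustrative rather than part of this repo.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl examples


def build_request(path, payload, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for the environment server."""
    return urllib.request.Request(
        url=f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post(path, payload, base_url=BASE_URL, timeout=30):
    """Send the request and decode the response, assuming a JSON body."""
    request = build_request(path, payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def run_quickstart():
    """Reset the memory_leak task and take one step (needs a running server)."""
    post("/reset", {"task_name": "memory_leak"})
    return post("/step", {"action_type": "view_alerts"})
```

Calling `run_quickstart()` with the server from the commands above running should return the observation produced by the `view_alerts` step.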

Run the trained agent

export ENV_BASE_URL=http://localhost:8000
python inference.py --model Yaswanth-Bolla/qwen-merged

Run the agent against a real GitHub issue + repo

python inference_agent.py \
    --model  Yaswanth-Bolla/qwen-merged \
    --repo   /path/to/cloned/repo \
    --issue  https://github.com/owner/repo/issues/42

Docker

docker build -t incident-env .
docker run -p 8000:8000 incident-env

πŸ‹οΈ Reproducing the training run

We ran every stage on HuggingFace Jobs (A100-40GB); see ./logger/ for the exact scripts and their full stdout.

# Stage 1: collect baseline trajectories (HF Inference API)
python sre_finetune_collector.py            # → sre_*_dataset.jsonl

# Stage 2: LoRA SFT via TRL
python sft.py \
    --model_name_or_path Qwen/Qwen2.5-7B-Instruct \
    --dataset_name <your-sft-dataset> \
    --use_peft --lora_r 32 --lora_alpha 16 \
    --learning_rate 2e-4 --num_train_epochs 1 \
    --packing --eos_token '<|im_end|>' \
    --output_dir Qwen2.5-7B-SRE-SFT --push_to_hub

# Stage 3+4: online GRPO (Pool A → B → C)
python training/grpo_train.py \
    --model     <your-sft-checkpoint> \
    --stages    2 3 4 \
    --group_size 4 --episodes_per_task 64 \
    --use_lora --lora_r 16 --lora_alpha 32 \
    --push_to_hub daemongg/qwen2.5-7b-sre-grpo

# Stage 5: merge LoRA into base
python merge.py

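Logs from these exact runs are under ./logger/: sft_finetune.log, grpo_finetune.log, merge.log, trajectory.log, and ablation.log.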


🗂️ Repository layout

.
├── BLOG.md                    # Full write-up (start here)
├── README.md                  # This file
├── ablation.md                # Ablation results table
├── openenv.yaml               # OpenEnv spec
├── server/                    # FastAPI server + IncidentEnvironment + CodeWorkspace
├── scenarios/                 # 10 scenarios, code-context registry, P2 grader
├── simulation/                # Reactive infra: services, metrics, logs, alerts
├── snapshots/                 # 8 mini-repo snapshots for Phase 2 (tree + git log + diffs)
├── training/                  # GRPO trainer, curriculum, variance gate, segment-GRPO loss
├── sft.py                     # TRL SFTTrainer entry point
├── merge.py                   # peft.merge_and_unload + push_to_hub
├── inference.py               # Run any LLM against the env
├── inference_agent.py         # Run the trained agent against a real repo + GitHub issue
├── sre_finetune_collector.py  # Stage-1 trajectory collector
├── assets/                    # Diagrams + result figures (referenced from BLOG.md)
└── logger/                    # ★ Full HF Jobs logs + the scripts that produced them

💬 Example interaction

Agent: POST /reset {"task_name": "memory_leak"}
  → Incident triggered: "Orders service experiencing failures..."
  → Services: orders=degraded, rest=healthy

Agent: POST /step {"action_type": "view_alerts"}
  → 3 alerts: orders HighMemoryUsage (critical), HighErrorRate, HighLatencyP99
  → reward = +0.13

Agent: POST /step {"action_type": "check_metrics", "target_service": "orders"}
  → 30 data points: memory climbing 35 % → 78 % over 20 min
  → reward = +0.13

Agent: POST /step {"action_type": "check_deploy_history", "target_service": "orders"}
  → v2.3.1 (20 min ago, "batch order processing") · v1.2.0
  → reward = +0.13

Agent: POST /step {"action_type": "rollback_deploy", "target_service": "orders"}
  → "Rolled back orders v2.3.1 → v1.2.0, service recovering"
  → reward = +0.28

Agent: POST /step {"action_type": "declare_root_cause",
                    "parameters": {"root_cause": "memory leak in orders caused by bad deploy v2.3.1"}}
  → Episode done. Final grade: 0.97

📜 License & credits

  • Environment, training scripts, scenarios: this repo.
  • Base model: Qwen/Qwen2.5-7B-Instruct (Apache-2.0).
  • Trainer: HuggingFace TRL (SFTTrainer) and our on-policy GRPO loop in training/grpo_train.py.
  • Built for the OpenEnv hackathon; see RULES.md.

For the full story, results, and ablations, read BLOG.md.