title: SRE Incident Response Simulator
emoji: 🚨
colorFrom: red
colorTo: gray
sdk: docker
app_port: 8000
pinned: false

🚨 SRE Incident Response Simulator

An OpenEnv environment where AI agents must diagnose and remediate production incidents across a simulated microservices architecture.

Why This Environment Matters

This is a POMDP (Partially Observable Markov Decision Process). The agent never sees the root cause; it sees only symptoms such as climbing memory metrics, cascading error logs, and firing alerts. It must gather evidence, form hypotheses, and act, exactly like a real SRE at 3 AM.

| Dimension | Detail |
| --- | --- |
| Observation | Alerts, metric timeseries, structured logs, dependency graphs, deploy history |
| Action space | 10 action types × 7 target services, chosen hierarchically (type first, then target) |
| Difficulty | Easy (single-service leak) → Medium (cascading failure) → Hard (distributed deadlock) |
| Reward | Oracle-shaped per-step signal for training + oracle-independent grader for evaluation |
| Realism | Reactive simulation: memory climbs over time, cascades propagate, restarts don't fix root causes |

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                    SIMULATED INFRASTRUCTURE                      │
│                                                                  │
│   ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐    │
│   │ API GW  │────►│ Auth    │────►│ Orders  │────►│ Payment │    │
│   └────┬────┘     └─────────┘     └────┬────┘     └────┬────┘    │
│        │                               │               │         │
│        ▼                               ▼               ▼         │
│   ┌─────────┐                     ┌─────────┐     ┌─────────┐    │
│   │ Cache   │                     │   DB    │     │ Queue   │    │
│   └─────────┘                     └─────────┘     └─────────┘    │
└──────────────────────────────────────────────────────────────────┘

7 services with reactive metrics, logs, alerts, and dependency-aware cascade propagation.
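
The diagram can also be written down as an adjacency map, which may help when reasoning about what `check_dependencies` reports. The following is only a sketch: it assumes an arrow A → B means "A calls B", and the names `DEPENDS_ON`, `downstream`, and `upstream` are hypothetical, not taken from the environment's code.

```python
# Edges read off the architecture diagram (caller -> callees).
# Assumption: an arrow A -> B means "A calls B". The simulator's real
# data structure is not shown in this README.
DEPENDS_ON = {
    "api_gateway": ["auth", "cache"],
    "auth": ["orders"],
    "orders": ["payment", "database"],
    "payment": ["queue"],
    "cache": [],
    "database": [],
    "queue": [],
}


def downstream(service: str) -> list[str]:
    """Services that `service` calls directly."""
    return list(DEPENDS_ON[service])


def upstream(service: str) -> list[str]:
    """Services that call `service` directly."""
    return [caller for caller, callees in DEPENDS_ON.items() if service in callees]


print(downstream("orders"))  # ['payment', 'database']
print(upstream("orders"))    # ['auth']
```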


Action Space (Hierarchical)

Level 1: Action Type

| Action | Category | Description |
| --- | --- | --- |
| view_alerts | Diagnostic | See all firing alerts |
| query_logs | Diagnostic | Query service logs (with level/keyword filters) |
| check_metrics | Diagnostic | Get 30-minute metric timeseries |
| check_dependencies | Diagnostic | View upstream/downstream dependency map |
| check_deploy_history | Diagnostic | Recent deploys for a service |
| run_health_check | Diagnostic | Ping a service for status |
| restart_service | Remediation | Restart (fixes symptoms temporarily, not root cause) |
| rollback_deploy | Remediation | Roll back to previous deploy version |
| scale_service | Remediation | Scale replicas up/down |
| declare_root_cause | Terminal | Submit diagnosis; ends the episode |

Level 2: Target Service + Parameters

Targeted actions require target_service from: api_gateway, auth, orders, payment, cache, database, queue.
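
A small helper can catch a mistyped service name before the request is sent. This is a minimal sketch: the field names `action_type`, `target_service`, and `parameters` come from the example requests in this README, while `build_action` and the `UNTARGETED` set are hypothetical and assume that only `view_alerts` and `declare_root_cause` take no target.

```python
import json

SERVICES = {"api_gateway", "auth", "orders", "payment", "cache", "database", "queue"}

# Assumption: these two action types take no target, as in the example
# requests in this README; every other action type is treated as targeted.
UNTARGETED = {"view_alerts", "declare_root_cause"}


def build_action(action_type, target_service=None, parameters=None):
    """Return a /step payload dict, checking the target service name."""
    payload = {"action_type": action_type}
    if action_type not in UNTARGETED:
        if target_service not in SERVICES:
            raise ValueError(f"unknown target_service: {target_service!r}")
        payload["target_service"] = target_service
    if parameters:
        payload["parameters"] = parameters
    return payload


print(json.dumps(build_action("check_metrics", "orders")))
# {"action_type": "check_metrics", "target_service": "orders"}
```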

Action Masking

The observation includes valid_actions[]. Illegal actions (e.g., a rollback on a service with no deploy history) are rejected with a penalty.
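
An agent can consult the mask before acting so that it never pays the penalty for an illegal action. The sketch below assumes `valid_actions` is a flat list of action-type strings; the real format may differ, and `first_valid` is a hypothetical helper.

```python
def first_valid(preferred, valid_actions):
    """Return the first action type in `preferred` that the mask allows, else None."""
    allowed = set(valid_actions)
    for action_type in preferred:
        if action_type in allowed:
            return action_type
    return None


# Hypothetical observation fragment in which rollback_deploy is masked out.
observation = {"valid_actions": ["view_alerts", "query_logs", "restart_service"]}
choice = first_valid(["rollback_deploy", "restart_service"], observation["valid_actions"])
print(choice)  # restart_service
```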


Observation Space (POMDP)

The agent never sees fault_type, the is_bad deploy flag, or internal simulation state.

It does see:

  • Incident summary and severity
  • Service statuses (healthy/degraded/down)
  • Active alert count
  • Action result (data from the last action: logs, metrics, alerts, etc.)
  • Valid actions (action mask)
  • Time elapsed / budget (SLA pressure)
  • Cumulative reward and step count

Tasks

| Task | Description | Difficulty | Expected fix |
| --- | --- | --- | --- |
| memory_leak | Orders service OOM from bad deploy | Easy | Roll back orders deploy v2.3.1 |
| cascading_failure | Auth config change cascading to API GW + orders | Medium | Roll back auth deploy, restart dependents |
| distributed_deadlock | Payment retry change creates circular wait | Hard | Roll back payment, scale queue, restart orders |

Reward Design (Two-Layer)

Layer 1: Per-Step Training Rewards (Oracle-Shaped)

These rewards peek at hidden state to guide RL training:

| Action Category | Condition | Reward |
| --- | --- | --- |
| Diagnostic | Investigating involved service | +0.15 |
| Diagnostic | Investigating uninvolved service | +0.05 |
| Any | Repeating a previous action | -0.05 |
| Remediation | Correct target (root cause service) | +0.30 |
| Remediation | Helpful (affected, not root cause) | +0.10 |
| Remediation | Harmful (healthy service) | -0.15 |
| Declaration | Correct root cause | +0.40 |
| Declaration | Wrong root cause | -0.20 |
| Any | Per-step efficiency penalty | -0.02 |
| Completion | All services healthy | +0.20 |
| Completion | Time budget exceeded | -0.10 |
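
The rows appear to combine additively: a diagnostic action on an involved service earns +0.15 and pays the -0.02 step penalty, giving the +0.13 reported in the example interaction in this README. The sketch below works through that arithmetic. The names `BASE_REWARD` and `step_reward` are hypothetical, stacking the repeat penalty on top of the base reward is an assumption, and the completion bonuses are left out.

```python
STEP_PENALTY = -0.02    # per-step efficiency penalty from the table
REPEAT_PENALTY = -0.05  # applied when an action repeats a previous one

# Base rewards keyed by (action category, condition), copied from the table.
BASE_REWARD = {
    ("diagnostic", "involved"): 0.15,
    ("diagnostic", "uninvolved"): 0.05,
    ("remediation", "root_cause"): 0.30,
    ("remediation", "affected"): 0.10,
    ("remediation", "healthy"): -0.15,
    ("declaration", "correct"): 0.40,
    ("declaration", "wrong"): -0.20,
}


def step_reward(category, condition, repeated=False):
    """Sum the base reward and penalties for one step (additivity is assumed)."""
    reward = BASE_REWARD[(category, condition)] + STEP_PENALTY
    if repeated:
        reward += REPEAT_PENALTY
    return round(reward, 2)


print(step_reward("diagnostic", "involved"))     # 0.13
print(step_reward("remediation", "root_cause"))  # 0.28
```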

Layer 2: Evaluation Grader (Oracle-Independent)

The grader scores only the trajectory, with no hidden state access:

| Criterion | Weight | What it measures |
| --- | --- | --- |
| Root cause accuracy | 40% | Did the agent declare the correct root cause? |
| Remediation quality | 30% | Did the agent take the right fix actions? |
| Diagnostic efficiency | 20% | Fewer steps to diagnosis = better |
| Service restoration | 10% | Are all services healthy at episode end? |
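
One plausible reading of the weights is a weighted sum of per-criterion scores. The sketch below assumes each criterion is scored in [0, 1]. How the grader actually computes each sub-score is not described here, and `WEIGHTS` and `final_grade` are hypothetical names.

```python
WEIGHTS = {
    "root_cause_accuracy": 0.40,
    "remediation_quality": 0.30,
    "diagnostic_efficiency": 0.20,
    "service_restoration": 0.10,
}


def final_grade(scores):
    """Weighted sum of per-criterion scores, each assumed to lie in [0, 1]."""
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {scores[name]}")
    return round(sum(WEIGHTS[name] * scores[name] for name in WEIGHTS), 4)


# Hypothetical episode: correct diagnosis and fix, but a slow investigation.
example = {
    "root_cause_accuracy": 1.0,
    "remediation_quality": 1.0,
    "diagnostic_efficiency": 0.5,
    "service_restoration": 1.0,
}
print(final_grade(example))  # 0.9
```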

Quick Start

Local Development

# Install dependencies
cd incident_env
pip install -e .

# Start server
uvicorn incident_env.server.app:app --host 0.0.0.0 --port 8000

# Test endpoints
curl http://localhost:8000/health
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{"task_name": "memory_leak"}'
curl -X POST http://localhost:8000/step -H "Content-Type: application/json" -d '{"action_type": "view_alerts"}'
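
The same calls can be made from Python using only the standard library. This is a minimal sketch: the endpoints and payloads mirror the curl commands above, the helpers `build_post` and `post_json` are hypothetical, and decoding the response as JSON is an assumption about the server.

```python
import json
import urllib.request


def build_post(base_url, path, payload):
    """Build a JSON POST request for the environment server."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(base_url, path, payload, timeout=10.0):
    """Send the request and decode the response body (assumed to be JSON)."""
    request = build_post(base_url, path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the server from the previous step running on port 8000:
#   obs = post_json("http://localhost:8000", "/reset", {"task_name": "memory_leak"})
#   obs = post_json("http://localhost:8000", "/step", {"action_type": "view_alerts"})
```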

Run Inference

export OPENAI_API_KEY=sk-...
export MODEL_NAME=gpt-4o-mini
export ENV_BASE_URL=http://localhost:8000

python inference.py

Docker

docker build -t incident-env -f server/Dockerfile .
docker run -p 8000:8000 incident-env

Example Agent Interaction

Agent: POST /reset {"task_name": "memory_leak"}
  → Incident triggered: "Orders service experiencing failures..."
  → Services: orders=degraded, rest=healthy

Agent: POST /step {"action_type": "view_alerts"}
  → 3 alerts: orders HighMemoryUsage (critical), orders HighErrorRate, orders HighLatencyP99
  → reward = +0.13

Agent: POST /step {"action_type": "check_metrics", "target_service": "orders"}
  → 30 data points: memory climbing from 35% → 78% over 20 minutes
  → reward = +0.13

Agent: POST /step {"action_type": "check_deploy_history", "target_service": "orders"}
  → 2 deploys: v2.3.1 (20 min ago, "batch order processing") and v1.2.0
  → reward = +0.13

Agent: POST /step {"action_type": "rollback_deploy", "target_service": "orders"}
  → "Rolled back orders from v2.3.1 to v1.2.0 - service recovering"
  → reward = +0.28

Agent: POST /step {"action_type": "declare_root_cause", "parameters": {"root_cause": "memory leak in orders caused by bad deploy v2.3.1"}}
  → Episode done. Final grade: 0.97