--- title: SRE Incident Response emoji: "\U0001F6A8" colorFrom: red colorTo: yellow sdk: docker app_port: 8000 tags: - openenv pinned: false --- # SRE Incident Response Environment An OpenEnv-compatible reinforcement learning environment that simulates production incident response. AI agents must investigate microservice architectures, diagnose root causes, and apply fixes — just like a real on-call SRE engineer. ## Motivation Every tech company has on-call rotations, yet there's no standardized benchmark for evaluating AI agents on incident response. This environment fills that gap by simulating realistic production incidents with: - **Multi-service architectures** with dependency chains and cascading failures - **Progressive information revelation** — agents must actively investigate (read logs, check metrics, trace requests) - **Red herrings and misleading symptoms** — alerts point to symptoms, not root causes - **Concurrent faults** in the hardest tier — testing whether agents can find multiple independent root causes - **Realistic operational data** — 50+ log lines per service with noise, time-series metrics, distributed traces, deploy history, runbooks, and config diffs ## Service Architecture All tasks share the same 7-service microservice architecture: ``` +--------------+ +-------->| auth-service |<------+ | +------+-------+ | | | depends | depends +---------+------+ +------v------+ +-----+--------+ | api-gateway | | cache-redis | | notification | | (entry point) | +-------------+ | -service | +-+----------+---+ +--------------+ | | | depends | depends v v +------------+ +-----------------+ |user-service| |payment-service | +-----+------+ +--------+--------+ | depends | depends v v +----------------------------+ | db-postgres | +----------------------------+ ``` Each service has: name, status (`HEALTHY`/`DEGRADED`/`DOWN`), version, replica count, dependencies, logs, metrics, traces, deploy history, config, and runbook data. ## Tasks Tasks are auto-discovered from the `tasks/` directory. Each task is a self-contained Python file defining a `SCENARIO` object. | Task ID | Name | Difficulty | Max Steps | Root Cause | Fix Required | |---------|------|-----------|-----------|------------|--------------| | `easy` | Single Service OOM Crash | Easy | 15 | `auth-service` (OOM) | `restart_service(auth-service)` | | `medium` | Cascading Database Deadlock | Medium | 25 | `db-postgres` (deadlock) | `restart_service(db-postgres)` | | `hard` | Concurrent Faults + Misleading Evidence | Hard | 35 | `payment-service` (bad deploy) AND `cache-redis` (memory leak) | `rollback_deploy(payment-service, v3.8.1)` AND `restart_service(cache-redis)` | ### Task Details **Easy** — Alert directly names `auth-service` as down. Logs clearly show OOM crash cycle (heap growth, OOM kills, restart exhaustion). Single root cause, single fix. **Medium** — Alerts blame `payment-service` and `user-service` (both are victims). The real cause is a long-running analytics query deadlocking `db-postgres`. Agent must notice "writes fail but reads work", follow dependency chain to the database, and read `db-postgres` logs to find the deadlock. Red herring: `cache-redis` miss ratio alert (benign TTL expiry). **Hard** — Two independent faults at the same time: (1) `payment-service` has a bad deploy (v3.8.2, NullPointerException in new validator module), (2) `cache-redis` has a memory leak causing eviction storms that degrade `auth-service`. Red herrings: `user-service` config warnings (benign), `notification-service` queue backup (victim of auth-service). Agent must find and fix BOTH faults. After fixing only one, post-remediation check shows remaining services are still unhealthy. ### Adding New Tasks To add a new task: 1. Create a new file in `tasks/` (e.g., `tasks/my_new_task.py`) 2. Define a `SCENARIO = IncidentScenario(task_id="my_new_task", ...)` — see existing task files for the template 3. Done. The task loader in `tasks/__init__.py` auto-discovers any `.py` file that exports a `SCENARIO` object. No changes needed to the environment engine, grader, server, or inference script. The grader is generic — it reads ground truth (root cause services, required fixes, keywords, weights) from the scenario definition. ## Project Structure ``` IncidentResponse_RL/ ├── models.py # Pydantic models: Action, Observation, State, enums ├── openenv.yaml # OpenEnv manifest (tasks, models, runtime config) ├── requirements.txt # Python dependencies ├── Dockerfile # Container for HF Spaces deployment ├── inference.py # Baseline agent using OpenAI client ├── README.md │ ├── env/ # Core environment engine │ ├── __init__.py │ ├── scenario.py # IncidentScenario, ServiceConfig, RequiredFix dataclasses │ ├── environment.py # step() / reset() / state() implementation │ └── services.py # Alert generation, dependency cascade, data formatting │ ├── tasks/ # Task definitions (auto-discovered) │ ├── __init__.py # Auto-discovery loader → SCENARIOS dict │ ├── easy_oom.py # Easy: Single Service OOM Crash │ ├── medium_deadlock.py # Medium: Cascading Database Deadlock │ └── hard_concurrent.py # Hard: Concurrent Faults + Misleading Evidence │ ├── graders/ # Scoring engine │ ├── __init__.py │ └── grader.py # Generic rubric-based grader (0.0-1.0) │ └── server/ # FastAPI web server ├── __init__.py └── app.py # /reset, /step, /state, /tasks endpoints ``` ## Action Space All actions are sent as a single JSON object with an `action_type` field. Optional fields depend on the action type. ### Investigation Actions (read-only, gather information) | Action | Required Fields | Returns | |--------|----------------|---------| | `read_logs` | `service` | 50+ timestamped log lines with noise and signal | | `check_metrics` | `service` | Time-series table (CPU, memory, latency, error rate, etc.) | | `ping_service` | `service` | Reachability check with latency | | `check_dependencies` | `service` | Upstream dependency list with current health status | | `inspect_deploy` | `service` | Deploy history (version, timestamp, status) | | `query_traces` | `service` | Distributed trace spans showing latency breakdown | | `check_runbook` | `service` | Operational runbook with troubleshooting steps | | `diff_config` | `service` | Current vs previous config comparison | ### Remediation Actions (modify environment state) | Action | Required Fields | Effect | |--------|----------------|--------| | `restart_service` | `service` | Restarts pods. Fixes OOM/leak issues. No effect if root cause is elsewhere. | | `rollback_deploy` | `service`, `target_version` | Rolls back to specified version. Must match exact version string. | | `scale_up` | `service`, `replicas` | Increases replica count. Can alleviate memory pressure. | | `drain_traffic` | `service` | Stops routing traffic to the service. | ### Terminal Action | Action | Required Fields | Effect | |--------|----------------|--------| | `submit_diagnosis` | `root_cause_service`, `root_cause_category`, `fix_description` | Ends episode, triggers grading. | ### Root Cause Categories `oom_crash`, `db_deadlock`, `bad_deploy`, `memory_leak`, `network_partition`, `disk_full`, `config_error`, `cert_expiry`, `dns_failure`, `rate_limit` ### Example Actions ```json {"action_type": "read_logs", "service": "auth-service"} {"action_type": "check_metrics", "service": "db-postgres"} {"action_type": "rollback_deploy", "service": "payment-service", "target_version": "v3.8.1"} {"action_type": "submit_diagnosis", "root_cause_service": "db-postgres", "root_cause_category": "db_deadlock", "fix_description": "Restarted db-postgres to clear deadlock caused by analytics-cron query"} ``` ## Observation Space On `reset()`, the agent receives: - **Service health dashboard** — all 7 services with status (`HEALTHY`/`DEGRADED`/`DOWN`), version, replica count - **Active alerts** — severity-tagged alerts (SEV-1/SEV-2/SEV-3) - **Incident summary** — text description of the situation On each `step()`, the agent receives: - **Updated service statuses** — health may change after remediation - **Updated alerts** — alerts clear when services recover - **Action result** — the data returned by the action (logs, metrics, traces, etc.) - **Reward** — per-step reward signal - **Done flag** — whether the episode has ended - **Score** — final score (only on terminal step) ### Progressive Revelation The agent does NOT see all data upfront. It must actively choose which services to investigate and which data to request. Each investigation action consumes a step, creating a planning pressure: the agent must balance information gathering with remediation within the step budget. ### Post-Remediation Feedback After any remediation action, the observation includes a `[POST-REMEDIATION CHECK]` that lists which services are still unhealthy. This is critical for the hard task — after fixing only one of two faults, the check reveals remaining issues. ## Reward Function ### Per-Step Shaping | Action | Reward | |--------|--------| | Investigating a root-cause service | +0.01 | | Investigating a non-root-cause service | 0.00 | | Correct remediation (matches required fix) | +0.05 | | Wrong remediation (wrong service or wrong fix type) | -0.05 | ### Terminal Grading (0.0 - 1.0) The grader is generic and rubric-based. Each task defines its own weights: | Component | Easy | Medium | Hard | |-----------|------|--------|------| | Correct root cause service identified | 0.30 | 0.25 | 0.15 | | Correct root cause category | 0.20 | 0.20 | 0.10 | | Primary fix applied | 0.30 | 0.25 | 0.15 | | Secondary fix(es) applied | -- | -- | 0.20 | | Diagnosis text quality (keyword match) | 0.10 | 0.10 | 0.15 | | Investigation thoroughness | 0.10 | 0.10 | 0.10 | | Wrong remediation penalty | -0.03/ea | -0.05/ea | -0.05/ea | **Diagnosis text scoring** uses deterministic keyword matching — the grader checks if the fix description mentions key terms (service names, fault types, fix actions). No LLM-based judging. **Investigation thoroughness** checks whether the agent examined at least one root-cause service before submitting. ## Setup ### Local Development ```bash pip install -r requirements.txt python -m uvicorn server.app:app --host 0.0.0.0 --port 8000 ``` ### Docker ```bash docker build -t sre-incident-response . docker run -p 8000:8000 sre-incident-response ``` ### API Usage ```bash # List available tasks curl http://localhost:8000/tasks # Reset (start a new episode) curl -X POST http://localhost:8000/reset \ -H "Content-Type: application/json" \ -d '{"task_id": "easy"}' # Step (take an action) curl -X POST http://localhost:8000/step \ -H "Content-Type: application/json" \ -d '{"session_id": "", "action": {"action_type": "read_logs", "service": "auth-service"}}' # Get current episode state curl http://localhost:8000/state/ ``` OpenEnv-prefixed endpoints are also available: `/openenv/reset`, `/openenv/step`, `/openenv/state/{session_id}`, `/openenv/tasks`. ### Running Inference ```bash export HF_TOKEN=your_token export API_BASE_URL=https://router.huggingface.co/v1 export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct python inference.py ``` The inference script runs a baseline LLM agent against all tasks, emitting structured stdout logs: ``` [START] task=easy env=sre_incident_response model=Qwen/Qwen2.5-72B-Instruct [STEP] step=1 action=read_logs(auth-service) reward=0.01 done=false error=null [STEP] step=2 action=check_metrics(auth-service) reward=0.01 done=false error=null [STEP] step=3 action=restart_service(auth-service) reward=0.05 done=false error=null [STEP] step=4 action=submit_diagnosis reward=1.00 done=true error=null [END] success=true steps=4 score=1.00 rewards=0.01,0.01,0.05,1.00 ``` ## Baseline Scores | Task | Expected Score Range | What a Perfect Agent Scores | |------|---------------------|---------------------------| | easy | 0.70 - 0.95 | 1.00 | | medium | 0.40 - 0.75 | 0.90 | | hard | 0.20 - 0.55 | 0.85 | ## Environment Variables | Variable | Description | Default | |----------|-------------|---------| | `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` | | `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` | | `HF_TOKEN` | HuggingFace API key | Required | | `PORT` | Server port | `8000` | | `SRE_TASKS` | Comma-separated task IDs to run in inference | `easy,medium,hard` | ## OpenEnv Spec Compliance - `openenv.yaml` with metadata, task definitions, typed models, and runtime config - `step(action)` returns observation, reward, done, info - `reset()` returns initial observation - `state()` returns current episode metadata - Typed Pydantic models for Action, Observation, and State - 3 tasks with programmatic graders (easy, medium, hard) - Scores in 0.0-1.0 range with partial progress signals - Working Dockerfile for containerized execution - Baseline inference script (`inference.py`) with reproducible scores