title: SRE Incident Response
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
  - openenv
pinned: false

SRE Incident Response Environment

An OpenEnv-compatible reinforcement learning environment that simulates production incident response. AI agents must investigate microservice architectures, diagnose root causes, and apply fixes, just like a real on-call SRE.

Motivation

Most tech companies run on-call rotations, yet there is no standardized benchmark for evaluating AI agents on incident response. This environment aims to fill that gap by simulating realistic production incidents with:

  • Multi-service architectures with dependency chains and cascading failures
  • Progressive information revelation: agents must actively investigate (read logs, check metrics, trace requests)
  • Red herrings and misleading symptoms: alerts point to symptoms, not root causes
  • Concurrent faults in the hardest tier, testing whether agents can find multiple independent root causes
  • Realistic operational data: 50+ log lines per service with noise, time-series metrics, distributed traces, deploy history, runbooks, and config diffs

Service Architecture

All tasks share the same 7-service microservice architecture:

                    +--------------+
          +-------->| auth-service |<------+
          |         +------+-------+       |
          |                | depends       | depends
+---------+------+  +------v------+  +-----+--------+
|  api-gateway   |  | cache-redis |  | notification |
|  (entry point) |  +-------------+  |   -service   |
+-+----------+---+                   +--------------+
  |          |
  | depends  | depends
  v          v
+------------+  +-----------------+
|user-service|  |payment-service  |
+-----+------+  +--------+--------+
      | depends          | depends
      v                  v
+----------------------------+
|        db-postgres         |
+----------------------------+

Each service has: name, status (HEALTHY/DEGRADED/DOWN), version, replica count, dependencies, logs, metrics, traces, deploy history, config, and runbook data.
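The dependency edges in the diagram can be written down as a small graph, which makes it easy to ask which services are likely victims when one service fails. The sketch below is only an illustration: the `DEPENDS_ON` mapping is a reading of the diagram (the unlabeled api-gateway to auth-service arrow is assumed to be a dependency), and `affected_by` is a hypothetical helper, not part of the environment code.

```python
from collections import deque
from typing import Dict, List, Set

# Edges read off the diagram: service -> services it depends on.
# Assumption: the unlabeled api-gateway -> auth-service arrow is a dependency.
DEPENDS_ON: Dict[str, List[str]] = {
    "api-gateway": ["auth-service", "user-service", "payment-service"],
    "auth-service": ["cache-redis"],
    "notification-service": ["auth-service"],
    "user-service": ["db-postgres"],
    "payment-service": ["db-postgres"],
    "cache-redis": [],
    "db-postgres": [],
}


def affected_by(root: str) -> Set[str]:
    """Return every service that depends on `root`, directly or transitively."""
    # Invert the graph so we can walk from a faulty service to its dependents.
    dependents: Dict[str, List[str]] = {name: [] for name in DEPENDS_ON}
    for service, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].append(service)

    seen: Set[str] = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

Under these assumed edges, `affected_by("db-postgres")` yields user-service, payment-service, and api-gateway, which is the kind of cascade an agent has to trace back to the database.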

Tasks

Tasks are auto-discovered from the tasks/ directory. Each task is a self-contained Python file defining a SCENARIO object.

| Task ID | Name | Difficulty | Max Steps | Root Cause | Fix Required |
|---|---|---|---|---|---|
| easy | Single Service OOM Crash | Easy | 15 | auth-service (OOM) | restart_service(auth-service) |
| medium | Cascading Database Deadlock | Medium | 25 | db-postgres (deadlock) | restart_service(db-postgres) |
| hard | Concurrent Faults + Misleading Evidence | Hard | 35 | payment-service (bad deploy) AND cache-redis (memory leak) | rollback_deploy(payment-service, v3.8.1) AND restart_service(cache-redis) |

Task Details

Easy: The alert directly names auth-service as down. The logs clearly show an OOM crash cycle (heap growth, OOM kills, restart exhaustion). Single root cause, single fix.

Medium: The alerts blame payment-service and user-service (both are victims). The real cause is a long-running analytics query deadlocking db-postgres. The agent must notice that "writes fail but reads work", follow the dependency chain to the database, and read the db-postgres logs to find the deadlock. Red herring: a cache-redis miss ratio alert (benign TTL expiry).

Hard: Two independent faults occur at the same time: (1) payment-service has a bad deploy (v3.8.2, NullPointerException in the new validator module), and (2) cache-redis has a memory leak causing eviction storms that degrade auth-service. Red herrings: user-service config warnings (benign) and a notification-service queue backup (victim of auth-service). The agent must find and fix BOTH faults. After fixing only one, the post-remediation check shows that the remaining services are still unhealthy.
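As a rough illustration of the "fix BOTH faults" requirement, the required fixes from the task table can be held as data and compared against what the agent has applied so far. The tuple layout and the `remaining_fixes` helper are hypothetical and are not taken from the grader.

```python
# Required fixes per task, copied from the task table.
# Hypothetical layout: (action_type, service, target_version or None).
REQUIRED_FIXES = {
    "easy": [("restart_service", "auth-service", None)],
    "medium": [("restart_service", "db-postgres", None)],
    "hard": [
        ("rollback_deploy", "payment-service", "v3.8.1"),
        ("restart_service", "cache-redis", None),
    ],
}


def remaining_fixes(task_id, applied):
    """Return the required fixes for `task_id` that are not yet in `applied`."""
    return [fix for fix in REQUIRED_FIXES[task_id] if fix not in applied]
```

For the hard task, restarting cache-redis alone still leaves the payment-service rollback outstanding.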

Adding New Tasks

To add a new task:

  1. Create a new file in tasks/ (e.g., tasks/my_new_task.py)
  2. Define a SCENARIO = IncidentScenario(task_id="my_new_task", ...); see the existing task files for the template
  3. Done. The task loader in tasks/__init__.py auto-discovers any .py file that exports a SCENARIO object.

No changes are needed to the environment engine, grader, server, or inference script. The grader is generic: it reads ground truth (root cause services, required fixes, keywords, weights) from the scenario definition.
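The auto-discovery step could look roughly like the following. This is a sketch of the general technique (import every .py file in a directory and keep the ones that export SCENARIO), not the actual contents of tasks/__init__.py; the function name `discover_scenarios` and the use of a `task_id` attribute as the dictionary key are assumptions.

```python
import importlib.util
from pathlib import Path


def discover_scenarios(tasks_dir):
    """Load each .py file in `tasks_dir` and collect modules that export SCENARIO."""
    scenarios = {}
    for path in sorted(Path(tasks_dir).glob("*.py")):
        if path.name == "__init__.py":
            continue  # the loader itself is not a task
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        scenario = getattr(module, "SCENARIO", None)
        if scenario is not None:
            # Assumption: the scenario object exposes a `task_id` attribute.
            scenarios[scenario.task_id] = scenario
    return scenarios
```

Files without a SCENARIO attribute are skipped silently, so helper modules can live next to the task files.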

Project Structure

IncidentResponse_RL/
├── models.py                  # Pydantic models: Action, Observation, State, enums
├── openenv.yaml               # OpenEnv manifest (tasks, models, runtime config)
├── requirements.txt           # Python dependencies
├── Dockerfile                 # Container for HF Spaces deployment
├── inference.py               # Baseline agent using OpenAI client
├── README.md
│
├── env/                       # Core environment engine
│   ├── __init__.py
│   ├── scenario.py            # IncidentScenario, ServiceConfig, RequiredFix dataclasses
│   ├── environment.py         # step() / reset() / state() implementation
│   └── services.py            # Alert generation, dependency cascade, data formatting
│
├── tasks/                     # Task definitions (auto-discovered)
│   ├── __init__.py            # Auto-discovery loader → SCENARIOS dict
│   ├── easy_oom.py            # Easy: Single Service OOM Crash
│   ├── medium_deadlock.py     # Medium: Cascading Database Deadlock
│   └── hard_concurrent.py     # Hard: Concurrent Faults + Misleading Evidence
│
├── graders/                   # Scoring engine
│   ├── __init__.py
│   └── grader.py              # Generic rubric-based grader (0.0-1.0)
│
└── server/                    # FastAPI web server
    ├── __init__.py
    └── app.py                 # /reset, /step, /state, /tasks endpoints

Action Space

All actions are sent as a single JSON object with an action_type field. Optional fields depend on the action type.

Investigation Actions (read-only, gather information)

| Action | Required Fields | Returns |
|---|---|---|
| read_logs | service | 50+ timestamped log lines with noise and signal |
| check_metrics | service | Time-series table (CPU, memory, latency, error rate, etc.) |
| ping_service | service | Reachability check with latency |
| check_dependencies | service | Upstream dependency list with current health status |
| inspect_deploy | service | Deploy history (version, timestamp, status) |
| query_traces | service | Distributed trace spans showing latency breakdown |
| check_runbook | service | Operational runbook with troubleshooting steps |
| diff_config | service | Current vs previous config comparison |

Remediation Actions (modify environment state)

| Action | Required Fields | Effect |
|---|---|---|
| restart_service | service | Restarts pods. Fixes OOM/leak issues. No effect if root cause is elsewhere. |
| rollback_deploy | service, target_version | Rolls back to specified version. Must match exact version string. |
| scale_up | service, replicas | Increases replica count. Can alleviate memory pressure. |
| drain_traffic | service | Stops routing traffic to the service. |

Terminal Action

| Action | Required Fields | Effect |
|---|---|---|
| submit_diagnosis | root_cause_service, root_cause_category, fix_description | Ends episode, triggers grading. |

Root Cause Categories

oom_crash, db_deadlock, bad_deploy, memory_leak, network_partition, disk_full, config_error, cert_expiry, dns_failure, rate_limit

Example Actions

{"action_type": "read_logs", "service": "auth-service"}
{"action_type": "check_metrics", "service": "db-postgres"}
{"action_type": "rollback_deploy", "service": "payment-service", "target_version": "v3.8.1"}
{"action_type": "submit_diagnosis", "root_cause_service": "db-postgres", "root_cause_category": "db_deadlock", "fix_description": "Restarted db-postgres to clear deadlock caused by analytics-cron query"}
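Before sending an action, a client may want to check that it carries the fields listed in the action tables. The following is a minimal client-side sketch; `validate_action` is a hypothetical helper, and the server's own validation (done with Pydantic models, per the project structure) may behave differently.

```python
# Required fields per action type, taken from the action tables.
INVESTIGATION_ACTIONS = [
    "read_logs", "check_metrics", "ping_service", "check_dependencies",
    "inspect_deploy", "query_traces", "check_runbook", "diff_config",
]
REQUIRED_FIELDS = {name: ("service",) for name in INVESTIGATION_ACTIONS}
REQUIRED_FIELDS.update({
    "restart_service": ("service",),
    "rollback_deploy": ("service", "target_version"),
    "scale_up": ("service", "replicas"),
    "drain_traffic": ("service",),
    "submit_diagnosis": ("root_cause_service", "root_cause_category", "fix_description"),
})

ROOT_CAUSE_CATEGORIES = {
    "oom_crash", "db_deadlock", "bad_deploy", "memory_leak", "network_partition",
    "disk_full", "config_error", "cert_expiry", "dns_failure", "rate_limit",
}


def validate_action(action):
    """Return a list of problems; an empty list means the action looks well-formed."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        return [f"unknown action_type: {action_type!r}"]
    problems = [
        f"missing field: {field}"
        for field in REQUIRED_FIELDS[action_type]
        if action.get(field) in (None, "")
    ]
    category = action.get("root_cause_category")
    if action_type == "submit_diagnosis" and category and category not in ROOT_CAUSE_CATEGORIES:
        problems.append(f"unknown root_cause_category: {category!r}")
    return problems
```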

Observation Space

On reset(), the agent receives:

  • Service health dashboard: all 7 services with status (HEALTHY/DEGRADED/DOWN), version, replica count
  • Active alerts: severity-tagged alerts (SEV-1/SEV-2/SEV-3)
  • Incident summary: text description of the situation

On each step(), the agent receives:

  • Updated service statuses: health may change after remediation
  • Updated alerts: alerts clear when services recover
  • Action result: the data returned by the action (logs, metrics, traces, etc.)
  • Reward: per-step reward signal
  • Done flag: whether the episode has ended
  • Score: final score (only on terminal step)

Progressive Revelation

The agent does NOT see all data upfront. It must actively choose which services to investigate and which data to request. Each investigation action consumes a step, which creates planning pressure: the agent must balance information gathering against remediation within the step budget.

Post-Remediation Feedback

After any remediation action, the observation includes a [POST-REMEDIATION CHECK] that lists which services are still unhealthy. This is critical for the hard task β€” after fixing only one of two faults, the check reveals remaining issues.
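The logic behind such a check can be pictured as a filter over the service statuses. The helper below is purely illustrative: `still_unhealthy` is a hypothetical name, and the actual text format of the [POST-REMEDIATION CHECK] block is not reproduced here.

```python
def still_unhealthy(statuses):
    """Return the names of services whose status is not HEALTHY, sorted for stable output."""
    return sorted(name for name, status in statuses.items() if status != "HEALTHY")
```

In the hard task, a status map where payment-service has recovered but cache-redis and auth-service are still DEGRADED would report those two services.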

Reward Function

Per-Step Shaping

| Action | Reward |
|---|---|
| Investigating a root-cause service | +0.01 |
| Investigating a non-root-cause service | 0.00 |
| Correct remediation (matches required fix) | +0.05 |
| Wrong remediation (wrong service or wrong fix type) | -0.05 |
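The per-step shaping values can be expressed as a small function. This is a sketch of the values listed above only; the function name and its boolean parameters are hypothetical, and the reward attached to the terminal submit_diagnosis step is not modeled.

```python
def step_reward(kind, targets_root_cause=False, matches_required_fix=False):
    """Per-step shaping reward for an 'investigate' or 'remediate' action."""
    if kind == "investigate":
        # Small bonus only when the investigated service is a root cause.
        return 0.01 if targets_root_cause else 0.0
    if kind == "remediate":
        # Symmetric bonus/penalty for correct vs wrong remediation.
        return 0.05 if matches_required_fix else -0.05
    raise ValueError(f"unsupported action kind: {kind!r}")
```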

Terminal Grading (0.0 - 1.0)

The grader is generic and rubric-based. Each task defines its own weights:

| Component | Easy | Medium | Hard |
|---|---|---|---|
| Correct root cause service identified | 0.30 | 0.25 | 0.15 |
| Correct root cause category | 0.20 | 0.20 | 0.10 |
| Primary fix applied | 0.30 | 0.25 | 0.15 |
| Secondary fix(es) applied | -- | -- | 0.20 |
| Diagnosis text quality (keyword match) | 0.10 | 0.10 | 0.15 |
| Investigation thoroughness | 0.10 | 0.10 | 0.10 |
| Wrong remediation penalty | -0.03 each | -0.05 each | -0.05 each |

Diagnosis text scoring uses deterministic keyword matching: the grader checks if the fix description mentions key terms (service names, fault types, fix actions). No LLM-based judging.

Investigation thoroughness checks whether the agent examined at least one root-cause service before submitting.
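To make the rubric arithmetic concrete, the sketch below sums the weighted components and subtracts the per-task penalty. It is not the grader's code: the component keys, the idea that each component earns a fraction between 0 and 1, and the clamping to the 0.0-1.0 range are assumptions made for illustration.

```python
COMPONENTS = ("service", "category", "primary_fix", "secondary_fix", "keywords", "thoroughness")

# Weights per task in COMPONENTS order, copied from the rubric ("--" treated as 0).
WEIGHTS = {
    "easy": (0.30, 0.20, 0.30, 0.00, 0.10, 0.10),
    "medium": (0.25, 0.20, 0.25, 0.00, 0.10, 0.10),
    "hard": (0.15, 0.10, 0.15, 0.20, 0.15, 0.10),
}
PENALTY = {"easy": 0.03, "medium": 0.05, "hard": 0.05}


def terminal_score(task_id, earned, wrong_remediations=0):
    """Weighted sum of earned fractions minus penalties, clamped to [0, 1] (assumed)."""
    total = sum(w * earned.get(c, 0.0) for c, w in zip(COMPONENTS, WEIGHTS[task_id]))
    total -= PENALTY[task_id] * wrong_remediations
    return round(min(1.0, max(0.0, total)), 2)
```

With every component fully earned and no penalties, the weights sum to 1.00 for easy, 0.90 for medium, and 0.85 for hard.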

Setup

Local Development

pip install -r requirements.txt
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000

Docker

docker build -t sre-incident-response .
docker run -p 8000:8000 sre-incident-response

API Usage

# List available tasks
curl http://localhost:8000/tasks

# Reset (start a new episode)
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "easy"}'

# Step (take an action)
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"session_id": "<SESSION_ID>", "action": {"action_type": "read_logs", "service": "auth-service"}}'

# Get current episode state
curl http://localhost:8000/state/<SESSION_ID>

OpenEnv-prefixed endpoints are also available: /openenv/reset, /openenv/step, /openenv/state/{session_id}, /openenv/tasks.
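The same calls can be made from Python with only the standard library. The request payloads below mirror the curl examples; the helper names are hypothetical, and the shape of the JSON responses (for example, where the session ID appears in the /reset response) is not specified here, so inspect the response before relying on any field.

```python
import json
import urllib.request


def build_post(base_url, path, payload):
    """Build (but do not send) a JSON POST request for the given endpoint."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset_request(base_url, task_id):
    """Request that starts a new episode for `task_id`."""
    return build_post(base_url, "/reset", {"task_id": task_id})


def step_request(base_url, session_id, action):
    """Request that applies one action in an existing session."""
    return build_post(base_url, "/step", {"session_id": session_id, "action": action})


def send(request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A typical flow would be `send(reset_request("http://localhost:8000", "easy"))` followed by `send(step_request(...))` with the session ID taken from the reset response.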

Running Inference

export HF_TOKEN=your_token
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct

python inference.py

The inference script runs a baseline LLM agent against all tasks, emitting structured stdout logs:

[START] task=easy env=sre_incident_response model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=read_logs(auth-service) reward=0.01 done=false error=null
[STEP] step=2 action=check_metrics(auth-service) reward=0.01 done=false error=null
[STEP] step=3 action=restart_service(auth-service) reward=0.05 done=false error=null
[STEP] step=4 action=submit_diagnosis reward=1.00 done=true error=null
[END] success=true steps=4 score=1.00 rewards=0.01,0.01,0.05,1.00
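Because the log lines follow a fixed key=value layout, they are straightforward to parse for later analysis. The regex below is written against the sample lines shown above and is only a sketch; `parse_step` is a hypothetical helper, and lines that deviate from the sample format will simply return None.

```python
import re

# Pattern derived from the sample [STEP] lines; the action may contain parentheses.
STEP_PATTERN = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>.+?) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) error=(?P<error>.+)$"
)


def parse_step(line):
    """Parse one [STEP] log line into a dict, or return None if it does not match."""
    match = STEP_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        "step": int(match["step"]),
        "action": match["action"],
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        "error": None if match["error"] == "null" else match["error"],
    }
```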

Baseline Scores

| Task | Expected Score Range | What a Perfect Agent Scores |
|---|---|---|
| easy | 0.70 - 0.95 | 1.00 |
| medium | 0.40 - 0.75 | 0.90 |
| hard | 0.20 - 0.55 | 0.85 |

Environment Variables

| Variable | Description | Default |
|---|---|---|
| API_BASE_URL | LLM API endpoint | https://router.huggingface.co/v1 |
| MODEL_NAME | Model identifier | Qwen/Qwen2.5-72B-Instruct |
| HF_TOKEN | Hugging Face API key | Required |
| PORT | Server port | 8000 |
| SRE_TASKS | Comma-separated task IDs to run in inference | easy,medium,hard |
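Reading these variables with their documented defaults might look like the following. This is a sketch rather than the code in inference.py; the `load_config` name, the returned dictionary keys, and the choice to raise when HF_TOKEN is missing are assumptions.

```python
import os


def load_config(env=None):
    """Read settings from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token:
        # HF_TOKEN has no default, so fail early with a clear message.
        raise RuntimeError("HF_TOKEN is required")
    raw_tasks = env.get("SRE_TASKS", "easy,medium,hard")
    return {
        "api_base_url": env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        "model_name": env.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        "hf_token": token,
        "port": int(env.get("PORT", "8000")),
        "tasks": [task.strip() for task in raw_tasks.split(",") if task.strip()],
    }
```

Passing a plain dict as `env` keeps the function easy to test without touching the real process environment.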

OpenEnv Spec Compliance

  • openenv.yaml with metadata, task definitions, typed models, and runtime config
  • step(action) returns observation, reward, done, info
  • reset() returns initial observation
  • state() returns current episode metadata
  • Typed Pydantic models for Action, Observation, and State
  • 3 tasks with programmatic graders (easy, medium, hard)
  • Scores in 0.0-1.0 range with partial progress signals
  • Working Dockerfile for containerized execution
  • Baseline inference script (inference.py) with reproducible scores