| title: SRE Incident Response | |
| emoji: "\U0001F6A8" | |
| colorFrom: red | |
| colorTo: yellow | |
| sdk: docker | |
| app_port: 8000 | |
| tags: | |
| - openenv | |
| pinned: false | |
| # SRE Incident Response Environment | |
| An OpenEnv-compatible reinforcement learning environment that simulates production incident response. AI agents must investigate microservice architectures, diagnose root causes, and apply fixes, just like a real on-call SRE. | |
| ## Motivation | |
| Every tech company has on-call rotations, yet there's no standardized benchmark for evaluating AI agents on incident response. This environment fills that gap by simulating realistic production incidents with: | |
| - **Multi-service architectures** with dependency chains and cascading failures | |
| - **Progressive information revelation**: agents must actively investigate (read logs, check metrics, trace requests) | |
| - **Red herrings and misleading symptoms**: alerts point to symptoms, not root causes | |
| - **Concurrent faults** in the hardest tier: testing whether agents can find multiple independent root causes | |
| - **Realistic operational data**: 50+ log lines per service with noise, time-series metrics, distributed traces, deploy history, runbooks, and config diffs | |
| ## Service Architecture | |
| All tasks share the same 7-service microservice architecture: | |
| ``` | |
|                             +--------------+ | |
|                   +-------->| auth-service |<------+ | |
|                   |         +------+-------+       | | |
|                   |                | depends       | depends | |
|         +---------+------+  +------v------+  +-----+--------+ | |
|         |  api-gateway   |  | cache-redis |  | notification | | |
|         | (entry point)  |  +-------------+  |   -service   | | |
|         +-+----------+---+                   +--------------+ | |
|           |          | | |
|           | depends  | depends | |
|           v          v | |
| +------------+  +-----------------+ | |
| |user-service|  |payment-service  | | |
| +-----+------+  +--------+--------+ | |
|       | depends          | depends | |
|       v                  v | |
| +----------------------------+ | |
| |        db-postgres         | | |
| +----------------------------+ | |
| ``` | |
| Each service has: name, status (`HEALTHY`/`DEGRADED`/`DOWN`), version, replica count, dependencies, logs, metrics, traces, deploy history, config, and runbook data. | |
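The dependency edges in the diagram determine how a failure spreads. As a minimal sketch (the `DEPENDS_ON` mapping and the `affected_by` helper are illustrative names, not part of the repository, and the edges are as read from the diagram above), the set of services affected by a failing dependency can be computed with a breadth-first walk:

```python
from collections import deque

# Edges as read from the architecture diagram: service -> services it depends on.
# This mapping is an illustration, not the environment's internal representation.
DEPENDS_ON = {
    "api-gateway": ["auth-service", "user-service", "payment-service"],
    "auth-service": ["cache-redis"],
    "notification-service": ["auth-service"],
    "user-service": ["db-postgres"],
    "payment-service": ["db-postgres"],
    "cache-redis": [],
    "db-postgres": [],
}


def affected_by(failed: str) -> set:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the graph so we can walk from a dependency to its dependents.
    dependents = {name: [] for name in DEPENDS_ON}
    for service, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].append(service)

    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

Under these assumed edges, a fault in `db-postgres` reaches `user-service`, `payment-service`, and `api-gateway`, while a fault in `cache-redis` reaches `auth-service`, `notification-service`, and `api-gateway`.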
| ## Tasks | |
| Tasks are auto-discovered from the `tasks/` directory. Each task is a self-contained Python file defining a `SCENARIO` object. | |
| | Task ID | Name | Difficulty | Max Steps | Root Cause | Fix Required | | |
| |---------|------|-----------|-----------|------------|--------------| | |
| | `easy` | Single Service OOM Crash | Easy | 15 | `auth-service` (OOM) | `restart_service(auth-service)` | | |
| | `medium` | Cascading Database Deadlock | Medium | 25 | `db-postgres` (deadlock) | `restart_service(db-postgres)` | | |
| | `hard` | Concurrent Faults + Misleading Evidence | Hard | 35 | `payment-service` (bad deploy) AND `cache-redis` (memory leak) | `rollback_deploy(payment-service, v3.8.1)` AND `restart_service(cache-redis)` | | |
| ### Task Details | |
| **Easy**: The alert directly names `auth-service` as down. Logs clearly show an OOM crash cycle (heap growth, OOM kills, restart exhaustion). Single root cause, single fix. | |
| **Medium**: Alerts blame `payment-service` and `user-service` (both are victims). The real cause is a long-running analytics query deadlocking `db-postgres`. The agent must notice that writes fail while reads work, follow the dependency chain to the database, and read the `db-postgres` logs to find the deadlock. Red herring: `cache-redis` miss ratio alert (benign TTL expiry). | |
| **Hard**: Two independent faults at the same time: (1) `payment-service` has a bad deploy (v3.8.2, NullPointerException in the new validator module), (2) `cache-redis` has a memory leak causing eviction storms that degrade `auth-service`. Red herrings: `user-service` config warnings (benign), `notification-service` queue backup (victim of `auth-service`). The agent must find and fix BOTH faults. After fixing only one, the post-remediation check shows that the remaining services are still unhealthy. | |
| ### Adding New Tasks | |
| To add a new task: | |
| 1. Create a new file in `tasks/` (e.g., `tasks/my_new_task.py`) | |
| 2. Define a `SCENARIO = IncidentScenario(task_id="my_new_task", ...)` (see existing task files for the template) | |
| 3. Done. The task loader in `tasks/__init__.py` auto-discovers any `.py` file that exports a `SCENARIO` object. | |
| No changes needed to the environment engine, grader, server, or inference script. The grader is generic: it reads ground truth (root cause services, required fixes, keywords, weights) from the scenario definition. | |
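The auto-discovery step could work roughly as follows. This is a sketch of the general technique using the standard library, not the actual contents of `tasks/__init__.py`; the function name `discover_scenarios` is hypothetical, and it assumes each scenario object exposes a `task_id` attribute.

```python
import importlib.util
from pathlib import Path


def discover_scenarios(tasks_dir: str) -> dict:
    """Map task_id -> SCENARIO for every .py file in tasks_dir that exports one."""
    scenarios = {}
    for path in sorted(Path(tasks_dir).glob("*.py")):
        if path.name == "__init__.py":
            continue  # the loader itself is not a task
        # Load the file as a standalone module without requiring it on sys.path.
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        scenario = getattr(module, "SCENARIO", None)
        if scenario is not None:
            # Assumption: the scenario object has a `task_id` attribute.
            scenarios[scenario.task_id] = scenario
    return scenarios
```

Files that do not define `SCENARIO` are skipped, which is what lets helper modules live next to task files without being registered as tasks.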
| ## Project Structure | |
| ``` | |
| IncidentResponse_RL/ | |
| ├── models.py              # Pydantic models: Action, Observation, State, enums | |
| ├── openenv.yaml           # OpenEnv manifest (tasks, models, runtime config) | |
| ├── requirements.txt       # Python dependencies | |
| ├── Dockerfile             # Container for HF Spaces deployment | |
| ├── inference.py           # Baseline agent using OpenAI client | |
| ├── README.md | |
| │ | |
| ├── env/                   # Core environment engine | |
| │   ├── __init__.py | |
| │   ├── scenario.py        # IncidentScenario, ServiceConfig, RequiredFix dataclasses | |
| │   ├── environment.py     # step() / reset() / state() implementation | |
| │   └── services.py        # Alert generation, dependency cascade, data formatting | |
| │ | |
| ├── tasks/                 # Task definitions (auto-discovered) | |
| │   ├── __init__.py        # Auto-discovery loader -> SCENARIOS dict | |
| │   ├── easy_oom.py        # Easy: Single Service OOM Crash | |
| │   ├── medium_deadlock.py # Medium: Cascading Database Deadlock | |
| │   └── hard_concurrent.py # Hard: Concurrent Faults + Misleading Evidence | |
| │ | |
| ├── graders/               # Scoring engine | |
| │   ├── __init__.py | |
| │   └── grader.py          # Generic rubric-based grader (0.0-1.0) | |
| │ | |
| └── server/                # FastAPI web server | |
|     ├── __init__.py | |
|     └── app.py             # /reset, /step, /state, /tasks endpoints | |
| ``` | |
| ## Action Space | |
| All actions are sent as a single JSON object with an `action_type` field. Optional fields depend on the action type. | |
| ### Investigation Actions (read-only, gather information) | |
| | Action | Required Fields | Returns | | |
| |--------|----------------|---------| | |
| | `read_logs` | `service` | 50+ timestamped log lines with noise and signal | | |
| | `check_metrics` | `service` | Time-series table (CPU, memory, latency, error rate, etc.) | | |
| | `ping_service` | `service` | Reachability check with latency | | |
| | `check_dependencies` | `service` | Upstream dependency list with current health status | | |
| | `inspect_deploy` | `service` | Deploy history (version, timestamp, status) | | |
| | `query_traces` | `service` | Distributed trace spans showing latency breakdown | | |
| | `check_runbook` | `service` | Operational runbook with troubleshooting steps | | |
| | `diff_config` | `service` | Current vs previous config comparison | | |
| ### Remediation Actions (modify environment state) | |
| | Action | Required Fields | Effect | | |
| |--------|----------------|--------| | |
| | `restart_service` | `service` | Restarts pods. Fixes OOM/leak issues. No effect if root cause is elsewhere. | | |
| | `rollback_deploy` | `service`, `target_version` | Rolls back to specified version. Must match exact version string. | | |
| | `scale_up` | `service`, `replicas` | Increases replica count. Can alleviate memory pressure. | | |
| | `drain_traffic` | `service` | Stops routing traffic to the service. | | |
| ### Terminal Action | |
| | Action | Required Fields | Effect | | |
| |--------|----------------|--------| | |
| | `submit_diagnosis` | `root_cause_service`, `root_cause_category`, `fix_description` | Ends episode, triggers grading. | | |
| ### Root Cause Categories | |
| `oom_crash`, `db_deadlock`, `bad_deploy`, `memory_leak`, `network_partition`, `disk_full`, `config_error`, `cert_expiry`, `dns_failure`, `rate_limit` | |
| ### Example Actions | |
| ```json | |
| {"action_type": "read_logs", "service": "auth-service"} | |
| {"action_type": "check_metrics", "service": "db-postgres"} | |
| {"action_type": "rollback_deploy", "service": "payment-service", "target_version": "v3.8.1"} | |
| {"action_type": "submit_diagnosis", "root_cause_service": "db-postgres", "root_cause_category": "db_deadlock", "fix_description": "Restarted db-postgres to clear deadlock caused by analytics-cron query"} | |
| ``` | |
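Before sending an action, a client can check it against the required-field tables above. The following is a minimal sketch; `REQUIRED_FIELDS`, `ROOT_CAUSE_CATEGORIES`, and `validate_action` are illustrative names and do not reflect how the repository's Pydantic models perform validation.

```python
# Required fields per action type, transcribed from the tables in this section.
REQUIRED_FIELDS = {
    "read_logs": ["service"],
    "check_metrics": ["service"],
    "ping_service": ["service"],
    "check_dependencies": ["service"],
    "inspect_deploy": ["service"],
    "query_traces": ["service"],
    "check_runbook": ["service"],
    "diff_config": ["service"],
    "restart_service": ["service"],
    "rollback_deploy": ["service", "target_version"],
    "scale_up": ["service", "replicas"],
    "drain_traffic": ["service"],
    "submit_diagnosis": ["root_cause_service", "root_cause_category", "fix_description"],
}

ROOT_CAUSE_CATEGORIES = {
    "oom_crash", "db_deadlock", "bad_deploy", "memory_leak", "network_partition",
    "disk_full", "config_error", "cert_expiry", "dns_failure", "rate_limit",
}


def validate_action(action: dict) -> list:
    """Return a list of problems; an empty list means the action looks well-formed."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        return [f"unknown action_type: {action_type!r}"]
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS[action_type]
        if name not in action
    ]
    category = action.get("root_cause_category")
    if action_type == "submit_diagnosis" and category is not None:
        if category not in ROOT_CAUSE_CATEGORIES:
            problems.append(f"unknown root_cause_category: {category!r}")
    return problems
```

For example, a `rollback_deploy` action without `target_version` yields `["missing field: target_version"]`, while the four example actions above all yield an empty list.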
| ## Observation Space | |
| On `reset()`, the agent receives: | |
| - **Service health dashboard**: all 7 services with status (`HEALTHY`/`DEGRADED`/`DOWN`), version, replica count | |
| - **Active alerts**: severity-tagged alerts (SEV-1/SEV-2/SEV-3) | |
| - **Incident summary**: text description of the situation | |
| On each `step()`, the agent receives: | |
| - **Updated service statuses**: health may change after remediation | |
| - **Updated alerts**: alerts clear when services recover | |
| - **Action result**: the data returned by the action (logs, metrics, traces, etc.) | |
| - **Reward**: per-step reward signal | |
| - **Done flag**: whether the episode has ended | |
| - **Score**: final score (only on terminal step) | |
| ### Progressive Revelation | |
| The agent does NOT see all data upfront. It must actively choose which services to investigate and which data to request. Each investigation action consumes a step, which creates planning pressure: the agent must balance information gathering with remediation within the step budget. | |
| ### Post-Remediation Feedback | |
| After any remediation action, the observation includes a `[POST-REMEDIATION CHECK]` that lists which services are still unhealthy. This is critical for the hard task β after fixing only one of two faults, the check reveals remaining issues. | |
| ## Reward Function | |
| ### Per-Step Shaping | |
| | Action | Reward | | |
| |--------|--------| | |
| | Investigating a root-cause service | +0.01 | | |
| | Investigating a non-root-cause service | 0.00 | | |
| | Correct remediation (matches required fix) | +0.05 | | |
| | Wrong remediation (wrong service or wrong fix type) | -0.05 | | |
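The shaping table can be expressed as a small function. This is a sketch under stated assumptions, not the environment's actual code: `step_reward` is a hypothetical name, required fixes are represented as plain dicts, and a rollback to the wrong version is treated as a wrong remediation because the action table says the version string must match exactly. Whether repeated investigation of the same service keeps earning +0.01 is not specified here.

```python
INVESTIGATION = {
    "read_logs", "check_metrics", "ping_service", "check_dependencies",
    "inspect_deploy", "query_traces", "check_runbook", "diff_config",
}
REMEDIATION = {"restart_service", "rollback_deploy", "scale_up", "drain_traffic"}


def step_reward(action: dict, root_causes: set, required_fixes: list) -> float:
    """Per-step shaping reward for one action, following the table above."""
    kind = action["action_type"]
    if kind in INVESTIGATION:
        return 0.01 if action.get("service") in root_causes else 0.0
    if kind in REMEDIATION:
        # A fix matches when every field the scenario specifies agrees with the action.
        for fix in required_fixes:
            if all(action.get(key) == value for key, value in fix.items()):
                return 0.05
        return -0.05
    return 0.0  # submit_diagnosis is scored by the terminal grader instead
```

With the hard task's ground truth, reading `cache-redis` logs earns +0.01, rolling `payment-service` back to `v3.8.1` earns +0.05, and restarting `auth-service` costs -0.05.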
| ### Terminal Grading (0.0 - 1.0) | |
| The grader is generic and rubric-based. Each task defines its own weights: | |
| | Component | Easy | Medium | Hard | | |
| |-----------|------|--------|------| | |
| | Correct root cause service identified | 0.30 | 0.25 | 0.15 | | |
| | Correct root cause category | 0.20 | 0.20 | 0.10 | | |
| | Primary fix applied | 0.30 | 0.25 | 0.15 | | |
| | Secondary fix(es) applied | -- | -- | 0.20 | | |
| | Diagnosis text quality (keyword match) | 0.10 | 0.10 | 0.15 | | |
| | Investigation thoroughness | 0.10 | 0.10 | 0.10 | | |
| | Wrong remediation penalty | -0.03/ea | -0.05/ea | -0.05/ea | | |
| **Diagnosis text scoring** uses deterministic keyword matching: the grader checks if the fix description mentions key terms (service names, fault types, fix actions). No LLM-based judging. | |
| **Investigation thoroughness** checks whether the agent examined at least one root-cause service before submitting. | |
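To make the rubric arithmetic concrete, the sketch below sums the weighted components and subtracts the per-remediation penalty. The names `WEIGHTS`, `COMPONENTS`, and `terminal_score` are illustrative. Clamping to the 0.0 to 1.0 range and allowing fractional credit per component are assumptions, not confirmed grader behavior.

```python
# Weights transcribed from the rubric table; "--" entries are treated as 0.
WEIGHTS = {
    "easy": {"root_cause_service": 0.30, "root_cause_category": 0.20, "primary_fix": 0.30,
             "secondary_fixes": 0.00, "diagnosis_text": 0.10, "investigation": 0.10,
             "wrong_fix_penalty": 0.03},
    "medium": {"root_cause_service": 0.25, "root_cause_category": 0.20, "primary_fix": 0.25,
               "secondary_fixes": 0.00, "diagnosis_text": 0.10, "investigation": 0.10,
               "wrong_fix_penalty": 0.05},
    "hard": {"root_cause_service": 0.15, "root_cause_category": 0.10, "primary_fix": 0.15,
             "secondary_fixes": 0.20, "diagnosis_text": 0.15, "investigation": 0.10,
             "wrong_fix_penalty": 0.05},
}
COMPONENTS = ["root_cause_service", "root_cause_category", "primary_fix",
              "secondary_fixes", "diagnosis_text", "investigation"]


def terminal_score(task: str, earned: dict, wrong_remediations: int = 0) -> float:
    """Weighted sum of component credit (each in [0, 1]) minus penalties."""
    weights = WEIGHTS[task]
    total = sum(weights[name] * earned.get(name, 0.0) for name in COMPONENTS)
    total -= weights["wrong_fix_penalty"] * wrong_remediations
    # Assumption: the final score is clamped to the documented 0.0-1.0 range.
    return round(min(1.0, max(0.0, total)), 2)
```

Giving full credit on every component reproduces the column sums of the table: 1.00 for easy, 0.90 for medium, and 0.85 for hard.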
| ## Setup | |
| ### Local Development | |
| ```bash | |
| pip install -r requirements.txt | |
| python -m uvicorn server.app:app --host 0.0.0.0 --port 8000 | |
| ``` | |
| ### Docker | |
| ```bash | |
| docker build -t sre-incident-response . | |
| docker run -p 8000:8000 sre-incident-response | |
| ``` | |
| ### API Usage | |
| ```bash | |
| # List available tasks | |
| curl http://localhost:8000/tasks | |
| # Reset (start a new episode) | |
| curl -X POST http://localhost:8000/reset \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"task_id": "easy"}' | |
| # Step (take an action) | |
| curl -X POST http://localhost:8000/step \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"session_id": "<SESSION_ID>", "action": {"action_type": "read_logs", "service": "auth-service"}}' | |
| # Get current episode state | |
| curl http://localhost:8000/state/<SESSION_ID> | |
| ``` | |
| OpenEnv-prefixed endpoints are also available: `/openenv/reset`, `/openenv/step`, `/openenv/state/{session_id}`, `/openenv/tasks`. | |
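The same calls can be made from Python with only the standard library. This is a sketch, not an official client: the helper names are hypothetical, and the assumption that the `/reset` response carries the session identifier under a `session_id` key is not confirmed by this README.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # matches the port used in the curl examples


def build_request(path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a GET request when payload is None, otherwise a JSON POST."""
    if payload is None:
        return urllib.request.Request(BASE_URL + path, method="GET")
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, payload: Optional[dict] = None) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, payload)) as response:
        return json.loads(response.read().decode("utf-8"))


def first_step(task_id: str = "easy") -> dict:
    """Reset an episode and take one investigation step (requires a running server)."""
    reset = call("/reset", {"task_id": task_id})
    # Assumption: the reset response exposes the session under "session_id".
    session_id = reset["session_id"]
    action = {"action_type": "read_logs", "service": "auth-service"}
    return call("/step", {"session_id": session_id, "action": action})
```

Calling `first_step()` against a locally running server mirrors the reset and step curl commands above.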
| ### Running Inference | |
| ```bash | |
| export HF_TOKEN=your_token | |
| export API_BASE_URL=https://router.huggingface.co/v1 | |
| export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct | |
| python inference.py | |
| ``` | |
| The inference script runs a baseline LLM agent against all tasks, emitting structured stdout logs: | |
| ``` | |
| [START] task=easy env=sre_incident_response model=Qwen/Qwen2.5-72B-Instruct | |
| [STEP] step=1 action=read_logs(auth-service) reward=0.01 done=false error=null | |
| [STEP] step=2 action=check_metrics(auth-service) reward=0.01 done=false error=null | |
| [STEP] step=3 action=restart_service(auth-service) reward=0.05 done=false error=null | |
| [STEP] step=4 action=submit_diagnosis reward=1.00 done=true error=null | |
| [END] success=true steps=4 score=1.00 rewards=0.01,0.01,0.05,1.00 | |
| ``` | |
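Because the log lines are structured, they are easy to post-process. The sketch below parses `[STEP]` lines of the shape shown above; the regular expression is inferred from that sample only, and `parse_step` is a hypothetical helper, so other line shapes emitted by `inference.py` may need adjustments.

```python
import re

# Pattern inferred from the sample output above.
STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>.+?) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) error=(?P<error>.+)$"
)


def parse_step(line: str):
    """Parse one [STEP] line into a dict, or return None if it does not match."""
    match = STEP_RE.match(line.strip())
    if match is None:
        return None
    return {
        "step": int(match["step"]),
        "action": match["action"],
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        # The literal string "null" is mapped to None.
        "error": None if match["error"] == "null" else match["error"],
    }
```

Lines that start with `[START]` or `[END]` do not match the pattern and return `None`, so a log can be filtered by passing every line through `parse_step`.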
| ## Baseline Scores | |
| | Task | Expected Score Range | What a Perfect Agent Scores | | |
| |------|---------------------|---------------------------| | |
| | easy | 0.70 - 0.95 | 1.00 | | |
| | medium | 0.40 - 0.75 | 0.90 | | |
| | hard | 0.20 - 0.55 | 0.85 | | |
| ## Environment Variables | |
| | Variable | Description | Default | | |
| |----------|-------------|---------| | |
| | `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` | | |
| | `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` | | |
| | `HF_TOKEN` | HuggingFace API key | Required | | |
| | `PORT` | Server port | `8000` | | |
| | `SRE_TASKS` | Comma-separated task IDs to run in inference | `easy,medium,hard` | | |
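A script consuming these variables might resolve them as follows. This is a sketch that applies the defaults from the table; `load_config` is a hypothetical helper and not necessarily how `inference.py` or the server reads its configuration.

```python
import os


def load_config(env=None) -> dict:
    """Resolve settings from environment variables, applying the documented defaults."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token:
        # HF_TOKEN has no default in the table, so fail early with a clear message.
        raise RuntimeError("HF_TOKEN is required")
    raw_tasks = env.get("SRE_TASKS", "easy,medium,hard")
    return {
        "api_base_url": env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        "model_name": env.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        "hf_token": token,
        "port": int(env.get("PORT", "8000")),
        # Split the comma-separated list and drop empty entries.
        "tasks": [task.strip() for task in raw_tasks.split(",") if task.strip()],
    }
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in tests, for example `load_config({"HF_TOKEN": "x", "SRE_TASKS": "hard"})`.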
| ## OpenEnv Spec Compliance | |
| - `openenv.yaml` with metadata, task definitions, typed models, and runtime config | |
| - `step(action)` returns observation, reward, done, info | |
| - `reset()` returns initial observation | |
| - `state()` returns current episode metadata | |
| - Typed Pydantic models for Action, Observation, and State | |
| - 3 tasks with programmatic graders (easy, medium, hard) | |
| - Scores in 0.0-1.0 range with partial progress signals | |
| - Working Dockerfile for containerized execution | |
| - Baseline inference script (`inference.py`) with reproducible scores | |