---
title: SRE Incident Response
emoji: "\U0001F6A8"
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
- openenv
pinned: false
---
# SRE Incident Response Environment
An OpenEnv-compatible reinforcement learning environment that simulates production incident response. AI agents must investigate microservice architectures, diagnose root causes, and apply fixes, just like a real on-call SRE.
## Motivation
Every tech company has on-call rotations, yet there's no standardized benchmark for evaluating AI agents on incident response. This environment fills that gap by simulating realistic production incidents with:
- **Multi-service architectures** with dependency chains and cascading failures
- **Progressive information revelation**: agents must actively investigate (read logs, check metrics, trace requests)
- **Red herrings and misleading symptoms**: alerts point to symptoms, not root causes
- **Concurrent faults** in the hardest tier, testing whether agents can find multiple independent root causes
- **Realistic operational data**: 50+ log lines per service with noise, time-series metrics, distributed traces, deploy history, runbooks, and config diffs
## Service Architecture
All tasks share the same 7-service microservice architecture:
```
+--------------+
+-------->| auth-service |<------+
| +------+-------+ |
| | depends | depends
+---------+------+ +------v------+ +-----+--------+
| api-gateway | | cache-redis | | notification |
| (entry point) | +-------------+ | -service |
+-+----------+---+ +--------------+
| |
| depends | depends
v v
+------------+ +-----------------+
|user-service| |payment-service |
+-----+------+ +--------+--------+
| depends | depends
v v
+----------------------------+
| db-postgres |
+----------------------------+
```
Each service has: name, status (`HEALTHY`/`DEGRADED`/`DOWN`), version, replica count, dependencies, logs, metrics, traces, deploy history, config, and runbook data.
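
The dependency edges in the diagram can be written down as a plain adjacency map. The sketch below is one reading of the diagram: the `DEPENDS_ON` edge list and the `transitive_dependencies` helper are assumptions for illustration, not code from this repository. It walks the graph to list every service that a given service ultimately relies on, which is the set of candidates to consider when tracing a symptom back to a root cause.

```python
from collections import deque

# One reading of the architecture diagram: service -> services it depends on.
# This edge list is an assumption for illustration, not the repo's data model.
DEPENDS_ON = {
    "api-gateway": ["auth-service", "user-service", "payment-service"],
    "auth-service": ["cache-redis"],
    "notification-service": ["auth-service"],
    "user-service": ["db-postgres"],
    "payment-service": ["db-postgres"],
    "cache-redis": [],
    "db-postgres": [],
}


def transitive_dependencies(service: str) -> list[str]:
    """Return every service reachable from `service`, in breadth-first order."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(DEPENDS_ON[service])
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue
        seen.add(dep)
        order.append(dep)
        queue.extend(DEPENDS_ON[dep])
    return order


print(transitive_dependencies("payment-service"))       # ['db-postgres']
print(transitive_dependencies("notification-service"))  # ['auth-service', 'cache-redis']
```

Under this reading, a degraded `notification-service` has two upstream suspects, while a failing `payment-service` has only `db-postgres` behind it.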
## Tasks
Tasks are auto-discovered from the `tasks/` directory. Each task is a self-contained Python file defining a `SCENARIO` object.
| Task ID | Name | Difficulty | Max Steps | Root Cause | Fix Required |
|---------|------|-----------|-----------|------------|--------------|
| `easy` | Single Service OOM Crash | Easy | 15 | `auth-service` (OOM) | `restart_service(auth-service)` |
| `medium` | Cascading Database Deadlock | Medium | 25 | `db-postgres` (deadlock) | `restart_service(db-postgres)` |
| `hard` | Concurrent Faults + Misleading Evidence | Hard | 35 | `payment-service` (bad deploy) AND `cache-redis` (memory leak) | `rollback_deploy(payment-service, v3.8.1)` AND `restart_service(cache-redis)` |
### Task Details
**Easy**: Alert directly names `auth-service` as down. Logs clearly show OOM crash cycle (heap growth, OOM kills, restart exhaustion). Single root cause, single fix.
**Medium**: Alerts blame `payment-service` and `user-service` (both are victims). The real cause is a long-running analytics query deadlocking `db-postgres`. Agent must notice "writes fail but reads work", follow dependency chain to the database, and read `db-postgres` logs to find the deadlock. Red herring: `cache-redis` miss ratio alert (benign TTL expiry).
**Hard**: Two independent faults at the same time: (1) `payment-service` has a bad deploy (v3.8.2, NullPointerException in new validator module), (2) `cache-redis` has a memory leak causing eviction storms that degrade `auth-service`. Red herrings: `user-service` config warnings (benign), `notification-service` queue backup (victim of auth-service). Agent must find and fix BOTH faults. After fixing only one, post-remediation check shows remaining services are still unhealthy.
### Adding New Tasks
To add a new task:
1. Create a new file in `tasks/` (e.g., `tasks/my_new_task.py`)
2. Define a `SCENARIO = IncidentScenario(task_id="my_new_task", ...)`; see existing task files for the template
3. Done. The task loader in `tasks/__init__.py` auto-discovers any `.py` file that exports a `SCENARIO` object.
No changes are needed to the environment engine, grader, server, or inference script. The grader is generic: it reads ground truth (root cause services, required fixes, keywords, weights) from the scenario definition.
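
To make the auto-discovery idea concrete, here is a minimal sketch of how such a loader could work. It is not the code in `tasks/__init__.py`; the function name `discover_scenarios` and the choice to key the result by `task_id` are assumptions for illustration. It imports every `.py` file in a directory and keeps those that export a `SCENARIO` object.

```python
import importlib.util
from pathlib import Path


def discover_scenarios(tasks_dir: Path) -> dict:
    """Import every .py file in `tasks_dir` and collect its SCENARIO by task_id.

    Hypothetical helper: the real loader may differ in naming and behavior.
    """
    scenarios = {}
    for path in sorted(tasks_dir.glob("*.py")):
        if path.name == "__init__.py":
            continue
        # Load the file as a standalone module without touching sys.path.
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        scenario = getattr(module, "SCENARIO", None)
        if scenario is not None:  # files without a SCENARIO are skipped
            scenarios[scenario.task_id] = scenario
    return scenarios
```

With a loader of this shape, calling `discover_scenarios(Path("tasks"))` after dropping in a new file would pick up the new task without edits elsewhere.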
## Project Structure
```
IncidentResponse_RL/
├── models.py              # Pydantic models: Action, Observation, State, enums
├── openenv.yaml           # OpenEnv manifest (tasks, models, runtime config)
├── requirements.txt       # Python dependencies
├── Dockerfile             # Container for HF Spaces deployment
├── inference.py           # Baseline agent using OpenAI client
├── README.md
│
├── env/                   # Core environment engine
│   ├── __init__.py
│   ├── scenario.py        # IncidentScenario, ServiceConfig, RequiredFix dataclasses
│   ├── environment.py     # step() / reset() / state() implementation
│   └── services.py        # Alert generation, dependency cascade, data formatting
│
├── tasks/                 # Task definitions (auto-discovered)
│   ├── __init__.py        # Auto-discovery loader -> SCENARIOS dict
│   ├── easy_oom.py        # Easy: Single Service OOM Crash
│   ├── medium_deadlock.py # Medium: Cascading Database Deadlock
│   └── hard_concurrent.py # Hard: Concurrent Faults + Misleading Evidence
│
├── graders/               # Scoring engine
│   ├── __init__.py
│   └── grader.py          # Generic rubric-based grader (0.0-1.0)
│
└── server/                # FastAPI web server
    ├── __init__.py
    └── app.py             # /reset, /step, /state, /tasks endpoints
```
## Action Space
All actions are sent as a single JSON object with an `action_type` field. Optional fields depend on the action type.
### Investigation Actions (read-only, gather information)
| Action | Required Fields | Returns |
|--------|----------------|---------|
| `read_logs` | `service` | 50+ timestamped log lines with noise and signal |
| `check_metrics` | `service` | Time-series table (CPU, memory, latency, error rate, etc.) |
| `ping_service` | `service` | Reachability check with latency |
| `check_dependencies` | `service` | Upstream dependency list with current health status |
| `inspect_deploy` | `service` | Deploy history (version, timestamp, status) |
| `query_traces` | `service` | Distributed trace spans showing latency breakdown |
| `check_runbook` | `service` | Operational runbook with troubleshooting steps |
| `diff_config` | `service` | Current vs previous config comparison |
### Remediation Actions (modify environment state)
| Action | Required Fields | Effect |
|--------|----------------|--------|
| `restart_service` | `service` | Restarts pods. Fixes OOM/leak issues. No effect if root cause is elsewhere. |
| `rollback_deploy` | `service`, `target_version` | Rolls back to specified version. Must match exact version string. |
| `scale_up` | `service`, `replicas` | Increases replica count. Can alleviate memory pressure. |
| `drain_traffic` | `service` | Stops routing traffic to the service. |
### Terminal Action
| Action | Required Fields | Effect |
|--------|----------------|--------|
| `submit_diagnosis` | `root_cause_service`, `root_cause_category`, `fix_description` | Ends episode, triggers grading. |
### Root Cause Categories
`oom_crash`, `db_deadlock`, `bad_deploy`, `memory_leak`, `network_partition`, `disk_full`, `config_error`, `cert_expiry`, `dns_failure`, `rate_limit`
### Example Actions
```json
{"action_type": "read_logs", "service": "auth-service"}
{"action_type": "check_metrics", "service": "db-postgres"}
{"action_type": "rollback_deploy", "service": "payment-service", "target_version": "v3.8.1"}
{"action_type": "submit_diagnosis", "root_cause_service": "db-postgres", "root_cause_category": "db_deadlock", "fix_description": "Restarted db-postgres to clear deadlock caused by analytics-cron query"}
```
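
Because the required fields vary by action type, a client may want to check an action before sending it. The sketch below encodes the required fields from the tables above in a plain dictionary. The environment itself uses Pydantic models in `models.py`; the `REQUIRED_FIELDS` mapping and the `missing_fields` helper are a stdlib stand-in written for illustration only.

```python
# Required fields per action type, copied from the action tables above.
REQUIRED_FIELDS = {
    "read_logs": ("service",),
    "check_metrics": ("service",),
    "ping_service": ("service",),
    "check_dependencies": ("service",),
    "inspect_deploy": ("service",),
    "query_traces": ("service",),
    "check_runbook": ("service",),
    "diff_config": ("service",),
    "restart_service": ("service",),
    "rollback_deploy": ("service", "target_version"),
    "scale_up": ("service", "replicas"),
    "drain_traffic": ("service",),
    "submit_diagnosis": ("root_cause_service", "root_cause_category", "fix_description"),
}


def missing_fields(action: dict) -> list[str]:
    """Return required fields absent from `action`; raise on unknown action types."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [name for name in REQUIRED_FIELDS[action_type] if name not in action]


print(missing_fields({"action_type": "read_logs", "service": "auth-service"}))  # []
print(missing_fields({"action_type": "rollback_deploy", "service": "payment-service"}))  # ['target_version']
```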
## Observation Space
On `reset()`, the agent receives:
- **Service health dashboard**: all 7 services with status (`HEALTHY`/`DEGRADED`/`DOWN`), version, replica count
- **Active alerts**: severity-tagged alerts (SEV-1/SEV-2/SEV-3)
- **Incident summary**: text description of the situation
On each `step()`, the agent receives:
- **Updated service statuses**: health may change after remediation
- **Updated alerts**: alerts clear when services recover
- **Action result**: the data returned by the action (logs, metrics, traces, etc.)
- **Reward**: per-step reward signal
- **Done flag**: whether the episode has ended
- **Score**: final score (only on terminal step)
### Progressive Revelation
The agent does NOT see all data upfront. It must actively choose which services to investigate and which data to request. Each investigation action consumes a step, which creates planning pressure: the agent must balance information gathering against remediation within the task's step budget (15, 25, or 35 steps).
### Post-Remediation Feedback
After any remediation action, the observation includes a `[POST-REMEDIATION CHECK]` that lists which services are still unhealthy. This is critical for the hard task: after fixing only one of two faults, the check reveals the remaining issues.
## Reward Function
### Per-Step Shaping
| Action | Reward |
|--------|--------|
| Investigating a root-cause service | +0.01 |
| Investigating a non-root-cause service | 0.00 |
| Correct remediation (matches required fix) | +0.05 |
| Wrong remediation (wrong service or wrong fix type) | -0.05 |
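
The shaping table can be read as a small function of the action kind and whether it hit the right target. The following is a sketch of that reading, not the environment's implementation; the function name `step_reward` and its parameters are hypothetical.

```python
def step_reward(kind: str, targets_root_cause: bool = False,
                matches_required_fix: bool = False) -> float:
    """Per-step shaping reward, following the table above.

    `kind` is "investigation" or "remediation" (labels assumed for this sketch).
    """
    if kind == "investigation":
        return 0.01 if targets_root_cause else 0.0
    if kind == "remediation":
        return 0.05 if matches_required_fix else -0.05
    raise ValueError(f"unsupported action kind: {kind!r}")


# Two investigations of the root-cause service followed by the correct fix.
rewards = [
    step_reward("investigation", targets_root_cause=True),
    step_reward("investigation", targets_root_cause=True),
    step_reward("remediation", matches_required_fix=True),
]
print(round(sum(rewards), 2))  # 0.07
```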
### Terminal Grading (0.0 - 1.0)
The grader is generic and rubric-based. Each task defines its own weights:
| Component | Easy | Medium | Hard |
|-----------|------|--------|------|
| Correct root cause service identified | 0.30 | 0.25 | 0.15 |
| Correct root cause category | 0.20 | 0.20 | 0.10 |
| Primary fix applied | 0.30 | 0.25 | 0.15 |
| Secondary fix(es) applied | -- | -- | 0.20 |
| Diagnosis text quality (keyword match) | 0.10 | 0.10 | 0.15 |
| Investigation thoroughness | 0.10 | 0.10 | 0.10 |
| Wrong remediation penalty | -0.03/ea | -0.05/ea | -0.05/ea |
**Diagnosis text scoring** uses deterministic keyword matching: the grader checks if the fix description mentions key terms (service names, fault types, fix actions). No LLM-based judging.
**Investigation thoroughness** checks whether the agent examined at least one root-cause service before submitting.
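
As a worked example of the rubric, the sketch below sums the weights from the table for each task. It is an illustration, not the code in `graders/grader.py`: the component names, fractional credit per component, and clamping to the 0.0-1.0 range are assumptions. Note that the weights for a flawless run add up to 1.00, 0.90, and 0.85, which matches the perfect-agent column in the Baseline Scores table.

```python
# Rubric weights copied from the table above; key names are hypothetical.
WEIGHTS = {
    "easy": {"service": 0.30, "category": 0.20, "primary_fix": 0.30,
             "secondary_fix": 0.00, "text": 0.10, "investigation": 0.10,
             "penalty": 0.03},
    "medium": {"service": 0.25, "category": 0.20, "primary_fix": 0.25,
               "secondary_fix": 0.00, "text": 0.10, "investigation": 0.10,
               "penalty": 0.05},
    "hard": {"service": 0.15, "category": 0.10, "primary_fix": 0.15,
             "secondary_fix": 0.20, "text": 0.15, "investigation": 0.10,
             "penalty": 0.05},
}
COMPONENTS = ("service", "category", "primary_fix", "secondary_fix",
              "text", "investigation")


def terminal_score(task: str, achieved: dict, wrong_remediations: int = 0) -> float:
    """Weighted sum of achieved components (each in [0, 1]) minus penalties.

    Clamping to [0, 1] is an assumption based on the stated score range.
    """
    weights = WEIGHTS[task]
    raw = sum(weights[name] * achieved.get(name, 0.0) for name in COMPONENTS)
    raw -= weights["penalty"] * wrong_remediations
    return round(min(1.0, max(0.0, raw)), 2)


perfect = {name: 1.0 for name in COMPONENTS}
for task in ("easy", "medium", "hard"):
    print(task, terminal_score(task, perfect))
```

For instance, a hard-task run that gets everything right except the secondary fix, with one wrong remediation along the way, would score 0.85 - 0.20 - 0.05 = 0.60 under these assumptions.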
## Setup
### Local Development
```bash
pip install -r requirements.txt
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```
### Docker
```bash
docker build -t sre-incident-response .
docker run -p 8000:8000 sre-incident-response
```
### API Usage
```bash
# List available tasks
curl http://localhost:8000/tasks
# Reset (start a new episode)
curl -X POST http://localhost:8000/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "easy"}'
# Step (take an action)
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"session_id": "<SESSION_ID>", "action": {"action_type": "read_logs", "service": "auth-service"}}'
# Get current episode state
curl http://localhost:8000/state/<SESSION_ID>
```
OpenEnv-prefixed endpoints are also available: `/openenv/reset`, `/openenv/step`, `/openenv/state/{session_id}`, `/openenv/tasks`.
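
The same calls can be made from Python with only the standard library. This is a minimal sketch under stated assumptions: the request bodies mirror the `curl` examples above, but the response shape is not documented here, so the `session_id` key read from the reset response is a guess. `build_post`, `post_json`, and `run_first_step` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the given endpoint path."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_post(path, payload)) as response:
        return json.load(response)


def run_first_step() -> dict:
    """Reset the easy task and read auth-service logs (needs a running server)."""
    reset = post_json("/reset", {"task_id": "easy"})
    session_id = reset["session_id"]  # assumed response field, not confirmed
    action = {"action_type": "read_logs", "service": "auth-service"}
    return post_json("/step", {"session_id": session_id, "action": action})
```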
### Running Inference
```bash
export HF_TOKEN=your_token
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
python inference.py
```
The inference script runs a baseline LLM agent against all tasks, emitting structured stdout logs:
```
[START] task=easy env=sre_incident_response model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=read_logs(auth-service) reward=0.01 done=false error=null
[STEP] step=2 action=check_metrics(auth-service) reward=0.01 done=false error=null
[STEP] step=3 action=restart_service(auth-service) reward=0.05 done=false error=null
[STEP] step=4 action=submit_diagnosis reward=1.00 done=true error=null
[END] success=true steps=4 score=1.00 rewards=0.01,0.01,0.05,1.00
```
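
Since the log lines are structured, they are easy to post-process. The sketch below parses `[STEP]` lines of the form shown above into dictionaries. The regular expression is derived only from the sample output, so any field the script emits in a different shape would not match; `parse_step` is a hypothetical helper.

```python
import re
from typing import Optional

# Pattern derived from the sample [STEP] lines above.
STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>.+?) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) "
    r"error=(?P<error>.+)$"
)


def parse_step(line: str) -> Optional[dict]:
    """Parse one [STEP] log line; return None if the line does not match."""
    match = STEP_RE.match(line.strip())
    if match is None:
        return None
    error = match["error"]
    return {
        "step": int(match["step"]),
        "action": match["action"],
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        "error": None if error == "null" else error,
    }


line = "[STEP] step=3 action=restart_service(auth-service) reward=0.05 done=false error=null"
print(parse_step(line))
```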
## Baseline Scores
| Task | Expected Score Range | What a Perfect Agent Scores |
|------|---------------------|---------------------------|
| easy | 0.70 - 0.95 | 1.00 |
| medium | 0.40 - 0.75 | 0.90 |
| hard | 0.20 - 0.55 | 0.85 |
## Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
| `HF_TOKEN` | HuggingFace API key | Required |
| `PORT` | Server port | `8000` |
| `SRE_TASKS` | Comma-separated task IDs to run in inference | `easy,medium,hard` |
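
A sketch of how a script might read these variables with the defaults from the table. This is not taken from `inference.py`; `load_settings` and the keys of the returned dictionary are hypothetical.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the variables above, applying the documented defaults."""
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is required")
    raw_tasks = env.get("SRE_TASKS", "easy,medium,hard")
    return {
        "api_base_url": env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        "model_name": env.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        "hf_token": token,
        "port": int(env.get("PORT", "8000")),
        # Split the comma-separated list and drop empty entries.
        "tasks": [t.strip() for t in raw_tasks.split(",") if t.strip()],
    }


print(load_settings({"HF_TOKEN": "your_token", "SRE_TASKS": "easy, hard"})["tasks"])  # ['easy', 'hard']
```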
## OpenEnv Spec Compliance
- `openenv.yaml` with metadata, task definitions, typed models, and runtime config
- `step(action)` returns observation, reward, done, info
- `reset()` returns initial observation
- `state()` returns current episode metadata
- Typed Pydantic models for Action, Observation, and State
- 3 tasks with programmatic graders (easy, medium, hard)
- Scores in 0.0-1.0 range with partial progress signals
- Working Dockerfile for containerized execution
- Baseline inference script (`inference.py`) with reproducible scores