---
title: SRE Incident Response
emoji: "\U0001F6A8"
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
  - openenv
pinned: false
---

# SRE Incident Response Environment

An OpenEnv-compatible reinforcement learning environment that simulates production incident response. AI agents must investigate microservice architectures, diagnose root causes, and apply fixes, just like a real on-call SRE.

## Motivation

Every tech company has on-call rotations, yet there's no standardized benchmark for evaluating AI agents on incident response. This environment fills that gap by simulating realistic production incidents with:

- **Multi-service architectures** with dependency chains and cascading failures
- **Progressive information revelation**: agents must actively investigate (read logs, check metrics, trace requests)
- **Red herrings and misleading symptoms**: alerts point to symptoms, not root causes
- **Concurrent faults** in the hardest tier, testing whether agents can find multiple independent root causes
- **Realistic operational data**: 50+ log lines per service with noise, time-series metrics, distributed traces, deploy history, runbooks, and config diffs

## Service Architecture

All tasks share the same 7-service microservice architecture:

```
                    +--------------+
          +-------->| auth-service |<------+
          |         +------+-------+       |
          |                | depends       | depends
+---------+------+  +------v------+  +-----+--------+
|  api-gateway   |  | cache-redis |  | notification |
|  (entry point) |  +-------------+  |   -service   |
+-+----------+---+                   +--------------+
  |          |
  | depends  | depends
  v          v
+------------+  +-----------------+
|user-service|  |payment-service  |
+-----+------+  +--------+--------+
      | depends          | depends
      v                  v
+----------------------------+
|        db-postgres         |
+----------------------------+
```

Each service has: name, status (`HEALTHY`/`DEGRADED`/`DOWN`), version, replica count, dependencies, logs, metrics, traces, deploy history, config, and runbook data.
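
The dependency edges in the diagram can be written down as a plain mapping, which makes cascading failures easier to reason about. The sketch below is illustrative only: the edge list is read off the diagram above, and `DEPENDS_ON` and `impacted_by` are hypothetical names, not part of the environment's code.

```python
from collections import deque

# Edges read off the diagram above: service -> services it depends on.
# Assumption: this mirrors the diagram, not the environment's internal data.
DEPENDS_ON = {
    "api-gateway": ["auth-service", "user-service", "payment-service"],
    "auth-service": ["cache-redis"],
    "notification-service": ["auth-service"],
    "user-service": ["db-postgres"],
    "payment-service": ["db-postgres"],
    "cache-redis": [],
    "db-postgres": [],
}


def impacted_by(failed: str) -> set[str]:
    """Return every service that transitively depends on `failed`."""
    impacted: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service, deps in DEPENDS_ON.items():
            if current in deps and service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


# Services that could show symptoms when the database is the root cause.
print(sorted(impacted_by("db-postgres")))
```

Under these assumed edges, a `db-postgres` fault reaches `user-service`, `payment-service`, and `api-gateway`, which is the pattern the medium task relies on.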

## Tasks

Tasks are auto-discovered from the `tasks/` directory. Each task is a self-contained Python file defining a `SCENARIO` object.

| Task ID | Name | Difficulty | Max Steps | Root Cause | Fix Required |
|---------|------|-----------|-----------|------------|--------------|
| `easy` | Single Service OOM Crash | Easy | 15 | `auth-service` (OOM) | `restart_service(auth-service)` |
| `medium` | Cascading Database Deadlock | Medium | 25 | `db-postgres` (deadlock) | `restart_service(db-postgres)` |
| `hard` | Concurrent Faults + Misleading Evidence | Hard | 35 | `payment-service` (bad deploy) AND `cache-redis` (memory leak) | `rollback_deploy(payment-service, v3.8.1)` AND `restart_service(cache-redis)` |

### Task Details

**Easy**: The alert directly names `auth-service` as down. Logs clearly show an OOM crash cycle (heap growth, OOM kills, restart exhaustion). Single root cause, single fix.

**Medium**: Alerts blame `payment-service` and `user-service` (both are victims). The real cause is a long-running analytics query deadlocking `db-postgres`. The agent must notice that writes fail while reads work, follow the dependency chain to the database, and read the `db-postgres` logs to find the deadlock. Red herring: a `cache-redis` miss-ratio alert (benign TTL expiry).

**Hard**: Two independent faults occur at the same time: (1) `payment-service` has a bad deploy (v3.8.2, NullPointerException in the new validator module), and (2) `cache-redis` has a memory leak causing eviction storms that degrade `auth-service`. Red herrings: `user-service` config warnings (benign) and a `notification-service` queue backup (a victim of `auth-service`). The agent must find and fix BOTH faults. After fixing only one, the post-remediation check shows that the remaining services are still unhealthy.

### Adding New Tasks

To add a new task:

1. Create a new file in `tasks/` (e.g., `tasks/my_new_task.py`)
2. Define a `SCENARIO = IncidentScenario(task_id="my_new_task", ...)` object; see existing task files for the template
3. Done. The task loader in `tasks/__init__.py` auto-discovers any `.py` file that exports a `SCENARIO` object.

No changes are needed to the environment engine, grader, server, or inference script. The grader is generic: it reads ground truth (root cause services, required fixes, keywords, weights) from the scenario definition.
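
As a rough illustration of how this kind of auto-discovery can work, the sketch below scans a directory for `.py` files and collects any module-level `SCENARIO` object. It is not the code in `tasks/__init__.py`; `discover_scenarios` is a hypothetical helper, and it assumes each scenario exposes a `task_id` attribute as in step 2.

```python
import importlib.util
from pathlib import Path


def discover_scenarios(tasks_dir: str) -> dict:
    """Load every .py file in tasks_dir that defines a SCENARIO object."""
    scenarios = {}
    for path in sorted(Path(tasks_dir).glob("*.py")):
        if path.name == "__init__.py":
            continue  # the loader itself is not a task
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        scenario = getattr(module, "SCENARIO", None)
        if scenario is not None:
            # Assumption: scenarios carry a task_id, as shown in step 2.
            scenarios[scenario.task_id] = scenario
    return scenarios
```

Files that do not export `SCENARIO` are skipped silently in this sketch, so helper modules can live alongside task files.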

## Project Structure

```
IncidentResponse_RL/
├── models.py                  # Pydantic models: Action, Observation, State, enums
├── openenv.yaml               # OpenEnv manifest (tasks, models, runtime config)
├── requirements.txt           # Python dependencies
├── Dockerfile                 # Container for HF Spaces deployment
├── inference.py               # Baseline agent using OpenAI client
├── README.md
│
├── env/                       # Core environment engine
│   ├── __init__.py
│   ├── scenario.py            # IncidentScenario, ServiceConfig, RequiredFix dataclasses
│   ├── environment.py         # step() / reset() / state() implementation
│   └── services.py            # Alert generation, dependency cascade, data formatting
│
├── tasks/                     # Task definitions (auto-discovered)
│   ├── __init__.py            # Auto-discovery loader → SCENARIOS dict
│   ├── easy_oom.py            # Easy: Single Service OOM Crash
│   ├── medium_deadlock.py     # Medium: Cascading Database Deadlock
│   └── hard_concurrent.py     # Hard: Concurrent Faults + Misleading Evidence
│
├── graders/                   # Scoring engine
│   ├── __init__.py
│   └── grader.py              # Generic rubric-based grader (0.0-1.0)
│
└── server/                    # FastAPI web server
    ├── __init__.py
    └── app.py                 # /reset, /step, /state, /tasks endpoints
```

## Action Space

All actions are sent as a single JSON object with an `action_type` field. Optional fields depend on the action type.

### Investigation Actions (read-only, gather information)

| Action | Required Fields | Returns |
|--------|----------------|---------|
| `read_logs` | `service` | 50+ timestamped log lines with noise and signal |
| `check_metrics` | `service` | Time-series table (CPU, memory, latency, error rate, etc.) |
| `ping_service` | `service` | Reachability check with latency |
| `check_dependencies` | `service` | Upstream dependency list with current health status |
| `inspect_deploy` | `service` | Deploy history (version, timestamp, status) |
| `query_traces` | `service` | Distributed trace spans showing latency breakdown |
| `check_runbook` | `service` | Operational runbook with troubleshooting steps |
| `diff_config` | `service` | Current vs previous config comparison |

### Remediation Actions (modify environment state)

| Action | Required Fields | Effect |
|--------|----------------|--------|
| `restart_service` | `service` | Restarts pods. Fixes OOM/leak issues. No effect if root cause is elsewhere. |
| `rollback_deploy` | `service`, `target_version` | Rolls back to specified version. Must match exact version string. |
| `scale_up` | `service`, `replicas` | Increases replica count. Can alleviate memory pressure. |
| `drain_traffic` | `service` | Stops routing traffic to the service. |

### Terminal Action

| Action | Required Fields | Effect |
|--------|----------------|--------|
| `submit_diagnosis` | `root_cause_service`, `root_cause_category`, `fix_description` | Ends episode, triggers grading. |

### Root Cause Categories

`oom_crash`, `db_deadlock`, `bad_deploy`, `memory_leak`, `network_partition`, `disk_full`, `config_error`, `cert_expiry`, `dns_failure`, `rate_limit`

### Example Actions

```json
{"action_type": "read_logs", "service": "auth-service"}
{"action_type": "check_metrics", "service": "db-postgres"}
{"action_type": "rollback_deploy", "service": "payment-service", "target_version": "v3.8.1"}
{"action_type": "submit_diagnosis", "root_cause_service": "db-postgres", "root_cause_category": "db_deadlock", "fix_description": "Restarted db-postgres to clear deadlock caused by analytics-cron query"}
```
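
Before sending an action, a client may want to check that the required fields are present. The following is a minimal sketch of such a check using only the field requirements listed in the tables above; `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and the real environment validates actions with its own Pydantic models in `models.py`.

```python
import json

# Required fields per action type, copied from the action tables above.
REQUIRED_FIELDS = {
    "read_logs": ["service"],
    "check_metrics": ["service"],
    "ping_service": ["service"],
    "check_dependencies": ["service"],
    "inspect_deploy": ["service"],
    "query_traces": ["service"],
    "check_runbook": ["service"],
    "diff_config": ["service"],
    "restart_service": ["service"],
    "rollback_deploy": ["service", "target_version"],
    "scale_up": ["service", "replicas"],
    "drain_traffic": ["service"],
    "submit_diagnosis": ["root_cause_service", "root_cause_category", "fix_description"],
}


def missing_fields(action: dict) -> list[str]:
    """Return the required fields that are absent from an action dict."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [name for name in REQUIRED_FIELDS[action_type] if name not in action]


raw = '{"action_type": "rollback_deploy", "service": "payment-service"}'
print(missing_fields(json.loads(raw)))  # ['target_version']
```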

## Observation Space

On `reset()`, the agent receives:
- **Service health dashboard**: all 7 services with status (`HEALTHY`/`DEGRADED`/`DOWN`), version, replica count
- **Active alerts**: severity-tagged alerts (SEV-1/SEV-2/SEV-3)
- **Incident summary**: text description of the situation

On each `step()`, the agent receives:
- **Updated service statuses**: health may change after remediation
- **Updated alerts**: alerts clear when services recover
- **Action result**: the data returned by the action (logs, metrics, traces, etc.)
- **Reward**: per-step reward signal
- **Done flag**: whether the episode has ended
- **Score**: final score (only on terminal step)

### Progressive Revelation

The agent does NOT see all data upfront. It must actively choose which services to investigate and which data to request. Each investigation action consumes a step, which creates planning pressure: the agent must balance information gathering against remediation within the step budget.

### Post-Remediation Feedback

After any remediation action, the observation includes a `[POST-REMEDIATION CHECK]` that lists which services are still unhealthy. This is critical for the hard task: after fixing only one of two faults, the check reveals remaining issues.

## Reward Function

### Per-Step Shaping

| Action | Reward |
|--------|--------|
| Investigating a root-cause service | +0.01 |
| Investigating a non-root-cause service | 0.00 |
| Correct remediation (matches required fix) | +0.05 |
| Wrong remediation (wrong service or wrong fix type) | -0.05 |
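
The shaping table can be read as a small lookup. The sketch below expresses it as a function under simplifying assumptions: `step_reward` is a hypothetical name, required fixes are modelled as `(action_type, service)` pairs, and the exact-version check for `rollback_deploy` is left out.

```python
INVESTIGATION = {
    "read_logs", "check_metrics", "ping_service", "check_dependencies",
    "inspect_deploy", "query_traces", "check_runbook", "diff_config",
}
REMEDIATION = {"restart_service", "rollback_deploy", "scale_up", "drain_traffic"}


def step_reward(
    action_type: str,
    service: str,
    root_causes: set[str],
    required_fixes: set[tuple[str, str]],
) -> float:
    """Per-step shaping reward, following the table above."""
    if action_type in INVESTIGATION:
        return 0.01 if service in root_causes else 0.0
    if action_type in REMEDIATION:
        # Assumption: a fix matches when both action type and service match.
        return 0.05 if (action_type, service) in required_fixes else -0.05
    return 0.0  # submit_diagnosis is scored by the terminal grader instead


# Ground truth for the hard task, taken from the task table.
hard_roots = {"payment-service", "cache-redis"}
hard_fixes = {("rollback_deploy", "payment-service"), ("restart_service", "cache-redis")}

print(step_reward("read_logs", "cache-redis", hard_roots, hard_fixes))         # 0.01
print(step_reward("restart_service", "auth-service", hard_roots, hard_fixes))  # -0.05
```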

### Terminal Grading (0.0 - 1.0)

The grader is generic and rubric-based. Each task defines its own weights:

| Component | Easy | Medium | Hard |
|-----------|------|--------|------|
| Correct root cause service identified | 0.30 | 0.25 | 0.15 |
| Correct root cause category | 0.20 | 0.20 | 0.10 |
| Primary fix applied | 0.30 | 0.25 | 0.15 |
| Secondary fix(es) applied | -- | -- | 0.20 |
| Diagnosis text quality (keyword match) | 0.10 | 0.10 | 0.15 |
| Investigation thoroughness | 0.10 | 0.10 | 0.10 |
| Wrong remediation penalty | -0.03/ea | -0.05/ea | -0.05/ea |

**Diagnosis text scoring** uses deterministic keyword matching: the grader checks whether the fix description mentions key terms (service names, fault types, fix actions). No LLM-based judging.
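
One simple way to implement deterministic keyword matching is a case-insensitive substring check. The sketch below assumes credit is proportional to the fraction of keywords found; the actual grader in `graders/grader.py` may award credit differently, and both `keyword_score` and the keyword list are made up for illustration.

```python
def keyword_score(fix_description: str, keywords: list[str], weight: float) -> float:
    """Award a share of `weight` proportional to the keywords mentioned."""
    if not keywords:
        return 0.0
    text = fix_description.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    return weight * (hits / len(keywords))


# Hypothetical keywords for the medium task; the description is the
# submit_diagnosis example from the Example Actions section.
description = "Restarted db-postgres to clear deadlock caused by analytics-cron query"
print(keyword_score(description, ["db-postgres", "deadlock", "restart"], weight=0.10))
```

Because the check is a plain substring match, the same input always produces the same score, which keeps results reproducible across runs.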

**Investigation thoroughness** checks whether the agent examined at least one root-cause service before submitting.
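
Summing each column of the weight table gives the maximum achievable score per task: 1.00 for easy, 0.90 for medium, and 0.85 for hard. The sketch below reproduces that arithmetic. `terminal_score` is a hypothetical helper, and clamping the result to the 0.0-1.0 range and rounding to two decimals are assumptions rather than documented grader behaviour.

```python
# Component weights copied from the table above.
WEIGHTS = {
    "easy": {"root_cause_service": 0.30, "root_cause_category": 0.20,
             "primary_fix": 0.30, "diagnosis_text": 0.10, "thoroughness": 0.10},
    "medium": {"root_cause_service": 0.25, "root_cause_category": 0.20,
               "primary_fix": 0.25, "diagnosis_text": 0.10, "thoroughness": 0.10},
    "hard": {"root_cause_service": 0.15, "root_cause_category": 0.10,
             "primary_fix": 0.15, "secondary_fixes": 0.20,
             "diagnosis_text": 0.15, "thoroughness": 0.10},
}
WRONG_REMEDIATION_PENALTY = {"easy": 0.03, "medium": 0.05, "hard": 0.05}


def terminal_score(task: str, earned: dict[str, float], wrong_remediations: int = 0) -> float:
    """Combine per-component fractions (0.0-1.0) into a final score."""
    raw = sum(weight * earned.get(name, 0.0) for name, weight in WEIGHTS[task].items())
    raw -= WRONG_REMEDIATION_PENALTY[task] * wrong_remediations
    # Assumption: the final score is clamped to [0, 1] and rounded.
    return round(min(1.0, max(0.0, raw)), 2)


for task, weights in WEIGHTS.items():
    perfect = {name: 1.0 for name in weights}
    print(task, terminal_score(task, perfect))
```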

## Setup

### Local Development

```bash
pip install -r requirements.txt
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Docker

```bash
docker build -t sre-incident-response .
docker run -p 8000:8000 sre-incident-response
```

### API Usage

```bash
# List available tasks
curl http://localhost:8000/tasks

# Reset (start a new episode)
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "easy"}'

# Step (take an action)
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"session_id": "<SESSION_ID>", "action": {"action_type": "read_logs", "service": "auth-service"}}'

# Get current episode state
curl http://localhost:8000/state/<SESSION_ID>
```

OpenEnv-prefixed endpoints are also available: `/openenv/reset`, `/openenv/step`, `/openenv/state/{session_id}`, `/openenv/tasks`.
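
The same calls can be made from Python with the standard library. The sketch below mirrors the `curl` examples above; `build_request` and `send` are hypothetical helpers. The response schema is not documented here, so the sketch does not assume which field carries the session ID.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the local setup above


def build_request(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a JSON POST request for one of the environment endpoints."""
    return urllib.request.Request(
        url=f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Same payloads as the curl examples; nothing is sent until send() is called.
reset_request = build_request("/reset", {"task_id": "easy"})
step_request = build_request(
    "/step",
    {
        "session_id": "<SESSION_ID>",
        "action": {"action_type": "read_logs", "service": "auth-service"},
    },
)
```

With the server running, `send(reset_request)` starts an episode; replace `<SESSION_ID>` with the value from the reset response before sending `step_request`.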

### Running Inference

```bash
export HF_TOKEN=your_token
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct

python inference.py
```

The inference script runs a baseline LLM agent against all tasks, emitting structured stdout logs:

```
[START] task=easy env=sre_incident_response model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=read_logs(auth-service) reward=0.01 done=false error=null
[STEP] step=2 action=check_metrics(auth-service) reward=0.01 done=false error=null
[STEP] step=3 action=restart_service(auth-service) reward=0.05 done=false error=null
[STEP] step=4 action=submit_diagnosis reward=1.00 done=true error=null
[END] success=true steps=4 score=1.00 rewards=0.01,0.01,0.05,1.00
```
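
Because the log lines follow a fixed `key=value` layout, they are straightforward to parse for later analysis. The sketch below handles `[STEP]` lines only and infers the format from the sample output above; `parse_step` is a hypothetical helper, not part of `inference.py`.

```python
import re
from typing import Optional

# Pattern inferred from the sample [STEP] lines above.
STEP_PATTERN = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>.+?) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) "
    r"error=(?P<error>.*)$"
)


def parse_step(line: str) -> Optional[dict]:
    """Parse one [STEP] log line into a dict, or return None if it does not match."""
    match = STEP_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        "step": int(match["step"]),
        "action": match["action"],
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        "error": None if match["error"] == "null" else match["error"],
    }


line = "[STEP] step=1 action=read_logs(auth-service) reward=0.01 done=false error=null"
print(parse_step(line))
```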

## Baseline Scores

| Task | Expected Score Range | What a Perfect Agent Scores |
|------|---------------------|---------------------------|
| easy | 0.70 - 0.95 | 1.00 |
| medium | 0.40 - 0.75 | 0.90 |
| hard | 0.20 - 0.55 | 0.85 |

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
| `HF_TOKEN` | HuggingFace API key | Required |
| `PORT` | Server port | `8000` |
| `SRE_TASKS` | Comma-separated task IDs to run in inference | `easy,medium,hard` |
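
A script that consumes these variables might read them as follows. This is a minimal sketch using the defaults from the table; `load_settings` is a hypothetical helper, and `inference.py` may read its configuration differently.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read configuration from environment variables, applying the documented defaults."""
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is required")
    raw_tasks = env.get("SRE_TASKS", "easy,medium,hard")
    return {
        "api_base_url": env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        "model_name": env.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        "hf_token": token,
        "port": int(env.get("PORT", "8000")),
        # Split the comma-separated list and drop empty entries.
        "tasks": [task.strip() for task in raw_tasks.split(",") if task.strip()],
    }
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in tests, for example `load_settings({"HF_TOKEN": "x", "SRE_TASKS": "easy"})`.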

## OpenEnv Spec Compliance

- `openenv.yaml` with metadata, task definitions, typed models, and runtime config
- `step(action)` returns observation, reward, done, info
- `reset()` returns initial observation
- `state()` returns current episode metadata
- Typed Pydantic models for Action, Observation, and State
- 3 tasks with programmatic graders (easy, medium, hard)
- Scores in 0.0-1.0 range with partial progress signals
- Working Dockerfile for containerized execution
- Baseline inference script (`inference.py`) with reproducible scores