---
title: SRE Incident Investigation Environment
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
- openenv
- reinforcement-learning
- agent
- evaluation
pinned: false
base_path: /web
---
# SRE Incident Investigation Environment
[OpenEnv](https://github.com/meta-pytorch/OpenEnv)
A production-grade OpenEnv environment where an AI agent acts as an on-call **Site Reliability Engineer**: querying logs, metrics, and alerts to diagnose real-world system failures, then submitting a structured incident report graded by a deterministic rubric.
## Why This Exists
Every company running cloud infrastructure deals with production incidents daily. Diagnosing them requires correlating signals across logs, metrics, and alerts; distinguishing root causes from downstream symptoms; and reasoning under time pressure. This is a genuine capability gap for current LLMs. No existing RL benchmark tests it.
## Action Space
```python
from typing import List, Literal, Optional

class SREAction(Action):
    action_type: Literal["query_logs", "query_metrics", "query_alerts", "annotate", "submit"]
    service: Optional[str]                  # filter logs by service
    log_level: Optional[str]                # DEBUG|INFO|WARN|ERROR|FATAL
    time_window_minutes: Optional[int]      # default 30, max 120
    log_query: Optional[str]                # keyword search
    metric_name: Optional[str]              # error_rate|latency_p99|latency_p50|
                                            # cpu_usage|memory_usage|db_connections|
                                            # request_rate|cache_hit_rate
    note: Optional[str]                     # annotation text
    root_cause_service: Optional[str]       # submit: service name
    root_cause_type: Optional[str]          # submit: failure category
    affected_services: Optional[List[str]]  # submit: blast radius
    severity: Optional[str]                 # submit: P1|P2|P3|P4
    recommended_action: Optional[str]       # submit: remediation text
    confidence: Optional[float]             # submit: 0.0-1.0
```
## Observation Space
```python
from typing import Dict, List, Optional

class SREObservation(Observation):
    action_taken: str
    logs: List[Dict]                 # [{timestamp, service, level, message}]
    metrics: List[Dict]              # [{timestamp, value}]
    metric_name: Optional[str]
    alerts: List[Dict]               # [{alert_name, service, severity, fired_at, message, status}]
    annotation_accepted: bool
    grader_score: Optional[float]    # 0.0-1.0, set after submit
    grader_breakdown: Optional[Dict]
    message: str
    queries_remaining: int           # budget: 12 per episode
    done: bool
    reward: float
```
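To illustrate working with the observation payload, here is a small hypothetical triage helper (not part of the client API) that scans an observation's `logs` list for each service's earliest ERROR/FATAL entry:

```python
# Hypothetical helper, for illustration only: given the `logs` list from an
# SREObservation, return each service's earliest ERROR/FATAL entry.
# Assumes ISO-8601 timestamps, which sort correctly as plain strings.
def first_errors(logs):
    out = {}
    for entry in sorted(logs, key=lambda e: e["timestamp"]):
        if entry["level"] in ("ERROR", "FATAL") and entry["service"] not in out:
            out[entry["service"]] = entry
    return out
```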
## Tasks
| ID | Difficulty | Title | Root Cause |
|---|---|---|---|
| `sre-easy-001` | Easy | Checkout Failures → Payment Service Crashing | payment-service OOM crash |
| `sre-medium-002` | Medium | Order Outage → DB Connection Pool Exhaustion | analytics-service holding all DB connections |
| `sre-hard-003` | Hard | Silent Revenue Corruption | Feature flag changes product ID format, breaking cart pricing silently |
## Grader (Deterministic, No LLM Judge)
| Criterion | Weight | Method |
|---|---|---|
| `root_cause_service` | 0.35 | Exact match |
| `root_cause_type` | 0.25 | Exact match |
| `affected_services` | 0.15 | F1 score |
| `severity` | 0.10 | Exact = 1.0, adjacent = 0.5 |
| `recommended_action` | 0.15 | Keyword recall |
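The weights and methods above can be sketched as a weighted sum. This is an illustrative reconstruction, not the environment's actual grader: field names, the ground-truth structure, and the keyword list are assumptions.

```python
# Illustrative sketch of the deterministic rubric described in the table above.
# Weights come from the table; all other details are assumptions.
SEVERITY_ORDER = ["P1", "P2", "P3", "P4"]

def f1(predicted, expected):
    """F1 score between predicted and expected sets of service names."""
    p, e = set(predicted), set(expected)
    tp = len(p & e)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(e)
    return 2 * precision * recall / (precision + recall)

def severity_score(predicted, expected):
    """Exact match scores 1.0; an adjacent severity level scores 0.5."""
    if predicted == expected:
        return 1.0
    if predicted in SEVERITY_ORDER and expected in SEVERITY_ORDER:
        if abs(SEVERITY_ORDER.index(predicted) - SEVERITY_ORDER.index(expected)) == 1:
            return 0.5
    return 0.0

def keyword_recall(text, keywords):
    """Fraction of expected remediation keywords present in the free text."""
    text = text.lower()
    return sum(k.lower() in text for k in keywords) / len(keywords)

def grade(submission, truth):
    return (
        0.35 * (submission["root_cause_service"] == truth["root_cause_service"])
        + 0.25 * (submission["root_cause_type"] == truth["root_cause_type"])
        + 0.15 * f1(submission["affected_services"], truth["affected_services"])
        + 0.10 * severity_score(submission["severity"], truth["severity"])
        + 0.15 * keyword_recall(submission["recommended_action"], truth["keywords"])
    )
```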
## Reward Shaping
| Event | Reward |
|---|---|
| Successful query | +0.02 |
| Annotation | +0.01 |
| Duplicate query | -0.05 |
| Submit | grader score (0.0-1.0) |
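The shaping table reduces to a few branches. A minimal sketch, assuming the environment tracks duplicates and passes the grader score at submit time (the function and its arguments are illustrative, not the environment's actual code):

```python
from typing import Optional

# Illustrative per-step reward logic implied by the table above.
def step_reward(action_type: str, is_duplicate: bool = False,
                grader_score: Optional[float] = None) -> float:
    if action_type == "submit":
        return grader_score or 0.0   # terminal reward: rubric score in [0, 1]
    if is_duplicate:
        return -0.05                 # discourage repeating a query
    if action_type == "annotate":
        return 0.01
    return 0.02                      # successful log/metric/alert query
```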
## Baseline Scores (gpt-4o-mini)
| Task | Score |
|---|---|
| Easy | 0.87 |
| Medium | 0.62 |
| Hard | 0.28 |
| **Average** | **0.59** |
## Setup
```bash
# Local
pip install openenv-core uvicorn fastapi
uvicorn server.app:app --port 8000
# Docker
docker build -t sre-env .
docker run -d -p 8000:8000 sre-env
# Inference
export OPENAI_API_KEY=sk-...
export ENV_BASE_URL=http://localhost:8000
python inference.py --all-tasks
```
## Quick Start
```python
from client import SREEnvClient
from models import SREAction

# Sync usage (simplest)
with SREEnvClient(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task_id="sre-easy-001")
    result = env.step(SREAction(action_type="query_alerts"))
    result = env.step(SREAction(action_type="query_logs",
                                service="payment-service", log_level="ERROR",
                                time_window_minutes=60))
    result = env.step(SREAction(action_type="query_metrics",
                                metric_name="memory_usage"))
    result = env.step(SREAction(
        action_type="submit",
        root_cause_service="payment-service",
        root_cause_type="resource_exhaustion",
        affected_services=["payment-service", "api-gateway", "order-service"],
        severity="P2",
        recommended_action="Increase JVM heap memory limit to prevent OOM kills",
        confidence=0.95,
    ))
    print(f"Score: {result.observation.grader_score:.4f}")

# Async usage (for training loops)
import asyncio

async def main():
    async with SREEnvClient(base_url="http://localhost:8000") as env:
        result = await env.reset_async(task_id="sre-easy-001")
        result = await env.step_async(SREAction(action_type="query_alerts"))
        result = await env.step_async(SREAction(
            action_type="submit",
            root_cause_service="payment-service",
            root_cause_type="resource_exhaustion",
            affected_services=["payment-service", "api-gateway", "order-service"],
            severity="P2",
            recommended_action="Increase JVM heap memory limit",
            confidence=0.95,
        ))
        print(f"Score: {result.observation.grader_score:.4f}")

asyncio.run(main())
```
## API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Start episode (`task_id` or `difficulty`) |
| `/step` | POST | Execute action |
| `/state` | GET | Current state |
| `/schema` | GET | JSON schemas |
| `/ws` | WebSocket | Persistent session for training |
| `/web` | GET | Interactive web UI |
## Project Structure
```
sre_env/
├── models.py            # Pydantic models
├── client.py            # WebSocket client
├── inference.py         # Baseline agent (OpenAI client)
├── openenv.yaml         # Spec manifest
├── pyproject.toml
├── Dockerfile
├── tasks/
│   └── scenarios.py     # 3 tasks + graders
└── server/
    ├── app.py           # FastAPI server
    └── sre_environment.py
```