---
title: SRE Incident Investigation Environment
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
- openenv
- reinforcement-learning
- agent
- evaluation
pinned: false
base_path: /web
---

# SRE Incident Investigation Environment

[](https://github.com/meta-pytorch/OpenEnv)

A production-grade OpenEnv environment where an AI agent acts as an on-call **Site Reliability Engineer**: it queries logs, metrics, and alerts to diagnose realistic system failures, then submits a structured incident report graded by a deterministic rubric.

## Why This Exists

Every company running cloud infrastructure deals with production incidents daily. Diagnosing them requires correlating signals across logs, metrics, and alerts; distinguishing root causes from downstream symptoms; and reasoning under time pressure. This is a genuine capability gap for current LLMs, and no existing RL benchmark tests it.

## Action Space

```python
from typing import List, Literal, Optional

class SREAction(Action):  # subclasses the OpenEnv Action base
    action_type: Literal["query_logs", "query_metrics", "query_alerts", "annotate", "submit"]
    service: Optional[str]                  # filter logs by service
    log_level: Optional[str]                # DEBUG|INFO|WARN|ERROR|FATAL
    time_window_minutes: Optional[int]      # default 30, max 120
    log_query: Optional[str]                # keyword search
    metric_name: Optional[str]              # error_rate|latency_p99|latency_p50|
                                            # cpu_usage|memory_usage|db_connections|
                                            # request_rate|cache_hit_rate
    note: Optional[str]                     # annotation text
    root_cause_service: Optional[str]       # submit: service name
    root_cause_type: Optional[str]          # submit: failure category
    affected_services: Optional[List[str]]  # submit: blast radius
    severity: Optional[str]                 # submit: P1|P2|P3|P4
    recommended_action: Optional[str]       # submit: remediation text
    confidence: Optional[float]             # submit: 0.0-1.0
```

## Observation Space

```python
from typing import Dict, List, Optional

class SREObservation(Observation):
    action_taken: str
    logs: List[Dict]                 # [{timestamp, service, level, message}]
    metrics: List[Dict]              # [{timestamp, value}]
    metric_name: Optional[str]
    alerts: List[Dict]               # [{alert_name, service, severity, fired_at, message, status}]
    annotation_accepted: bool
    grader_score: Optional[float]    # 0.0-1.0, set after submit
    grader_breakdown: Optional[Dict]
    message: str
    queries_remaining: int           # budget: 12 per episode
    done: bool
    reward: float
```

## Tasks

| ID | Difficulty | Title | Root Cause |
|---|---|---|---|
| `sre-easy-001` | Easy | Checkout Failures → Payment Service Crashing | payment-service OOM crash |
| `sre-medium-002` | Medium | Order Outage → DB Connection Pool Exhaustion | analytics-service holding all DB connections |
| `sre-hard-003` | Hard | Silent Revenue Corruption | Feature flag changes product ID format, silently breaking cart pricing |

## Grader (Deterministic, No LLM Judge)

| Criterion | Weight | Method |
|---|---|---|
| `root_cause_service` | 0.35 | Exact match |
| `root_cause_type` | 0.25 | Exact match |
| `affected_services` | 0.15 | F1 score |
| `severity` | 0.10 | Exact = 1.0, adjacent = 0.5 |
| `recommended_action` | 0.15 | Keyword recall |
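
The final score is the weighted sum of the five criteria. A minimal sketch of such a rubric (the field names mirror the submit action; the helper names, keyword lists, and exact matching details are illustrative assumptions, not the repository's implementation):

```python
from typing import Dict, List

SEVERITIES = ["P1", "P2", "P3", "P4"]

def f1(predicted: List[str], expected: List[str]) -> float:
    """F1 overlap between predicted and expected service sets."""
    tp = len(set(predicted) & set(expected))
    if tp == 0:
        return 0.0
    precision = tp / len(set(predicted))
    recall = tp / len(set(expected))
    return 2 * precision * recall / (precision + recall)

def severity_score(predicted: str, expected: str) -> float:
    """Exact match scores 1.0; an adjacent level (e.g. P2 vs P3) scores 0.5."""
    gap = abs(SEVERITIES.index(predicted) - SEVERITIES.index(expected))
    return 1.0 if gap == 0 else 0.5 if gap == 1 else 0.0

def keyword_recall(text: str, keywords: List[str]) -> float:
    """Fraction of expected remediation keywords present in the free text."""
    hits = sum(1 for kw in keywords if kw.lower() in text.lower())
    return hits / len(keywords)

def grade(submission: Dict, answer: Dict) -> float:
    """Weighted rubric over the five submit fields, per the table above."""
    return (
        0.35 * (submission["root_cause_service"] == answer["root_cause_service"])
        + 0.25 * (submission["root_cause_type"] == answer["root_cause_type"])
        + 0.15 * f1(submission["affected_services"], answer["affected_services"])
        + 0.10 * severity_score(submission["severity"], answer["severity"])
        + 0.15 * keyword_recall(submission["recommended_action"], answer["keywords"])
    )
```

Because every component is a pure function of the submission and the scenario's answer key, scores are fully reproducible across runs.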

## Reward Shaping

| Event | Reward |
|---|---|
| Successful query | +0.02 |
| Annotation | +0.01 |
| Duplicate query | -0.05 |
| Submit | grader score (0.0-1.0) |
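
Per-step rewards can be read off the table directly; a sketch (the event names are illustrative, not the environment's internal identifiers):

```python
from typing import Optional

# Small shaping terms keep exploration cheap but not free;
# the submit step pays out the rubric score itself.
STEP_REWARDS = {"query_ok": 0.02, "annotate": 0.01, "duplicate_query": -0.05}

def step_reward(event: str, grader_score: Optional[float] = None) -> float:
    """Shaped reward for one step, per the table above."""
    if event == "submit":
        return grader_score if grader_score is not None else 0.0
    return STEP_REWARDS.get(event, 0.0)
```

With a 12-query budget, shaping contributes at most +0.24 before submission, so the grader score dominates the episode return.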

## Baseline Scores (gpt-4o-mini)

| Task | Score |
|---|---|
| Easy | 0.87 |
| Medium | 0.62 |
| Hard | 0.28 |
| **Average** | **0.59** |

## Setup

```bash
# Local
pip install openenv-core uvicorn fastapi
uvicorn server.app:app --port 8000

# Docker
docker build -t sre-env .
docker run -d -p 8000:8000 sre-env

# Inference
export OPENAI_API_KEY=sk-...
export ENV_BASE_URL=http://localhost:8000
python inference.py --all-tasks
```

## Quick Start

```python
import asyncio

from client import SREEnvClient
from models import SREAction

# Sync usage (simplest)
with SREEnvClient(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task_id="sre-easy-001")

    result = env.step(SREAction(action_type="query_alerts"))
    result = env.step(SREAction(
        action_type="query_logs",
        service="payment-service", log_level="ERROR", time_window_minutes=60))
    result = env.step(SREAction(
        action_type="query_metrics",
        metric_name="memory_usage"))

    result = env.step(SREAction(
        action_type="submit",
        root_cause_service="payment-service",
        root_cause_type="resource_exhaustion",
        affected_services=["payment-service", "api-gateway", "order-service"],
        severity="P2",
        recommended_action="Increase JVM heap memory limit to prevent OOM kills",
        confidence=0.95,
    ))
    print(f"Score: {result.observation.grader_score:.4f}")

# Async usage (for training loops)
async def main():
    async with SREEnvClient(base_url="http://localhost:8000") as env:
        result = await env.reset_async(task_id="sre-easy-001")
        result = await env.step_async(SREAction(action_type="query_alerts"))
        result = await env.step_async(SREAction(
            action_type="submit",
            root_cause_service="payment-service",
            root_cause_type="resource_exhaustion",
            affected_services=["payment-service", "api-gateway", "order-service"],
            severity="P2",
            recommended_action="Increase JVM heap memory limit",
            confidence=0.95,
        ))
        print(f"Score: {result.observation.grader_score:.4f}")

asyncio.run(main())
```

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Start episode (`task_id` or `difficulty`) |
| `/step` | POST | Execute action |
| `/state` | GET | Current state |
| `/schema` | GET | JSON schemas |
| `/ws` | WebSocket | Persistent session for training |
| `/web` | GET | Interactive web UI |
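
Clients in any language can drive an episode over plain HTTP. A stdlib-only sketch (the JSON field names are assumptions; check `/schema` for the authoritative shapes):

```python
import json
from urllib import request

BASE = "http://localhost:8000"

def post(path: str, payload: dict, base: str = BASE) -> dict:
    """POST a JSON body to the environment server and decode the JSON response."""
    req = request.Request(
        f"{base}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Usage (requires a running server):
#   post("/reset", {"task_id": "sre-easy-001"})
#   post("/step", {"action": {"action_type": "query_alerts"}})
```

For training loops, the `/ws` WebSocket session avoids per-step connection overhead; the HTTP endpoints are handier for debugging and one-off probes.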

## Project Structure

```
sre_env/
├── models.py            # Pydantic models
├── client.py            # WebSocket client
├── inference.py         # Baseline agent (OpenAI client)
├── openenv.yaml         # Spec manifest
├── pyproject.toml
├── Dockerfile
├── tasks/
│   └── scenarios.py     # 3 tasks + graders
└── server/
    ├── app.py           # FastAPI server
    └── sre_environment.py
```
|