---
title: SRE Incident Investigation Environment
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
- openenv
- reinforcement-learning
- agent
- evaluation
pinned: false
base_path: /web
---
# SRE Incident Investigation Environment
[![OpenEnv](https://img.shields.io/badge/OpenEnv-compatible-blue)](https://github.com/meta-pytorch/OpenEnv)
A production-grade OpenEnv environment where an AI agent acts as an on-call **Site Reliability Engineer**: querying logs, metrics, and alerts to diagnose realistic system failures, then submitting a structured incident report graded by a deterministic rubric.
## Why This Exists
Every company running cloud infrastructure deals with production incidents. Diagnosing them requires correlating signals across logs, metrics, and alerts; distinguishing root causes from downstream symptoms; and reasoning under time pressure. This remains a genuine capability gap for current LLMs, and few existing RL benchmarks test it.
## Action Space
```python
class SREAction(Action):
    action_type: Literal["query_logs", "query_metrics", "query_alerts", "annotate", "submit"]
    service: Optional[str]                 # filter logs by service
    log_level: Optional[str]               # DEBUG|INFO|WARN|ERROR|FATAL
    time_window_minutes: Optional[int]     # default 30, max 120
    log_query: Optional[str]               # keyword search
    metric_name: Optional[str]             # error_rate|latency_p99|latency_p50|
                                           # cpu_usage|memory_usage|db_connections|
                                           # request_rate|cache_hit_rate
    note: Optional[str]                    # annotation text
    root_cause_service: Optional[str]      # submit: service name
    root_cause_type: Optional[str]         # submit: failure category
    affected_services: Optional[List[str]] # submit: blast radius
    severity: Optional[str]                # submit: P1|P2|P3|P4
    recommended_action: Optional[str]      # submit: remediation text
    confidence: Optional[float]            # submit: 0.0-1.0
```
## Observation Space
```python
class SREObservation(Observation):
    action_taken: str
    logs: List[Dict]                 # [{timestamp, service, level, message}]
    metrics: List[Dict]              # [{timestamp, value}]
    metric_name: Optional[str]
    alerts: List[Dict]               # [{alert_name, service, severity, fired_at, message, status}]
    annotation_accepted: bool
    grader_score: Optional[float]    # 0.0-1.0, set after submit
    grader_breakdown: Optional[Dict]
    message: str
    queries_remaining: int           # budget: 12 per episode
    done: bool
    reward: float
```
## Tasks
| ID | Difficulty | Title | Root Cause |
|---|---|---|---|
| `sre-easy-001` | Easy | Checkout Failures: Payment Service Crashing | payment-service OOM crash |
| `sre-medium-002` | Medium | Order Outage: DB Connection Pool Exhaustion | analytics-service holding all DB connections |
| `sre-hard-003` | Hard | Silent Revenue Corruption | Feature flag changes product ID format, breaking cart pricing silently |
## Grader (Deterministic, No LLM Judge)
| Criterion | Weight | Method |
|---|---|---|
| `root_cause_service` | 0.35 | Exact match |
| `root_cause_type` | 0.25 | Exact match |
| `affected_services` | 0.15 | F1 score |
| `severity` | 0.10 | Exact = 1.0, adjacent = 0.5 |
| `recommended_action` | 0.15 | Keyword recall |
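The rubric above can be sketched as a single weighted sum. This is a minimal illustration, not the environment's actual grader: the field names mirror the submit action, but the answer-key structure, adjacency rule, and keyword matching details are assumptions.

```python
# Hedged sketch of the deterministic rubric described in the table above.
# The answer_key shape (including its "keywords" list) is an assumption.

def f1(predicted, expected):
    """F1 score between predicted and expected sets of service names."""
    p, e = set(predicted), set(expected)
    tp = len(p & e)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(e)
    return 2 * precision * recall / (precision + recall)

SEVERITIES = ["P1", "P2", "P3", "P4"]

def severity_score(predicted, expected):
    """Exact match = 1.0, adjacent level = 0.5, otherwise 0.0."""
    diff = abs(SEVERITIES.index(predicted) - SEVERITIES.index(expected))
    return {0: 1.0, 1: 0.5}.get(diff, 0.0)

def keyword_recall(text, keywords):
    """Fraction of expected remediation keywords found in the text."""
    found = sum(1 for kw in keywords if kw.lower() in text.lower())
    return found / len(keywords) if keywords else 0.0

def grade(submission, answer_key):
    """Weighted 0.0-1.0 score over the five submit fields."""
    return (
        0.35 * (submission["root_cause_service"] == answer_key["root_cause_service"])
        + 0.25 * (submission["root_cause_type"] == answer_key["root_cause_type"])
        + 0.15 * f1(submission["affected_services"], answer_key["affected_services"])
        + 0.10 * severity_score(submission["severity"], answer_key["severity"])
        + 0.15 * keyword_recall(submission["recommended_action"], answer_key["keywords"])
    )
```

A perfect submission scores 1.0; partial credit flows through the F1, adjacency, and keyword-recall terms.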
## Reward Shaping
| Event | Reward |
|---|---|
| Successful query | +0.02 |
| Annotation | +0.01 |
| Duplicate query | -0.05 |
| Submit | grader score (0.0-1.0) |
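Put together, the per-step schedule looks roughly like the following sketch. The duplicate-detection key (a tuple of the query's fields) is an assumption about how the environment recognizes repeated queries.

```python
# Minimal sketch of the reward schedule from the table above.
# action_key: hashable summary of the action, e.g.
#             ("query_logs", "payment-service", "ERROR")

def step_reward(action_key, seen_queries, grader_score=None):
    """Return the shaped reward for one step; mutates seen_queries."""
    if grader_score is not None:       # submit: reward is the rubric score
        return grader_score
    if action_key[0] == "annotate":    # small bonus for annotations
        return 0.01
    if action_key in seen_queries:     # repeated query is penalized
        return -0.05
    seen_queries.add(action_key)       # first time: small query bonus
    return 0.02
```

The shaping terms are small relative to the grader score, so the dominant signal stays the quality of the final incident report.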
## Baseline Scores (gpt-4o-mini)
| Task | Score |
|---|---|
| Easy | 0.87 |
| Medium | 0.62 |
| Hard | 0.28 |
| **Average** | **0.59** |
## Setup
```bash
# Local
pip install openenv-core uvicorn fastapi
uvicorn server.app:app --port 8000
# Docker
docker build -t sre-env .
docker run -d -p 8000:8000 sre-env
# Inference
export OPENAI_API_KEY=sk-...
export ENV_BASE_URL=http://localhost:8000
python inference.py --all-tasks
```
## Quick Start
```python
from client import SREEnvClient
from models import SREAction
# Sync usage (simplest)
with SREEnvClient(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task_id="sre-easy-001")
    result = env.step(SREAction(action_type="query_alerts"))
    result = env.step(SREAction(action_type="query_logs",
                                service="payment-service", log_level="ERROR",
                                time_window_minutes=60))
    result = env.step(SREAction(action_type="query_metrics",
                                metric_name="memory_usage"))
    result = env.step(SREAction(
        action_type="submit",
        root_cause_service="payment-service",
        root_cause_type="resource_exhaustion",
        affected_services=["payment-service", "api-gateway", "order-service"],
        severity="P2",
        recommended_action="Increase JVM heap memory limit to prevent OOM kills",
        confidence=0.95,
    ))
    print(f"Score: {result.observation.grader_score:.4f}")

# Async usage (for training loops)
import asyncio

async def main():
    async with SREEnvClient(base_url="http://localhost:8000") as env:
        result = await env.reset_async(task_id="sre-easy-001")
        result = await env.step_async(SREAction(action_type="query_alerts"))
        result = await env.step_async(SREAction(
            action_type="submit",
            root_cause_service="payment-service",
            root_cause_type="resource_exhaustion",
            affected_services=["payment-service", "api-gateway", "order-service"],
            severity="P2",
            recommended_action="Increase JVM heap memory limit",
            confidence=0.95,
        ))
        print(f"Score: {result.observation.grader_score:.4f}")

asyncio.run(main())
```
## API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Start episode (`task_id` or `difficulty`) |
| `/step` | POST | Execute action |
| `/state` | GET | Current state |
| `/schema` | GET | JSON schemas |
| `/ws` | WebSocket | Persistent session for training |
| `/web` | GET | Interactive web UI |
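For clients that bypass `SREEnvClient` and hit `/reset` and `/step` directly, the JSON bodies can be built as below. This is a hedged sketch: the endpoint paths come from the table above, but the exact body shapes are assumptions inferred from the models; `/schema` is authoritative.

```python
# Hedged sketch of the raw JSON bodies behind POST /reset and POST /step.
# Field names are inferred from SREAction above; verify against /schema.
import json

def reset_body(task_id=None, difficulty=None):
    """POST /reset accepts either a task_id or a difficulty (per the table)."""
    assert (task_id is None) != (difficulty is None), "pass exactly one"
    return json.dumps({"task_id": task_id} if task_id else {"difficulty": difficulty})

def step_body(**action_fields):
    """POST /step wraps one SREAction; unset optional fields are omitted."""
    return json.dumps({"action": action_fields})

# Example: the first two steps of the Quick Start, as raw request bodies.
print(reset_body(task_id="sre-easy-001"))
print(step_body(action_type="query_logs", service="payment-service",
                log_level="ERROR", time_window_minutes=60))
```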
## Project Structure
```
sre_env/
├── models.py            # Pydantic models
├── client.py            # WebSocket client
├── inference.py         # Baseline agent (OpenAI client)
├── openenv.yaml         # Spec manifest
├── pyproject.toml
├── Dockerfile
├── tasks/
│   └── scenarios.py     # 3 tasks + graders
└── server/
    ├── app.py           # FastAPI server
    └── sre_environment.py
```