---
title: SRE Incident Investigation Environment
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
  - openenv
  - reinforcement-learning
  - agent
  - evaluation
pinned: false
base_path: /web
---
SRE Incident Investigation Environment
A production-grade OpenEnv environment where an AI agent acts as an on-call Site Reliability Engineer, querying logs, metrics, and alerts to diagnose real-world system failures, then submitting a structured incident report graded by a deterministic rubric.
Why This Exists
Teams running cloud infrastructure handle production incidents routinely. Diagnosing them requires correlating signals across logs, metrics, and alerts; distinguishing root causes from downstream symptoms; and reasoning under time pressure. Current LLMs still struggle with this kind of investigation, and to our knowledge no existing RL benchmark tests it.
Action Space
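    from typing import List, Literal, Optional

    class SREAction(Action):  # Action is the OpenEnv base action class
        action_type: Literal["query_logs", "query_metrics", "query_alerts", "annotate", "submit"]
        # Query parameters (default to None so unused fields can be omitted)
        service: Optional[str] = None
        log_level: Optional[str] = None
        time_window_minutes: Optional[int] = None
        log_query: Optional[str] = None
        metric_name: Optional[str] = None
        # Annotation text
        note: Optional[str] = None
        # Incident report fields, used with action_type="submit"
        root_cause_service: Optional[str] = None
        root_cause_type: Optional[str] = None
        affected_services: Optional[List[str]] = None
        severity: Optional[str] = None
        recommended_action: Optional[str] = None
        confidence: Optional[float] = None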
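As a rough guide to which fields pair with which `action_type`, the sketch below checks a plain-dict action before it is sent. The `REQUIRED_FIELDS` mapping is an assumption inferred from the field names and the Quick Start calls, not the environment's actual validation logic, and `missing_fields` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Assumed pairing of action types to the fields they need (illustrative only;
# the server's real validation rules may differ).
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "query_logs": ["service"],
    "query_metrics": ["metric_name"],
    "query_alerts": [],
    "annotate": ["note"],
    "submit": [
        "root_cause_service",
        "root_cause_type",
        "affected_services",
        "severity",
        "recommended_action",
    ],
}


def missing_fields(action: Dict[str, Any]) -> List[str]:
    """Return the assumed-required fields that are absent or None."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [f for f in REQUIRED_FIELDS[action_type] if action.get(f) is None]


print(missing_fields({"action_type": "query_alerts"}))
print(missing_fields({"action_type": "submit", "severity": "P2"}))
```

A check like this can catch an incomplete `submit` before it ends the episode.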
Observation Space
    class SREObservation(Observation):
        action_taken: str
        logs: List[Dict]
        metrics: List[Dict]
        metric_name: Optional[str]
        alerts: List[Dict]
        annotation_accepted: bool
        grader_score: Optional[float]
        grader_breakdown: Optional[Dict]
        message: str
        queries_remaining: int
        done: bool
        reward: float
Tasks
| ID | Difficulty | Title | Root Cause |
|----|------------|-------|------------|
| sre-easy-001 | Easy | Checkout Failures: Payment Service Crashing | payment-service OOM crash |
| sre-medium-002 | Medium | Order Outage: DB Connection Pool Exhaustion | analytics-service holding all DB connections |
| sre-hard-003 | Hard | Silent Revenue Corruption | Feature flag changes product ID format, breaking cart pricing silently |
Grader (Deterministic, No LLM Judge)
| Criterion | Weight | Method |
|-----------|--------|--------|
| root_cause_service | 0.35 | Exact match |
| root_cause_type | 0.25 | Exact match |
| affected_services | 0.15 | F1 score |
| severity | 0.10 | Exact = 1.0, adjacent = 0.5 |
| recommended_action | 0.15 | Keyword recall |
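To make the rubric concrete, here is a minimal sketch of how such a weighted score could be computed. It is not the environment's grader: the function names, the P1 to P4 severity scale, the set-based F1, the case-insensitive substring matching for keywords, and the ground-truth values in the example are all assumptions for illustration. Only the weights and scoring methods come from the table.

```python
from typing import Dict, List

# Weights taken from the rubric table; they sum to 1.0.
WEIGHTS = {
    "root_cause_service": 0.35,
    "root_cause_type": 0.25,
    "affected_services": 0.15,
    "severity": 0.10,
    "recommended_action": 0.15,
}
SEVERITIES = ["P1", "P2", "P3", "P4"]  # assumed severity scale and ordering


def f1(predicted: List[str], expected: List[str]) -> float:
    """Set-based F1 between predicted and expected service names."""
    pred, exp = set(predicted), set(expected)
    tp = len(pred & exp)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(exp)
    return 2 * precision * recall / (precision + recall)


def severity_score(predicted: str, expected: str) -> float:
    """1.0 for an exact match, 0.5 for a neighbouring level, else 0.0."""
    if predicted == expected:
        return 1.0
    if predicted in SEVERITIES and expected in SEVERITIES:
        if abs(SEVERITIES.index(predicted) - SEVERITIES.index(expected)) == 1:
            return 0.5
    return 0.0


def keyword_recall(text: str, keywords: List[str]) -> float:
    """Fraction of expected keywords found in the text (case-insensitive)."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    return sum(k.lower() in lowered for k in keywords) / len(keywords)


def grade(report: Dict, truth: Dict) -> Dict[str, float]:
    """Return the weighted contribution of each criterion plus the total."""
    parts = {
        "root_cause_service": float(report["root_cause_service"] == truth["root_cause_service"]),
        "root_cause_type": float(report["root_cause_type"] == truth["root_cause_type"]),
        "affected_services": f1(report["affected_services"], truth["affected_services"]),
        "severity": severity_score(report["severity"], truth["severity"]),
        "recommended_action": keyword_recall(report["recommended_action"], truth["keywords"]),
    }
    breakdown = {name: WEIGHTS[name] * score for name, score in parts.items()}
    breakdown["total"] = round(sum(breakdown.values()), 4)
    return breakdown


# Hypothetical ground truth; the real scenario definitions are not shown here.
truth = {
    "root_cause_service": "payment-service",
    "root_cause_type": "resource_exhaustion",
    "affected_services": ["payment-service", "api-gateway", "order-service"],
    "severity": "P1",
    "keywords": ["heap", "memory"],
}
report = {
    "root_cause_service": "payment-service",
    "root_cause_type": "resource_exhaustion",
    "affected_services": ["payment-service", "api-gateway", "order-service"],
    "severity": "P2",
    "recommended_action": "Increase JVM heap memory limit to prevent OOM kills",
}
result = grade(report, truth)
print(result["total"])
```

In this made-up case the report matches everything except severity, which is one level off and earns half of its 0.10 weight, so the total is 0.95.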
Reward Shaping
| Event | Reward |
|-------|--------|
| Successful query | +0.02 |
| Annotation | +0.01 |
| Duplicate query | -0.05 |
| Submit | grader score (0.0-1.0) |
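As a worked example of the shaping table, the sketch below totals the rewards for one episode. It assumes the episode return is the plain undiscounted sum of per-step rewards, which the table does not state explicitly; `STEP_REWARDS` and `episode_return` are hypothetical names.

```python
from typing import List

# Per-step shaping rewards from the table above.
STEP_REWARDS = {"query": 0.02, "annotation": 0.01, "duplicate_query": -0.05}


def episode_return(events: List[str], grader_score: float) -> float:
    """Sum the shaped step rewards and add the grader score from the final submit."""
    shaped = sum(STEP_REWARDS[event] for event in events)
    return round(shaped + grader_score, 4)


# Three useful queries, one repeated query, one annotation, then a 0.87 submit.
events = ["query", "query", "duplicate_query", "query", "annotation"]
print(episode_return(events, grader_score=0.87))
```

Here the shaping terms add up to +0.02 (0.06 + 0.01 - 0.05), so the return is 0.89; a single duplicate query cancels more than two successful ones.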
Baseline Scores (gpt-4o-mini)
| Task | Score |
|------|-------|
| Easy | 0.87 |
| Medium | 0.62 |
| Hard | 0.28 |
| Average | 0.59 |
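The reported average is consistent with an unweighted mean of the three task scores, which can be checked directly:

```python
from statistics import mean

# Baseline scores from the table above.
scores = {"easy": 0.87, "medium": 0.62, "hard": 0.28}
average = round(mean(scores.values()), 2)
print(average)  # 0.59
```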
Setup
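    # Option 1: run the server locally
    pip install openenv-core uvicorn fastapi
    uvicorn server.app:app --port 8000

    # Option 2: run the server in Docker
    docker build -t sre-env .
    docker run -d -p 8000:8000 sre-env

    # Run the baseline agent against the running server
    export OPENAI_API_KEY=sk-...
    export ENV_BASE_URL=http://localhost:8000
    python inference.py --all-tasks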
Quick Start
    from client import SREEnvClient
    from models import SREAction

    # Synchronous usage: gather evidence, then submit an incident report
    with SREEnvClient(base_url="http://localhost:8000").sync() as env:
        result = env.reset(task_id="sre-easy-001")
        result = env.step(SREAction(action_type="query_alerts"))
        result = env.step(SREAction(action_type="query_logs",
            service="payment-service", log_level="ERROR", time_window_minutes=60))
        result = env.step(SREAction(action_type="query_metrics",
            metric_name="memory_usage"))
        result = env.step(SREAction(
            action_type="submit",
            root_cause_service="payment-service",
            root_cause_type="resource_exhaustion",
            affected_services=["payment-service", "api-gateway", "order-service"],
            severity="P2",
            recommended_action="Increase JVM heap memory limit to prevent OOM kills",
            confidence=0.95,
        ))
        print(f"Score: {result.observation.grader_score:.4f}")
    # Asynchronous usage (same SREEnvClient and SREAction imports as the synchronous example)
    import asyncio

    async def main():
        async with SREEnvClient(base_url="http://localhost:8000") as env:
            result = await env.reset_async(task_id="sre-easy-001")
            result = await env.step_async(SREAction(action_type="query_alerts"))
            result = await env.step_async(SREAction(
                action_type="submit",
                root_cause_service="payment-service",
                root_cause_type="resource_exhaustion",
                affected_services=["payment-service", "api-gateway", "order-service"],
                severity="P2",
                recommended_action="Increase JVM heap memory limit",
                confidence=0.95,
            ))
            print(f"Score: {result.observation.grader_score:.4f}")

    asyncio.run(main())
API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| /health | GET | Health check |
| /reset | POST | Start episode (task_id or difficulty) |
| /step | POST | Execute action |
| /state | GET | Current state |
| /schema | GET | JSON schemas |
| /ws | WebSocket | Persistent session for training |
| /web | GET | Interactive web UI |
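For callers that prefer plain HTTP over the Python client, the sketch below builds requests for these endpoints using only the standard library. The paths and methods come from the table; the JSON body shapes are not documented here, so any payload passed in is an assumption to be checked against `/schema`. `build_request` and `call` are hypothetical helpers.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:8000"  # matches the port used in Setup


def build_request(path: str, payload: Optional[Dict[str, Any]] = None) -> urllib.request.Request:
    """Build a GET request when payload is None, otherwise a JSON POST."""
    url = BASE_URL + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, payload: Optional[Dict[str, Any]] = None, timeout: float = 10.0) -> Any:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `call("/health")` performs the health check and `call("/schema")` returns the JSON schemas, which show the exact bodies that `/reset` and `/step` expect. A body such as `{"task_id": "sre-easy-001"}` for `/reset` is a plausible guess based on the table, not a confirmed format.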
Project Structure
    sre_env/
    ├── models.py              # Pydantic models
    ├── client.py              # WebSocket client
    ├── inference.py           # Baseline agent (OpenAI client)
    ├── openenv.yaml           # Spec manifest
    ├── pyproject.toml
    ├── Dockerfile
    ├── tasks/
    │   └── scenarios.py       # 3 tasks + graders
    └── server/
        ├── app.py             # FastAPI server
        └── sre_environment.py