---
title: SRE Incident Investigation Environment
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
  - openenv
  - reinforcement-learning
  - agent
  - evaluation
pinned: false
base_path: /web
---

# SRE Incident Investigation Environment


A production-grade OpenEnv environment where an AI agent acts as an on-call Site Reliability Engineer: querying logs, metrics, and alerts to diagnose real-world system failures, then submitting a structured incident report graded by a deterministic rubric.

## Why This Exists

Every company running cloud infrastructure deals with production incidents daily. Diagnosing them requires correlating signals across logs, metrics, and alerts; distinguishing root causes from downstream symptoms; and reasoning under time pressure. This is a genuine capability gap for current LLMs. No existing RL benchmark tests it.

## Action Space

```python
class SREAction(Action):
    action_type: Literal["query_logs", "query_metrics", "query_alerts", "annotate", "submit"]
    service: Optional[str]                  # filter logs by service
    log_level: Optional[str]                # DEBUG|INFO|WARN|ERROR|FATAL
    time_window_minutes: Optional[int]      # default 30, max 120
    log_query: Optional[str]                # keyword search
    metric_name: Optional[str]              # error_rate|latency_p99|latency_p50|
                                            #   cpu_usage|memory_usage|db_connections|
                                            #   request_rate|cache_hit_rate
    note: Optional[str]                     # annotation text
    root_cause_service: Optional[str]       # submit: service name
    root_cause_type: Optional[str]          # submit: failure category
    affected_services: Optional[List[str]]  # submit: blast radius
    severity: Optional[str]                 # submit: P1|P2|P3|P4
    recommended_action: Optional[str]       # submit: remediation text
    confidence: Optional[float]             # submit: 0.0-1.0
```
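Only the fields relevant to the chosen `action_type` need to be set; the rest may stay null. As an illustrative sketch (the exact request envelope may differ), the JSON bodies for a log query and a final submission could look like this:

```python
import json

# Only fields relevant to the chosen action_type are populated.
log_query = {
    "action_type": "query_logs",
    "service": "payment-service",
    "log_level": "ERROR",
    "time_window_minutes": 60,
}

submission = {
    "action_type": "submit",
    "root_cause_service": "payment-service",
    "root_cause_type": "resource_exhaustion",
    "affected_services": ["payment-service", "api-gateway"],
    "severity": "P2",
    "recommended_action": "Raise the memory limit to stop OOM kills",
    "confidence": 0.9,
}

print(json.dumps(log_query, indent=2))
```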

## Observation Space

```python
class SREObservation(Observation):
    action_taken: str
    logs: List[Dict]                  # [{timestamp, service, level, message}]
    metrics: List[Dict]               # [{timestamp, value}]
    metric_name: Optional[str]
    alerts: List[Dict]                # [{alert_name, service, severity, fired_at, message, status}]
    annotation_accepted: bool
    grader_score: Optional[float]     # 0.0-1.0, set after submit
    grader_breakdown: Optional[Dict]
    message: str
    queries_remaining: int            # budget: 12 per episode
    done: bool
    reward: float
```
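An agent loop will typically aggregate what comes back; for instance, ranking services by ERROR/FATAL log volume to pick the next query target. A minimal helper over the `logs` dict shape above (not part of the environment API, just a sketch):

```python
from collections import Counter
from typing import Dict, List, Tuple

def error_hotspots(logs: List[Dict]) -> List[Tuple[str, int]]:
    """Rank services by ERROR/FATAL log volume, most suspicious first."""
    counts = Counter(
        entry["service"]
        for entry in logs
        if entry["level"] in ("ERROR", "FATAL")
    )
    return counts.most_common()

logs = [
    {"timestamp": "12:00:01", "service": "payment-service", "level": "ERROR", "message": "OOM"},
    {"timestamp": "12:00:02", "service": "payment-service", "level": "FATAL", "message": "killed"},
    {"timestamp": "12:00:03", "service": "api-gateway", "level": "WARN", "message": "slow upstream"},
]
print(error_hotspots(logs))  # [('payment-service', 2)]
```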

## Tasks

| ID | Difficulty | Title | Root Cause |
|---|---|---|---|
| sre-easy-001 | Easy | Checkout Failures - Payment Service Crashing | payment-service OOM crash |
| sre-medium-002 | Medium | Order Outage - DB Connection Pool Exhaustion | analytics-service holding all DB connections |
| sre-hard-003 | Hard | Silent Revenue Corruption | Feature flag changes product ID format, breaking cart pricing silently |

## Grader (Deterministic, No LLM Judge)

| Criterion | Weight | Method |
|---|---|---|
| root_cause_service | 0.35 | Exact match |
| root_cause_type | 0.25 | Exact match |
| affected_services | 0.15 | F1 score |
| severity | 0.10 | Exact = 1.0, adjacent = 0.5 |
| recommended_action | 0.15 | Keyword recall |
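The real grader lives in `tasks/scenarios.py`; the rubric above can be sketched as a single weighted sum. Field names mirror the submit action; `action_keywords` (per-task remediation keywords) is an assumed ground-truth field name, not the actual one:

```python
def grade(sub: dict, truth: dict) -> float:
    """Deterministic weighted rubric; returns a score in [0.0, 1.0]."""
    score = 0.0
    # Exact-match criteria
    score += 0.35 * (sub["root_cause_service"] == truth["root_cause_service"])
    score += 0.25 * (sub["root_cause_type"] == truth["root_cause_type"])
    # F1 between predicted and true blast radius
    pred, gold = set(sub["affected_services"]), set(truth["affected_services"])
    overlap = len(pred & gold)
    score += 0.15 * (2 * overlap / (len(pred) + len(gold)) if pred or gold else 0.0)
    # Severity: exact = 1.0, adjacent level (e.g. P2 vs P3) = 0.5
    levels = ["P1", "P2", "P3", "P4"]
    gap = abs(levels.index(sub["severity"]) - levels.index(truth["severity"]))
    score += 0.10 * (1.0 if gap == 0 else 0.5 if gap == 1 else 0.0)
    # Keyword recall over the free-text remediation
    kws = truth["action_keywords"]
    hits = sum(kw in sub["recommended_action"].lower() for kw in kws)
    score += 0.15 * (hits / len(kws) if kws else 0.0)
    return round(score, 4)
```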

## Reward Shaping

| Event | Reward |
|---|---|
| Successful query | +0.02 |
| Annotation | +0.01 |
| Duplicate query | -0.05 |
| Submit | Grader score (0.0-1.0) |
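Per-step shaping plus the terminal grader score gives the episode return. A sketch of the table above (the event names here are illustrative, not the environment's internal labels):

```python
def step_reward(event: str, grader_score: float = 0.0) -> float:
    """Map an episode event to its shaped reward (values from the table above)."""
    shaping = {
        "query_ok": 0.02,    # successful, novel query
        "annotate": 0.01,    # accepted annotation
        "duplicate": -0.05,  # repeated-query penalty
    }
    if event == "submit":
        return grader_score  # terminal reward is the rubric score in [0, 1]
    return shaping.get(event, 0.0)

# Example episode: 3 useful queries, 1 annotation, 1 duplicate, then submit at 0.87.
events = ["query_ok", "query_ok", "query_ok", "annotate", "duplicate"]
ret = sum(step_reward(e) for e in events) + step_reward("submit", 0.87)
print(round(ret, 4))  # 0.89
```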

## Baseline Scores (gpt-4o-mini)

| Task | Score |
|---|---|
| Easy | 0.87 |
| Medium | 0.62 |
| Hard | 0.28 |
| Average | 0.59 |

## Setup

```shell
# Local
pip install openenv-core uvicorn fastapi
uvicorn server.app:app --port 8000

# Docker
docker build -t sre-env .
docker run -d -p 8000:8000 sre-env

# Inference
export OPENAI_API_KEY=sk-...
export ENV_BASE_URL=http://localhost:8000
python inference.py --all-tasks
```

## Quick Start

```python
from client import SREEnvClient
from models import SREAction

# Sync usage (simplest)
with SREEnvClient(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task_id="sre-easy-001")

    result = env.step(SREAction(action_type="query_alerts"))
    result = env.step(SREAction(action_type="query_logs",
        service="payment-service", log_level="ERROR", time_window_minutes=60))
    result = env.step(SREAction(action_type="query_metrics",
        metric_name="memory_usage"))

    result = env.step(SREAction(
        action_type="submit",
        root_cause_service="payment-service",
        root_cause_type="resource_exhaustion",
        affected_services=["payment-service", "api-gateway", "order-service"],
        severity="P2",
        recommended_action="Increase JVM heap memory limit to prevent OOM kills",
        confidence=0.95,
    ))
    print(f"Score: {result.observation.grader_score:.4f}")

# Async usage (for training loops)
import asyncio

async def main():
    async with SREEnvClient(base_url="http://localhost:8000") as env:
        result = await env.reset_async(task_id="sre-easy-001")
        result = await env.step_async(SREAction(action_type="query_alerts"))
        result = await env.step_async(SREAction(
            action_type="submit",
            root_cause_service="payment-service",
            root_cause_type="resource_exhaustion",
            affected_services=["payment-service", "api-gateway", "order-service"],
            severity="P2",
            recommended_action="Increase JVM heap memory limit",
            confidence=0.95,
        ))
        print(f"Score: {result.observation.grader_score:.4f}")

asyncio.run(main())
```

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check |
| /reset | POST | Start episode (task_id or difficulty) |
| /step | POST | Execute action |
| /state | GET | Current state |
| /schema | GET | JSON schemas |
| /ws | WebSocket | Persistent session for training |
| /web | GET | Interactive web UI |

## Project Structure

```text
sre_env/
├── models.py           # Pydantic models
├── client.py           # WebSocket client
├── inference.py        # Baseline agent (OpenAI client)
├── openenv.yaml        # Spec manifest
├── pyproject.toml
├── Dockerfile
├── tasks/
│   └── scenarios.py    # 3 tasks + graders
└── server/
    ├── app.py          # FastAPI server
    └── sre_environment.py
```