title: SRE Incident Investigation Environment
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
  - openenv
  - reinforcement-learning
  - agent
  - evaluation
pinned: false
base_path: /web

SRE Incident Investigation Environment

A production-grade OpenEnv environment in which an AI agent acts as an on-call Site Reliability Engineer (SRE). The agent queries logs, metrics, and alerts to diagnose realistic system failures, then submits a structured incident report that is graded by a deterministic rubric.

Why This Exists

Teams that run cloud infrastructure deal with production incidents routinely. Diagnosing one requires correlating signals across logs, metrics, and alerts; distinguishing the root cause from downstream symptoms; and reasoning under time pressure. Current LLMs are weak at this, and we are not aware of an existing RL benchmark that tests it.

Action Space

class SREAction(Action):
    action_type: Literal["query_logs","query_metrics","query_alerts","annotate","submit"]
    service: Optional[str]                    # filter logs by service
    log_level: Optional[str]                  # DEBUG|INFO|WARN|ERROR|FATAL
    time_window_minutes: Optional[int]        # default 30, max 120
    log_query: Optional[str]                  # keyword search
    metric_name: Optional[str]                # error_rate|latency_p99|latency_p50|
                                              # cpu_usage|memory_usage|db_connections|
                                              # request_rate|cache_hit_rate
    note: Optional[str]                       # annotation text
    root_cause_service: Optional[str]         # submit: service name
    root_cause_type: Optional[str]            # submit: failure category
    affected_services: Optional[List[str]]    # submit: blast radius
    severity: Optional[str]                   # submit: P1|P2|P3|P4
    recommended_action: Optional[str]         # submit: remediation text
    confidence: Optional[float]               # submit: 0.0-1.0
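
Because every field except `action_type` is optional, an agent can easily send an incomplete `submit` action. The sketch below shows one way to check a payload on the client side before sending it. It assumes that all fields commented with "submit:" are needed for a full score, and that out-of-range time windows should be clamped rather than rejected; neither rule is confirmed by the environment, and `missing_submit_fields` and `clamp_time_window` are hypothetical helper names.

```python
from typing import Any, Dict, List, Optional

# Fields the action listing marks with "submit:". Treating all of them as
# required is an assumption of this sketch, not a documented rule.
SUBMIT_FIELDS = [
    "root_cause_service",
    "root_cause_type",
    "affected_services",
    "severity",
    "recommended_action",
    "confidence",
]


def missing_submit_fields(payload: Dict[str, Any]) -> List[str]:
    """Return the submit fields that are absent or None in a payload dict."""
    return [name for name in SUBMIT_FIELDS if payload.get(name) is None]


def clamp_time_window(minutes: Optional[int]) -> int:
    """Apply the documented default (30) and maximum (120) to a time window.

    The lower bound of 1 minute is an assumption.
    """
    if minutes is None:
        return 30
    return max(1, min(minutes, 120))


draft = {"action_type": "submit", "root_cause_service": "payment-service"}
print(missing_submit_fields(draft))
print(clamp_time_window(500))
```

A check like this is cheap to run before the final step, since an episode ends on `submit` and cannot be corrected afterwards.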

Observation Space

class SREObservation(Observation):
    action_taken: str
    logs: List[Dict]            # [{timestamp, service, level, message}]
    metrics: List[Dict]         # [{timestamp, value}]
    metric_name: Optional[str]
    alerts: List[Dict]          # [{alert_name, service, severity, fired_at, message, status}]
    annotation_accepted: bool
    grader_score: Optional[float]    # 0.0-1.0, set after submit
    grader_breakdown: Optional[Dict]
    message: str
    queries_remaining: int           # budget: 12 per episode
    done: bool
    reward: float
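
With only 12 queries per episode, an agent benefits from condensing each observation before deciding what to query next. As a minimal sketch, the helper below counts ERROR and FATAL entries per service from the `logs` list, relying only on the `{timestamp, service, level, message}` shape shown above. The function name `error_counts_by_service` and the sample entries are made up for illustration and are not taken from any task.

```python
from collections import Counter
from typing import Dict, Iterable, List


def error_counts_by_service(
    logs: List[Dict], levels: Iterable[str] = ("ERROR", "FATAL")
) -> Counter:
    """Count log entries per service whose level is in `levels`."""
    wanted = set(levels)
    counts: Counter = Counter()
    for entry in logs:
        if entry.get("level") in wanted:
            counts[entry.get("service", "unknown")] += 1
    return counts


# Illustrative entries only; real observations come from env.step(...).
sample_logs = [
    {"timestamp": "t0", "service": "payment-service", "level": "ERROR", "message": "heap usage high"},
    {"timestamp": "t1", "service": "payment-service", "level": "FATAL", "message": "process killed"},
    {"timestamp": "t2", "service": "api-gateway", "level": "ERROR", "message": "upstream unavailable"},
    {"timestamp": "t3", "service": "order-service", "level": "INFO", "message": "retrying request"},
]

print(error_counts_by_service(sample_logs).most_common(1))  # [('payment-service', 2)]
```

The noisiest service is not necessarily the root cause, so a count like this is a starting point for further queries rather than an answer.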

Tasks

ID	Difficulty	Title	Root Cause
`sre-easy-001`	Easy	Checkout Failures — Payment Service Crashing	payment-service OOM crash
`sre-medium-002`	Medium	Order Outage — DB Connection Pool Exhaustion	analytics-service holding all DB connections
`sre-hard-003`	Hard	Silent Revenue Corruption	Feature flag changes product ID format, breaking cart pricing silently

Grader (Deterministic, No LLM Judge)

Criterion	Weight	Method
`root_cause_service`	0.35	Exact match
`root_cause_type`	0.25	Exact match
`affected_services`	0.15	F1 score
`severity`	0.10	Exact = 1.0, adjacent = 0.5
`recommended_action`	0.15	Keyword recall
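
The rubric above can be sketched as a weighted sum of five per-criterion scores. This is an illustration of the table, not the code in `tasks/scenarios.py`. It assumes that "adjacent" means one severity level apart (for example P1 vs P2), that keyword recall is the fraction of expected keywords found as case-insensitive substrings, and that the ground-truth values and keyword list below are invented for the example.

```python
from typing import Dict, List, Set

WEIGHTS = {
    "root_cause_service": 0.35,
    "root_cause_type": 0.25,
    "affected_services": 0.15,
    "severity": 0.10,
    "recommended_action": 0.15,
}
SEVERITIES = ["P1", "P2", "P3", "P4"]


def f1(predicted: Set[str], expected: Set[str]) -> float:
    """Set-based F1 between predicted and expected service names."""
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def severity_score(predicted: str, expected: str) -> float:
    """1.0 for an exact match, 0.5 if one level apart (assumed), else 0.0."""
    if predicted not in SEVERITIES or expected not in SEVERITIES:
        return 0.0
    gap = abs(SEVERITIES.index(predicted) - SEVERITIES.index(expected))
    return {0: 1.0, 1: 0.5}.get(gap, 0.0)


def keyword_recall(text: str, keywords: List[str]) -> float:
    """Fraction of expected keywords that appear in the text."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    return sum(kw.lower() in lowered for kw in keywords) / len(keywords)


def grade(report: Dict, truth: Dict) -> float:
    parts = {
        "root_cause_service": float(report["root_cause_service"] == truth["root_cause_service"]),
        "root_cause_type": float(report["root_cause_type"] == truth["root_cause_type"]),
        "affected_services": f1(set(report["affected_services"]), set(truth["affected_services"])),
        "severity": severity_score(report["severity"], truth["severity"]),
        "recommended_action": keyword_recall(report["recommended_action"], truth["keywords"]),
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


# Invented ground truth and report, for illustration only.
truth = {"root_cause_service": "payment-service", "root_cause_type": "resource_exhaustion",
         "affected_services": ["payment-service", "api-gateway"], "severity": "P1",
         "keywords": ["memory", "limit"]}
report = {"root_cause_service": "payment-service", "root_cause_type": "resource_exhaustion",
          "affected_services": ["payment-service", "api-gateway", "order-service"],
          "severity": "P2", "recommended_action": "Increase JVM heap memory limit"}
print(round(grade(report, truth), 2))  # 0.92
```

In this example the report loses 0.03 for the extra service (F1 of 0.8) and 0.05 for the one-level severity miss, which shows how partial credit accumulates even when the root cause is correct.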

Reward Shaping

Event	Reward
Successful query	+0.02
Annotation	+0.01
Duplicate query	-0.05
Submit	grader score (0.0-1.0)
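
To see how the shaping terms combine with the final grade, the sketch below totals the reward for one episode. It assumes per-step rewards are simply summed with no discounting and that a duplicate query earns only the penalty, not the query bonus as well; the event labels (`"query"`, `"annotate"`, `"duplicate_query"`) are hypothetical names chosen for this example.

```python
from typing import List

# Per-event rewards from the table above; the keys are illustrative labels.
STEP_REWARDS = {
    "query": 0.02,
    "annotate": 0.01,
    "duplicate_query": -0.05,
}


def episode_return(events: List[str], grader_score: float) -> float:
    """Sum the shaping rewards for each event, then add the submit reward."""
    return sum(STEP_REWARDS[event] for event in events) + grader_score


# Three distinct queries, one annotation, one repeated query, then a submit
# that the grader scores at 0.92 (an invented value).
events = ["query", "query", "annotate", "query", "duplicate_query"]
print(round(episode_return(events, grader_score=0.92), 2))  # 0.94
```

The shaping terms are small next to the grader score, so under these assumptions the return is dominated by the quality of the final report.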

Baseline Scores (gpt-4o-mini)

Task	Score
Easy	0.87
Medium	0.62
Hard	0.28
Average	0.59

Setup

# Local
pip install openenv-core uvicorn fastapi
uvicorn server.app:app --port 8000

# Docker
docker build -t sre-env .
docker run -d -p 8000:8000 sre-env

# Inference
export OPENAI_API_KEY=sk-...
export ENV_BASE_URL=http://localhost:8000
python inference.py --all-tasks

Quick Start

from client import SREEnvClient
from models import SREAction

# Sync usage (simplest)
with SREEnvClient(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task_id="sre-easy-001")

    result = env.step(SREAction(action_type="query_alerts"))
    result = env.step(SREAction(action_type="query_logs",
        service="payment-service", log_level="ERROR", time_window_minutes=60))
    result = env.step(SREAction(action_type="query_metrics",
        metric_name="memory_usage"))

    result = env.step(SREAction(
        action_type="submit",
        root_cause_service="payment-service",
        root_cause_type="resource_exhaustion",
        affected_services=["payment-service", "api-gateway", "order-service"],
        severity="P2",
        recommended_action="Increase JVM heap memory limit to prevent OOM kills",
        confidence=0.95,
    ))
    print(f"Score: {result.observation.grader_score:.4f}")

# Async usage (for training loops)
import asyncio

async def main():
    async with SREEnvClient(base_url="http://localhost:8000") as env:
        result = await env.reset_async(task_id="sre-easy-001")
        result = await env.step_async(SREAction(action_type="query_alerts"))
        result = await env.step_async(SREAction(
            action_type="submit",
            root_cause_service="payment-service",
            root_cause_type="resource_exhaustion",
            affected_services=["payment-service", "api-gateway", "order-service"],
            severity="P2",
            recommended_action="Increase JVM heap memory limit",
            confidence=0.95,
        ))
        print(f"Score: {result.observation.grader_score:.4f}")

asyncio.run(main())

API Endpoints

Endpoint	Method	Description
`/health`	GET	Health check
`/reset`	POST	Start episode (`task_id` or `difficulty`)
`/step`	POST	Execute action
`/state`	GET	Current state
`/schema`	GET	JSON schemas
`/ws`	WebSocket	Persistent session for training
`/web`	GET	Interactive web UI
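
For clients that do not use `SREEnvClient`, the HTTP endpoints can be called directly. The sketch below builds a `POST /reset` request with the standard library. The JSON body shape `{"task_id": ...}` is an assumption based on the description in the table; consult `/schema` on a running server for the actual request format. `build_reset_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_reset_request(base_url: str, task_id: str) -> urllib.request.Request:
    """Build a POST /reset request; the body shape is assumed, not confirmed."""
    body = json.dumps({"task_id": task_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send a request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_reset_request("http://localhost:8000", "sre-easy-001")
print(request.get_method(), request.full_url)  # POST http://localhost:8000/reset
```

`send(request)` is not called here because it requires the server from the Setup section to be listening on port 8000.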

Project Structure

sre_env/
├── models.py           # Pydantic models
├── client.py           # WebSocket client
├── inference.py        # Baseline agent (OpenAI client)
├── openenv.yaml        # Spec manifest
├── pyproject.toml
├── Dockerfile
├── tasks/
│   └── scenarios.py    # 3 tasks + graders
└── server/
    ├── app.py          # FastAPI server
    └── sre_environment.py