---
title: SRE Incident Investigation Environment
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
  - openenv
  - reinforcement-learning
  - agent
  - evaluation
pinned: false
base_path: /web
---

# SRE Incident Investigation Environment

[![OpenEnv](https://img.shields.io/badge/OpenEnv-compatible-blue)](https://github.com/meta-pytorch/OpenEnv)

A production-grade OpenEnv environment in which an AI agent acts as an on-call **Site Reliability Engineer**: it queries logs, metrics, and alerts to diagnose real-world system failures, then submits a structured incident report that is graded by a deterministic rubric.

## Why This Exists

Teams running cloud infrastructure deal with production incidents routinely. Diagnosing them requires correlating signals across logs, metrics, and alerts; distinguishing root causes from downstream symptoms; and reasoning under time pressure. This is a genuine capability gap for current LLMs, and to our knowledge no existing RL benchmark tests it.

## Action Space

```python
class SREAction(Action):
    action_type: Literal["query_logs","query_metrics","query_alerts","annotate","submit"]
    service: Optional[str]                    # filter logs by service
    log_level: Optional[str]                  # DEBUG|INFO|WARN|ERROR|FATAL
    time_window_minutes: Optional[int]        # default 30, max 120
    log_query: Optional[str]                  # keyword search
    metric_name: Optional[str]                # error_rate|latency_p99|latency_p50|
                                              # cpu_usage|memory_usage|db_connections|
                                              # request_rate|cache_hit_rate
    note: Optional[str]                       # annotation text
    root_cause_service: Optional[str]         # submit: service name
    root_cause_type: Optional[str]            # submit: failure category
    affected_services: Optional[List[str]]    # submit: blast radius
    severity: Optional[str]                   # submit: P1|P2|P3|P4
    recommended_action: Optional[str]         # submit: remediation text
    confidence: Optional[float]               # submit: 0.0-1.0
```

## Observation Space

```python
class SREObservation(Observation):
    action_taken: str
    logs: List[Dict]            # [{timestamp, service, level, message}]
    metrics: List[Dict]         # [{timestamp, value}]
    metric_name: Optional[str]
    alerts: List[Dict]          # [{alert_name, service, severity, fired_at, message, status}]
    annotation_accepted: bool
    grader_score: Optional[float]    # 0.0-1.0, set after submit
    grader_breakdown: Optional[Dict]
    message: str
    queries_remaining: int           # budget: 12 per episode
    done: bool
    reward: float
```
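As a rough illustration of how an agent might use the `logs` field, the sketch below counts high-severity entries per service, using the `{timestamp, service, level, message}` shape listed above. The helper name `error_counts_by_service` and the sample entries are hypothetical and not part of this repository.

```python
from collections import Counter
from typing import Dict, List, Sequence


def error_counts_by_service(
    logs: List[Dict], levels: Sequence[str] = ("ERROR", "FATAL")
) -> Counter:
    """Count log entries per service whose level is in `levels`."""
    return Counter(
        entry["service"] for entry in logs if entry.get("level") in levels
    )


# Hypothetical sample data in the documented log-entry shape.
sample_logs = [
    {"timestamp": "t0", "service": "payment-service", "level": "ERROR", "message": "OOM"},
    {"timestamp": "t1", "service": "payment-service", "level": "FATAL", "message": "killed"},
    {"timestamp": "t2", "service": "api-gateway", "level": "WARN", "message": "slow upstream"},
    {"timestamp": "t3", "service": "order-service", "level": "ERROR", "message": "timeout"},
]

counts = error_counts_by_service(sample_logs)
noisiest_service, _ = counts.most_common(1)[0]
```

The service with the most errors is only a candidate root cause; downstream services often log more errors than the service that actually failed.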

## Tasks

| ID | Difficulty | Title | Root Cause |
|---|---|---|---|
| `sre-easy-001` | Easy | Checkout Failures — Payment Service Crashing | payment-service OOM crash |
| `sre-medium-002` | Medium | Order Outage — DB Connection Pool Exhaustion | analytics-service holding all DB connections |
| `sre-hard-003` | Hard | Silent Revenue Corruption | Feature flag changes product ID format, breaking cart pricing silently |

## Grader (Deterministic, No LLM Judge)

| Criterion | Weight | Method |
|---|---|---|
| `root_cause_service` | 0.35 | Exact match |
| `root_cause_type` | 0.25 | Exact match |
| `affected_services` | 0.15 | F1 score |
| `severity` | 0.10 | Exact = 1.0, adjacent = 0.5 |
| `recommended_action` | 0.15 | Keyword recall |
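
The weighted rubric above can be sketched in a few lines. This is a minimal illustration, not the code in `tasks/scenarios.py`: it assumes "adjacent" severity means exactly one level apart, that keyword recall is the fraction of expected keywords found as case-insensitive substrings, and that the ground truth is a dict with a hypothetical `action_keywords` list.

```python
from typing import Dict, Iterable, List

# Weights copied from the table above; they sum to 1.0.
WEIGHTS = {
    "root_cause_service": 0.35,
    "root_cause_type": 0.25,
    "affected_services": 0.15,
    "severity": 0.10,
    "recommended_action": 0.15,
}
SEVERITIES = ["P1", "P2", "P3", "P4"]


def f1(predicted: Iterable[str], expected: Iterable[str]) -> float:
    """Set-based F1 between predicted and expected service names."""
    pred, exp = set(predicted), set(expected)
    overlap = len(pred & exp)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(exp)
    return 2 * precision * recall / (precision + recall)


def severity_score(predicted: str, expected: str) -> float:
    """1.0 for an exact match, 0.5 if one level apart (assumed), else 0.0."""
    if predicted == expected:
        return 1.0
    if predicted in SEVERITIES and expected in SEVERITIES:
        if abs(SEVERITIES.index(predicted) - SEVERITIES.index(expected)) == 1:
            return 0.5
    return 0.0


def keyword_recall(text: str, keywords: List[str]) -> float:
    """Fraction of expected keywords that appear in the text (assumed rule)."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    return sum(1 for kw in keywords if kw.lower() in lowered) / len(keywords)


def grade(report: Dict, truth: Dict) -> Dict[str, float]:
    """Return the weighted contribution of each criterion plus a total."""
    parts = {
        "root_cause_service": float(
            report.get("root_cause_service") == truth["root_cause_service"]
        ),
        "root_cause_type": float(
            report.get("root_cause_type") == truth["root_cause_type"]
        ),
        "affected_services": f1(
            report.get("affected_services") or [], truth["affected_services"]
        ),
        "severity": severity_score(report.get("severity") or "", truth["severity"]),
        "recommended_action": keyword_recall(
            report.get("recommended_action") or "", truth["action_keywords"]
        ),
    }
    breakdown = {name: WEIGHTS[name] * parts[name] for name in WEIGHTS}
    breakdown["total"] = round(sum(breakdown.values()), 4)
    return breakdown
```

Because every criterion is a pure function of the submitted report and the stored ground truth, the same report always receives the same score, which is what makes the rubric usable as an RL reward without an LLM judge.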

## Reward Shaping

| Event | Reward |
|---|---|
| Successful query | +0.02 |
| Annotation | +0.01 |
| Duplicate query | -0.05 |
| Submit | grader score (0.0-1.0) |
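
To make the table concrete, here is a small sketch of how the per-step rewards could be tallied over an episode. It assumes a query counts as a duplicate when its action type and all filter fields match an earlier query, and that every non-duplicate query is "successful"; the `RewardTracker` class is hypothetical and the server's actual rules may differ.

```python
from typing import Dict, Optional, Set, Tuple

QUERY_TYPES = {"query_logs", "query_metrics", "query_alerts"}
# Fields that distinguish one query from another (assumed).
QUERY_FIELDS = (
    "service", "log_level", "time_window_minutes", "log_query", "metric_name",
)


class RewardTracker:
    """Accumulates shaped rewards using the values from the table above."""

    def __init__(self) -> None:
        self.seen: Set[Tuple] = set()
        self.total = 0.0

    def step_reward(
        self, action: Dict, grader_score: Optional[float] = None
    ) -> float:
        kind = action["action_type"]
        if kind in QUERY_TYPES:
            key = (kind,) + tuple(action.get(field) for field in QUERY_FIELDS)
            reward = -0.05 if key in self.seen else 0.02
            self.seen.add(key)
        elif kind == "annotate":
            reward = 0.01
        elif kind == "submit":
            # The final reward is the grader score in [0.0, 1.0].
            reward = grader_score if grader_score is not None else 0.0
        else:
            raise ValueError(f"unknown action_type: {kind!r}")
        self.total += reward
        return reward
```

Under these assumptions the shaping terms stay small relative to the submit reward: with a budget of 12 queries, exploration bonuses add at most 0.24, so the grader score dominates the episode return.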

## Baseline Scores (gpt-4o-mini)

| Task | Score |
|---|---|
| Easy | 0.87 |
| Medium | 0.62 |
| Hard | 0.28 |
| **Average** | **0.59** |

## Setup

```bash
# Local
pip install openenv-core uvicorn fastapi
uvicorn server.app:app --port 8000

# Docker
docker build -t sre-env .
docker run -d -p 8000:8000 sre-env

# Inference
export OPENAI_API_KEY=sk-...
export ENV_BASE_URL=http://localhost:8000
python inference.py --all-tasks
```

## Quick Start

```python
from client import SREEnvClient
from models import SREAction

# Sync usage (simplest)
with SREEnvClient(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task_id="sre-easy-001")

    result = env.step(SREAction(action_type="query_alerts"))
    result = env.step(SREAction(action_type="query_logs",
        service="payment-service", log_level="ERROR", time_window_minutes=60))
    result = env.step(SREAction(action_type="query_metrics",
        metric_name="memory_usage"))

    result = env.step(SREAction(
        action_type="submit",
        root_cause_service="payment-service",
        root_cause_type="resource_exhaustion",
        affected_services=["payment-service", "api-gateway", "order-service"],
        severity="P2",
        recommended_action="Increase JVM heap memory limit to prevent OOM kills",
        confidence=0.95,
    ))
    print(f"Score: {result.observation.grader_score:.4f}")

# Async usage (for training loops)
import asyncio

async def main():
    async with SREEnvClient(base_url="http://localhost:8000") as env:
        result = await env.reset_async(task_id="sre-easy-001")
        result = await env.step_async(SREAction(action_type="query_alerts"))
        result = await env.step_async(SREAction(
            action_type="submit",
            root_cause_service="payment-service",
            root_cause_type="resource_exhaustion",
            affected_services=["payment-service", "api-gateway", "order-service"],
            severity="P2",
            recommended_action="Increase JVM heap memory limit",
            confidence=0.95,
        ))
        print(f"Score: {result.observation.grader_score:.4f}")

asyncio.run(main())
```

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Start episode (`task_id` or `difficulty`) |
| `/step` | POST | Execute action |
| `/state` | GET | Current state |
| `/schema` | GET | JSON schemas |
| `/ws` | WebSocket | Persistent session for training |
| `/web` | GET | Interactive web UI |
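
For callers that prefer plain HTTP over the bundled client, the sketch below builds requests for `/health` and `/reset` with the standard library. The JSON body shape for `/reset` (a flat object with `task_id` or `difficulty`) is an assumption based on the table; check `/schema` on a running server for the actual request format. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def health_request(base_url: str) -> urllib.request.Request:
    """Build a GET request for the health check endpoint."""
    return urllib.request.Request(f"{base_url.rstrip('/')}/health", method="GET")


def reset_request(
    base_url: str,
    task_id: Optional[str] = None,
    difficulty: Optional[str] = None,
) -> urllib.request.Request:
    """Build a POST request for /reset (body shape is an assumption)."""
    body = {}
    if task_id is not None:
        body["task_id"] = task_id
    if difficulty is not None:
        body["difficulty"] = difficulty
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/reset",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the server running, a request can be sent with `urllib.request.urlopen(health_request("http://localhost:8000"))`. For multi-step episodes, the `/ws` endpoint or the bundled client is the better fit, since it keeps a persistent session.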

## Project Structure

```
sre_env/
├── models.py           # Pydantic models
├── client.py           # WebSocket client
├── inference.py        # Baseline agent (OpenAI client)
├── openenv.yaml        # Spec manifest
├── pyproject.toml
├── Dockerfile
├── tasks/
│   └── scenarios.py    # 3 tasks + graders
└── server/
    ├── app.py          # FastAPI server
    └── sre_environment.py
```