---
title: SRE Incident Investigation Environment
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
- openenv
- reinforcement-learning
- agent
- evaluation
pinned: false
base_path: /web
---

# SRE Incident Investigation Environment

[](https://github.com/meta-pytorch/OpenEnv)

A production-grade OpenEnv environment where an AI agent acts as an on-call **Site Reliability Engineer**: it queries logs, metrics, and alerts to diagnose realistic system failures, then submits a structured incident report graded by a deterministic rubric.

## Why This Exists

Every company running cloud infrastructure deals with production incidents daily. Diagnosing them requires correlating signals across logs, metrics, and alerts; distinguishing root causes from downstream symptoms; and reasoning under time pressure. This is a genuine capability gap for current LLMs, and no existing RL benchmark tests it.

## Action Space

```python
from typing import List, Literal, Optional

class SREAction(Action):  # subclasses the OpenEnv Action base
    action_type: Literal["query_logs", "query_metrics", "query_alerts", "annotate", "submit"]
    service: Optional[str]                  # filter logs by service
    log_level: Optional[str]                # DEBUG|INFO|WARN|ERROR|FATAL
    time_window_minutes: Optional[int]      # default 30, max 120
    log_query: Optional[str]                # keyword search
    metric_name: Optional[str]              # error_rate|latency_p99|latency_p50|
                                            # cpu_usage|memory_usage|db_connections|
                                            # request_rate|cache_hit_rate
    note: Optional[str]                     # annotation text
    root_cause_service: Optional[str]       # submit: service name
    root_cause_type: Optional[str]          # submit: failure category
    affected_services: Optional[List[str]]  # submit: blast radius
    severity: Optional[str]                 # submit: P1|P2|P3|P4
    recommended_action: Optional[str]       # submit: remediation text
    confidence: Optional[float]             # submit: 0.0-1.0
```

## Observation Space

```python
from typing import Dict, List, Optional

class SREObservation(Observation):
    action_taken: str
    logs: List[Dict]                 # [{timestamp, service, level, message}]
    metrics: List[Dict]              # [{timestamp, value}]
    metric_name: Optional[str]
    alerts: List[Dict]               # [{alert_name, service, severity, fired_at, message, status}]
    annotation_accepted: bool
    grader_score: Optional[float]    # 0.0-1.0, set after submit
    grader_breakdown: Optional[Dict]
    message: str
    queries_remaining: int           # budget: 12 per episode
    done: bool
    reward: float
```

## Tasks

| ID | Difficulty | Title | Root Cause |
|---|---|---|---|
| `sre-easy-001` | Easy | Checkout Failures → Payment Service Crashing | payment-service OOM crash |
| `sre-medium-002` | Medium | Order Outage → DB Connection Pool Exhaustion | analytics-service holding all DB connections |
| `sre-hard-003` | Hard | Silent Revenue Corruption | Feature flag changes product ID format, silently breaking cart pricing |

## Grader (Deterministic, No LLM Judge)

| Criterion | Weight | Method |
|---|---|---|
| `root_cause_service` | 0.35 | Exact match |
| `root_cause_type` | 0.25 | Exact match |
| `affected_services` | 0.15 | F1 score |
| `severity` | 0.10 | Exact = 1.0, adjacent = 0.5 |
| `recommended_action` | 0.15 | Keyword recall |
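
The final score is the weighted sum of the five criteria. A minimal sketch of such a rubric (the field names mirror the submit action; the helper names, keyword lists, and exact matching details are illustrative assumptions, not the repository's implementation):

```python
from typing import Dict, List

SEVERITIES = ["P1", "P2", "P3", "P4"]

def f1(predicted: List[str], expected: List[str]) -> float:
    """F1 overlap between predicted and expected service sets."""
    tp = len(set(predicted) & set(expected))
    if tp == 0:
        return 0.0
    precision = tp / len(set(predicted))
    recall = tp / len(set(expected))
    return 2 * precision * recall / (precision + recall)

def severity_score(predicted: str, expected: str) -> float:
    """Exact match scores 1.0; an adjacent level (e.g. P2 vs P3) scores 0.5."""
    gap = abs(SEVERITIES.index(predicted) - SEVERITIES.index(expected))
    return 1.0 if gap == 0 else 0.5 if gap == 1 else 0.0

def keyword_recall(text: str, keywords: List[str]) -> float:
    """Fraction of expected remediation keywords present in the free text."""
    hits = sum(1 for kw in keywords if kw.lower() in text.lower())
    return hits / len(keywords)

def grade(submission: Dict, answer: Dict) -> float:
    """Weighted rubric over the five submit fields, per the table above."""
    return (
        0.35 * (submission["root_cause_service"] == answer["root_cause_service"])
        + 0.25 * (submission["root_cause_type"] == answer["root_cause_type"])
        + 0.15 * f1(submission["affected_services"], answer["affected_services"])
        + 0.10 * severity_score(submission["severity"], answer["severity"])
        + 0.15 * keyword_recall(submission["recommended_action"], answer["keywords"])
    )
```

Because every component is a pure function of the submission and the scenario's answer key, scores are fully reproducible across runs.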

## Reward Shaping

| Event | Reward |
|---|---|
| Successful query | +0.02 |
| Annotation | +0.01 |
| Duplicate query | -0.05 |
| Submit | grader score (0.0-1.0) |
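
Per-step rewards can be read off the table directly; a sketch (the event names are illustrative, not the environment's internal identifiers):

```python
from typing import Optional

# Small shaping terms keep exploration cheap but not free;
# the submit step pays out the rubric score itself.
STEP_REWARDS = {"query_ok": 0.02, "annotate": 0.01, "duplicate_query": -0.05}

def step_reward(event: str, grader_score: Optional[float] = None) -> float:
    """Shaped reward for one step, per the table above."""
    if event == "submit":
        return grader_score if grader_score is not None else 0.0
    return STEP_REWARDS.get(event, 0.0)
```

With a 12-query budget, shaping contributes at most +0.24 before submission, so the grader score dominates the episode return.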

## Baseline Scores (gpt-4o-mini)

| Task | Score |
|---|---|
| Easy | 0.87 |
| Medium | 0.62 |
| Hard | 0.28 |
| **Average** | **0.59** |

## Setup

```bash
# Local
pip install openenv-core uvicorn fastapi
uvicorn server.app:app --port 8000

# Docker
docker build -t sre-env .
docker run -d -p 8000:8000 sre-env

# Inference
export OPENAI_API_KEY=sk-...
export ENV_BASE_URL=http://localhost:8000
python inference.py --all-tasks
```

## Quick Start

```python
import asyncio

from client import SREEnvClient
from models import SREAction

# Sync usage (simplest)
with SREEnvClient(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task_id="sre-easy-001")

    result = env.step(SREAction(action_type="query_alerts"))
    result = env.step(SREAction(
        action_type="query_logs",
        service="payment-service", log_level="ERROR", time_window_minutes=60))
    result = env.step(SREAction(
        action_type="query_metrics",
        metric_name="memory_usage"))

    result = env.step(SREAction(
        action_type="submit",
        root_cause_service="payment-service",
        root_cause_type="resource_exhaustion",
        affected_services=["payment-service", "api-gateway", "order-service"],
        severity="P2",
        recommended_action="Increase JVM heap memory limit to prevent OOM kills",
        confidence=0.95,
    ))
    print(f"Score: {result.observation.grader_score:.4f}")

# Async usage (for training loops)
async def main():
    async with SREEnvClient(base_url="http://localhost:8000") as env:
        result = await env.reset_async(task_id="sre-easy-001")
        result = await env.step_async(SREAction(action_type="query_alerts"))
        result = await env.step_async(SREAction(
            action_type="submit",
            root_cause_service="payment-service",
            root_cause_type="resource_exhaustion",
            affected_services=["payment-service", "api-gateway", "order-service"],
            severity="P2",
            recommended_action="Increase JVM heap memory limit",
            confidence=0.95,
        ))
        print(f"Score: {result.observation.grader_score:.4f}")

asyncio.run(main())
```

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Start episode (`task_id` or `difficulty`) |
| `/step` | POST | Execute action |
| `/state` | GET | Current state |
| `/schema` | GET | JSON schemas |
| `/ws` | WebSocket | Persistent session for training |
| `/web` | GET | Interactive web UI |
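
Clients in any language can drive an episode over plain HTTP. A stdlib-only sketch (the JSON field names are assumptions; check `/schema` for the authoritative shapes):

```python
import json
from urllib import request

BASE = "http://localhost:8000"

def post(path: str, payload: dict, base: str = BASE) -> dict:
    """POST a JSON body to the environment server and decode the JSON response."""
    req = request.Request(
        f"{base}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Usage (requires a running server):
#   post("/reset", {"task_id": "sre-easy-001"})
#   post("/step", {"action": {"action_type": "query_alerts"}})
```

For training loops, the `/ws` WebSocket session avoids per-step connection overhead; the HTTP endpoints are handier for debugging and one-off probes.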

## Project Structure

```
sre_env/
├── models.py            # Pydantic models
├── client.py            # WebSocket client
├── inference.py         # Baseline agent (OpenAI client)
├── openenv.yaml         # Spec manifest
├── pyproject.toml
├── Dockerfile
├── tasks/
│   └── scenarios.py     # 3 tasks + graders
└── server/
    ├── app.py           # FastAPI server
    └── sre_environment.py
```
|