---
title: SRE Incident Response
emoji: "\U0001F6A8"
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
- openenv
pinned: false
---
# SRE Incident Response Environment
An OpenEnv-compatible reinforcement learning environment that simulates production incident response. AI agents must investigate a microservice architecture, diagnose root causes, and apply fixes, just like a real on-call SRE.
## Motivation
Every tech company has on-call rotations, yet there's no standardized benchmark for evaluating AI agents on incident response. This environment fills that gap by simulating realistic production incidents with:
- **Multi-service architectures** with dependency chains and cascading failures
- **Progressive information revelation**: agents must actively investigate (read logs, check metrics, trace requests)
- **Red herrings and misleading symptoms**: alerts point to symptoms, not root causes
- **Concurrent faults** in the hardest tier, testing whether agents can find multiple independent root causes
- **Realistic operational data**: 50+ log lines per service with noise, time-series metrics, distributed traces, deploy history, runbooks, and config diffs
## Service Architecture
All tasks share the same 7-service microservice architecture:
```
                        +--------------+
              +-------->| auth-service |<------+
              |         +------+-------+       |
              |                | depends       | depends
    +---------+------+  +------v------+  +-----+--------+
    |  api-gateway   |  | cache-redis |  | notification |
    | (entry point)  |  +-------------+  |   -service   |
    +-+----------+---+                   +--------------+
      |          |
      | depends  | depends
      v          v
+------------+  +-----------------+
|user-service|  |payment-service  |
+-----+------+  +--------+--------+
      | depends          | depends
      v                  v
+----------------------------+
|        db-postgres         |
+----------------------------+
```
Each service has: name, status (`HEALTHY`/`DEGRADED`/`DOWN`), version, replica count, dependencies, logs, metrics, traces, deploy history, config, and runbook data.
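
To make the dependency chain concrete, here is a minimal Python sketch of the graph as read off the diagram above. The `DEPENDS_ON` mapping and the `upstream_chain` helper are illustrative names that do not come from the repository, and the edges are one interpretation of the diagram rather than the environment's actual data model.

```python
from collections import deque

# Edges read off the architecture diagram: service -> services it depends on.
# Assumption: this mapping is illustrative, not the repository's data model.
DEPENDS_ON = {
    "api-gateway": ["auth-service", "user-service", "payment-service"],
    "auth-service": ["cache-redis"],
    "notification-service": ["auth-service"],
    "user-service": ["db-postgres"],
    "payment-service": ["db-postgres"],
    "cache-redis": [],
    "db-postgres": [],
}


def upstream_chain(service: str) -> list[str]:
    """Return every service that `service` depends on, directly or transitively."""
    seen: list[str] = []
    queue = deque(DEPENDS_ON.get(service, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(DEPENDS_ON.get(current, []))
    return seen


print(upstream_chain("notification-service"))
```

Walking the chain this way mirrors what an agent has to do in the medium task: start from a victim such as `payment-service` and follow its dependencies down to `db-postgres`.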
## Tasks
Tasks are auto-discovered from the `tasks/` directory. Each task is a self-contained Python file defining a `SCENARIO` object.
| Task ID | Name | Difficulty | Max Steps | Root Cause | Fix Required |
|---------|------|-----------|-----------|------------|--------------|
| `easy` | Single Service OOM Crash | Easy | 15 | `auth-service` (OOM) | `restart_service(auth-service)` |
| `medium` | Cascading Database Deadlock | Medium | 25 | `db-postgres` (deadlock) | `restart_service(db-postgres)` |
| `hard` | Concurrent Faults + Misleading Evidence | Hard | 35 | `payment-service` (bad deploy) AND `cache-redis` (memory leak) | `rollback_deploy(payment-service, v3.8.1)` AND `restart_service(cache-redis)` |
### Task Details
**Easy**: Alert directly names `auth-service` as down. Logs clearly show OOM crash cycle (heap growth, OOM kills, restart exhaustion). Single root cause, single fix.
**Medium**: Alerts blame `payment-service` and `user-service` (both are victims). The real cause is a long-running analytics query deadlocking `db-postgres`. Agent must notice "writes fail but reads work", follow dependency chain to the database, and read `db-postgres` logs to find the deadlock. Red herring: `cache-redis` miss ratio alert (benign TTL expiry).
**Hard**: Two independent faults at the same time: (1) `payment-service` has a bad deploy (v3.8.2, NullPointerException in new validator module), (2) `cache-redis` has a memory leak causing eviction storms that degrade `auth-service`. Red herrings: `user-service` config warnings (benign), `notification-service` queue backup (victim of auth-service). Agent must find and fix BOTH faults. After fixing only one, post-remediation check shows remaining services are still unhealthy.
### Adding New Tasks
To add a new task:
1. Create a new file in `tasks/` (e.g., `tasks/my_new_task.py`)
2. Define a `SCENARIO = IncidentScenario(task_id="my_new_task", ...)`; see existing task files for the template
3. Done. The task loader in `tasks/__init__.py` auto-discovers any `.py` file that exports a `SCENARIO` object.
No changes needed to the environment engine, grader, server, or inference script. The grader is generic: it reads ground truth (root cause services, required fixes, keywords, weights) from the scenario definition.
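
The auto-discovery step described above could work roughly as follows. This is a sketch under stated assumptions, not the code in `tasks/__init__.py`: the function name `discover_scenarios` is hypothetical, and it assumes each `SCENARIO` object exposes a `task_id` attribute that is used as the dictionary key.

```python
import importlib.util
from pathlib import Path


def discover_scenarios(tasks_dir: str) -> dict:
    """Import every .py file in tasks_dir and collect the ones exporting SCENARIO."""
    scenarios = {}
    for path in sorted(Path(tasks_dir).glob("*.py")):
        if path.name == "__init__.py":
            continue  # the loader itself is not a task
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        scenario = getattr(module, "SCENARIO", None)
        if scenario is not None:
            # Assumption: scenarios are keyed by their task_id, not by filename.
            scenarios[scenario.task_id] = scenario
    return scenarios
```

Keying by `task_id` rather than by filename would explain why `tasks/easy_oom.py` is addressed as `easy` in the API.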
## Project Structure
```
IncidentResponse_RL/
├── models.py              # Pydantic models: Action, Observation, State, enums
├── openenv.yaml           # OpenEnv manifest (tasks, models, runtime config)
├── requirements.txt       # Python dependencies
├── Dockerfile             # Container for HF Spaces deployment
├── inference.py           # Baseline agent using OpenAI client
├── README.md
│
├── env/                   # Core environment engine
│   ├── __init__.py
│   ├── scenario.py        # IncidentScenario, ServiceConfig, RequiredFix dataclasses
│   ├── environment.py     # step() / reset() / state() implementation
│   └── services.py        # Alert generation, dependency cascade, data formatting
│
├── tasks/                 # Task definitions (auto-discovered)
│   ├── __init__.py        # Auto-discovery loader → SCENARIOS dict
│   ├── easy_oom.py        # Easy: Single Service OOM Crash
│   ├── medium_deadlock.py # Medium: Cascading Database Deadlock
│   └── hard_concurrent.py # Hard: Concurrent Faults + Misleading Evidence
│
├── graders/               # Scoring engine
│   ├── __init__.py
│   └── grader.py          # Generic rubric-based grader (0.0-1.0)
│
└── server/                # FastAPI web server
    ├── __init__.py
    └── app.py             # /reset, /step, /state, /tasks endpoints
```
## Action Space
Each action is sent as a single JSON object with an `action_type` field. The remaining fields depend on the action type.
### Investigation Actions (read-only, gather information)
| Action | Required Fields | Returns |
|--------|----------------|---------|
| `read_logs` | `service` | 50+ timestamped log lines with noise and signal |
| `check_metrics` | `service` | Time-series table (CPU, memory, latency, error rate, etc.) |
| `ping_service` | `service` | Reachability check with latency |
| `check_dependencies` | `service` | Upstream dependency list with current health status |
| `inspect_deploy` | `service` | Deploy history (version, timestamp, status) |
| `query_traces` | `service` | Distributed trace spans showing latency breakdown |
| `check_runbook` | `service` | Operational runbook with troubleshooting steps |
| `diff_config` | `service` | Current vs previous config comparison |
### Remediation Actions (modify environment state)
| Action | Required Fields | Effect |
|--------|----------------|--------|
| `restart_service` | `service` | Restarts pods. Fixes OOM/leak issues. No effect if root cause is elsewhere. |
| `rollback_deploy` | `service`, `target_version` | Rolls back to specified version. Must match exact version string. |
| `scale_up` | `service`, `replicas` | Increases replica count. Can alleviate memory pressure. |
| `drain_traffic` | `service` | Stops routing traffic to the service. |
### Terminal Action
| Action | Required Fields | Effect |
|--------|----------------|--------|
| `submit_diagnosis` | `root_cause_service`, `root_cause_category`, `fix_description` | Ends episode, triggers grading. |
### Root Cause Categories
`oom_crash`, `db_deadlock`, `bad_deploy`, `memory_leak`, `network_partition`, `disk_full`, `config_error`, `cert_expiry`, `dns_failure`, `rate_limit`
### Example Actions
```json
{"action_type": "read_logs", "service": "auth-service"}
{"action_type": "check_metrics", "service": "db-postgres"}
{"action_type": "rollback_deploy", "service": "payment-service", "target_version": "v3.8.1"}
{"action_type": "submit_diagnosis", "root_cause_service": "db-postgres", "root_cause_category": "db_deadlock", "fix_description": "Restarted db-postgres to clear deadlock caused by analytics-cron query"}
```
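
The required-field tables above can be checked mechanically before an action is sent. The repository validates actions with Pydantic models in `models.py`; the sketch below is a standard-library stand-in, and the names `REQUIRED_FIELDS`, `ROOT_CAUSE_CATEGORIES`, and `validate_action` are hypothetical.

```python
# Required fields per action type, copied from the action tables above.
REQUIRED_FIELDS = {
    "read_logs": ("service",),
    "check_metrics": ("service",),
    "ping_service": ("service",),
    "check_dependencies": ("service",),
    "inspect_deploy": ("service",),
    "query_traces": ("service",),
    "check_runbook": ("service",),
    "diff_config": ("service",),
    "restart_service": ("service",),
    "rollback_deploy": ("service", "target_version"),
    "scale_up": ("service", "replicas"),
    "drain_traffic": ("service",),
    "submit_diagnosis": ("root_cause_service", "root_cause_category", "fix_description"),
}

ROOT_CAUSE_CATEGORIES = {
    "oom_crash", "db_deadlock", "bad_deploy", "memory_leak", "network_partition",
    "disk_full", "config_error", "cert_expiry", "dns_failure", "rate_limit",
}


def validate_action(action: dict) -> list[str]:
    """Return a list of problems; an empty list means the action looks well-formed."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        return [f"unknown action_type: {action_type!r}"]
    problems = [
        f"missing field: {field}"
        for field in REQUIRED_FIELDS[action_type]
        if field not in action
    ]
    category = action.get("root_cause_category")
    if action_type == "submit_diagnosis" and category is not None:
        if category not in ROOT_CAUSE_CATEGORIES:
            problems.append(f"unknown root_cause_category: {category!r}")
    return problems


print(validate_action({"action_type": "rollback_deploy", "service": "payment-service"}))
```

Running a check like this on the agent side avoids spending a step on a request the server would reject.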
## Observation Space
On `reset()`, the agent receives:
- **Service health dashboard**: all 7 services with status (`HEALTHY`/`DEGRADED`/`DOWN`), version, replica count
- **Active alerts**: severity-tagged alerts (SEV-1/SEV-2/SEV-3)
- **Incident summary**: text description of the situation
On each `step()`, the agent receives:
- **Updated service statuses**: health may change after remediation
- **Updated alerts**: alerts clear when services recover
- **Action result**: the data returned by the action (logs, metrics, traces, etc.)
- **Reward**: per-step reward signal
- **Done flag**: whether the episode has ended
- **Score**: final score (only on terminal step)
### Progressive Revelation
The agent does NOT see all data upfront. It must actively choose which services to investigate and which data to request. Each investigation action consumes a step, which creates planning pressure: the agent must balance information gathering against remediation within the step budget.
### Post-Remediation Feedback
After any remediation action, the observation includes a `[POST-REMEDIATION CHECK]` that lists which services are still unhealthy. This is critical for the hard task: after fixing only one of two faults, the check reveals remaining issues.
## Reward Function
### Per-Step Shaping
| Action | Reward |
|--------|--------|
| Investigating a root-cause service | +0.01 |
| Investigating a non-root-cause service | 0.00 |
| Correct remediation (matches required fix) | +0.05 |
| Wrong remediation (wrong service or wrong fix type) | -0.05 |
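
The shaping table can be read as a small function of the action and the scenario's ground truth. The sketch below is an illustration, not the grader's code: `step_reward` is a hypothetical name, and representing required fixes as `(action_type, service, target_version)` tuples is an assumption.

```python
INVESTIGATION_ACTIONS = {
    "read_logs", "check_metrics", "ping_service", "check_dependencies",
    "inspect_deploy", "query_traces", "check_runbook", "diff_config",
}
REMEDIATION_ACTIONS = {"restart_service", "rollback_deploy", "scale_up", "drain_traffic"}

# Ground truth for the hard task, taken from the task table.
# Assumption: fixes are modelled as (action_type, service, target_version) tuples.
HARD_ROOT_CAUSES = {"payment-service", "cache-redis"}
HARD_REQUIRED_FIXES = {
    ("rollback_deploy", "payment-service", "v3.8.1"),
    ("restart_service", "cache-redis", None),
}


def step_reward(action: dict, root_causes: set, required_fixes: set) -> float:
    """Per-step shaping reward following the table above."""
    kind = action["action_type"]
    if kind in INVESTIGATION_ACTIONS:
        return 0.01 if action.get("service") in root_causes else 0.0
    if kind in REMEDIATION_ACTIONS:
        fix = (kind, action.get("service"), action.get("target_version"))
        return 0.05 if fix in required_fixes else -0.05
    return 0.0  # submit_diagnosis is scored by the terminal grader instead


rollback = {"action_type": "rollback_deploy", "service": "payment-service", "target_version": "v3.8.1"}
print(step_reward(rollback, HARD_ROOT_CAUSES, HARD_REQUIRED_FIXES))
```

Because the version string is part of the tuple, a rollback to any version other than `v3.8.1` counts as a wrong remediation in this sketch, matching the "must match exact version string" rule.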
### Terminal Grading (0.0 - 1.0)
The grader is generic and rubric-based. Each task defines its own weights:
| Component | Easy | Medium | Hard |
|-----------|------|--------|------|
| Correct root cause service identified | 0.30 | 0.25 | 0.15 |
| Correct root cause category | 0.20 | 0.20 | 0.10 |
| Primary fix applied | 0.30 | 0.25 | 0.15 |
| Secondary fix(es) applied | -- | -- | 0.20 |
| Diagnosis text quality (keyword match) | 0.10 | 0.10 | 0.15 |
| Investigation thoroughness | 0.10 | 0.10 | 0.10 |
| Wrong remediation penalty | -0.03/ea | -0.05/ea | -0.05/ea |
**Diagnosis text scoring** uses deterministic keyword matching: the grader checks if the fix description mentions key terms (service names, fault types, fix actions). No LLM-based judging.
**Investigation thoroughness** checks whether the agent examined at least one root-cause service before submitting.
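
As a worked example of the rubric, the sketch below sums the weighted components for each task. The weights come from the table above; everything else is an assumption made for illustration: the names `WEIGHTS`, `keyword_fraction`, and `terminal_score`, the proportional credit for partially matched keywords, and the clamping of the result to the 0.0-1.0 range.

```python
# Weights copied from the terminal grading table ("penalty" is per wrong remediation).
WEIGHTS = {
    "easy": {"service": 0.30, "category": 0.20, "primary_fix": 0.30,
             "secondary_fix": 0.00, "keywords": 0.10, "thoroughness": 0.10, "penalty": 0.03},
    "medium": {"service": 0.25, "category": 0.20, "primary_fix": 0.25,
               "secondary_fix": 0.00, "keywords": 0.10, "thoroughness": 0.10, "penalty": 0.05},
    "hard": {"service": 0.15, "category": 0.10, "primary_fix": 0.15,
             "secondary_fix": 0.20, "keywords": 0.15, "thoroughness": 0.10, "penalty": 0.05},
}


def keyword_fraction(text: str, keywords: list[str]) -> float:
    """Fraction of keywords found in text (case-insensitive substring match)."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    return sum(1 for keyword in keywords if keyword.lower() in lowered) / len(keywords)


def terminal_score(task: str, *, service_ok: bool, category_ok: bool,
                   primary_fixed: bool, secondary_fixed: bool,
                   keyword_frac: float, investigated_root_cause: bool,
                   wrong_remediations: int = 0) -> float:
    """Weighted rubric sum minus penalties, clamped to [0.0, 1.0] (clamping is assumed)."""
    w = WEIGHTS[task]
    score = (
        w["service"] * service_ok
        + w["category"] * category_ok
        + w["primary_fix"] * primary_fixed
        + w["secondary_fix"] * secondary_fixed
        + w["keywords"] * keyword_frac
        + w["thoroughness"] * investigated_root_cause
        - w["penalty"] * wrong_remediations
    )
    return round(min(1.0, max(0.0, score)), 4)


perfect = dict(service_ok=True, category_ok=True, primary_fixed=True,
               secondary_fixed=True, keyword_frac=1.0, investigated_root_cause=True)
for task in WEIGHTS:
    print(task, terminal_score(task, **perfect))
```

With every component satisfied, the weights sum to 1.00 for easy, 0.90 for medium, and 0.85 for hard.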
## Setup
### Local Development
```bash
pip install -r requirements.txt
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```
### Docker
```bash
docker build -t sre-incident-response .
docker run -p 8000:8000 sre-incident-response
```
### API Usage
```bash
# List available tasks
curl http://localhost:8000/tasks
# Reset (start a new episode)
curl -X POST http://localhost:8000/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "easy"}'
# Step (take an action)
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"session_id": "<SESSION_ID>", "action": {"action_type": "read_logs", "service": "auth-service"}}'
# Get current episode state
curl http://localhost:8000/state/<SESSION_ID>
```
OpenEnv-prefixed endpoints are also available: `/openenv/reset`, `/openenv/step`, `/openenv/state/{session_id}`, `/openenv/tasks`.
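
The same calls can be made from Python with the standard library. This is a minimal sketch that assumes the server is running locally on port 8000; the helper names are hypothetical, and reading the session identifier from a `session_id` key in the `/reset` response is an assumption about the response shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_step_payload(session_id: str, action: dict) -> dict:
    """Wrap an action in the request body that /step expects."""
    return {"session_id": session_id, "action": action}


def post_json(path: str, payload: dict) -> dict:
    """POST a JSON body to the server and decode the JSON response."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_first_step(task_id: str = "easy") -> dict:
    """Start an episode and read the auth-service logs as the first action."""
    reset_result = post_json("/reset", {"task_id": task_id})
    # Assumption: the reset response carries the session identifier under "session_id".
    session_id = reset_result["session_id"]
    action = {"action_type": "read_logs", "service": "auth-service"}
    return post_json("/step", build_step_payload(session_id, action))
```

Calling `run_first_step()` requires a running server, so nothing is executed at import time.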
### Running Inference
```bash
export HF_TOKEN=your_token
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
python inference.py
```
The inference script runs a baseline LLM agent against all tasks, emitting structured stdout logs:
```
[START] task=easy env=sre_incident_response model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=read_logs(auth-service) reward=0.01 done=false error=null
[STEP] step=2 action=check_metrics(auth-service) reward=0.01 done=false error=null
[STEP] step=3 action=restart_service(auth-service) reward=0.05 done=false error=null
[STEP] step=4 action=submit_diagnosis reward=1.00 done=true error=null
[END] success=true steps=4 score=1.00 rewards=0.01,0.01,0.05,1.00
```
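
Because the log lines use a fixed `key=value` layout, they are easy to parse for later analysis. The sketch below handles only the `[STEP]` lines shown above; `STEP_PATTERN` and `parse_step_line` are hypothetical names, and the pattern is inferred from this sample output rather than from a format specification.

```python
import re

# Pattern inferred from the sample [STEP] lines above (an assumption, not a spec).
STEP_PATTERN = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>.+?) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) error=(?P<error>.*)$"
)


def parse_step_line(line: str):
    """Parse one [STEP] log line into a dict, or return None for other lines."""
    match = STEP_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        "step": int(match["step"]),
        "action": match["action"],
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        "error": None if match["error"] == "null" else match["error"],
    }


print(parse_step_line(
    "[STEP] step=3 action=restart_service(auth-service) reward=0.05 done=false error=null"
))
```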
## Baseline Scores
| Task | Expected Baseline Score Range | Perfect-Agent Score |
|------|------------------------------|---------------------|
| easy | 0.70 - 0.95 | 1.00 |
| medium | 0.40 - 0.75 | 0.90 |
| hard | 0.20 - 0.55 | 0.85 |
## Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
| `HF_TOKEN` | Hugging Face API token | Required |
| `PORT` | Server port | `8000` |
| `SRE_TASKS` | Comma-separated task IDs to run in inference | `easy,medium,hard` |
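
A sketch of how the inference-side variables and their defaults might be read follows. The function name `load_inference_config` is hypothetical and the actual handling in `inference.py` may differ; only the variable names and defaults are taken from the table above.

```python
import os


def load_inference_config(environ=os.environ) -> dict:
    """Read the inference settings, applying the defaults from the table above."""
    token = environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is required")
    raw_tasks = environ.get("SRE_TASKS", "easy,medium,hard")
    return {
        "api_base_url": environ.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        "model_name": environ.get("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct"),
        "hf_token": token,
        # Split the comma-separated list and drop empty entries.
        "tasks": [task.strip() for task in raw_tasks.split(",") if task.strip()],
    }
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise without touching the real environment.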
## OpenEnv Spec Compliance
- `openenv.yaml` with metadata, task definitions, typed models, and runtime config
- `step(action)` returns observation, reward, done, info
- `reset()` returns initial observation
- `state()` returns current episode metadata
- Typed Pydantic models for Action, Observation, and State
- 3 tasks with programmatic graders (easy, medium, hard)
- Scores in 0.0-1.0 range with partial progress signals
- Working Dockerfile for containerized execution
- Baseline inference script (`inference.py`) with reproducible scores