---
title: OpenEnv-Sentinel
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
base_path: /web
---

# OpenEnv-Sentinel: SRE Incident Triage Environment

An OpenEnv environment that simulates SRE incident triage. An AI agent receives a degraded system state and must use diagnostic tools to identify the root cause and recommend a fix.

## Quick Start

```bash
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

Or with Docker:

```bash
docker build -t sentinel-env -f server/Dockerfile .
docker run -p 8000:8000 sentinel-env
```

## Action Space

```python
class SentinelAction(Action):
    tool_name: str    # Tool to invoke
    parameters: dict  # Tool-specific parameters
```

### Available Tools

| Tool | Parameters | Description |
|---|---|---|
| `query_logs` | `service`, `query`, `severity` | Search service logs |
| `query_metrics` | `service`, `metric` | Get time-series metrics (cpu/memory/error_rate/latency/connections) |
| `get_service_status` | `service` | Service health, uptime, errors |
| `get_dependency_map` | `service` (optional) | Service dependency graph |
| `consult_runbook` | `topic` | SOP/runbook lookup |
| `check_recent_changes` | `service` (optional) | Recent deployments/config changes |
| `submit_resolution` | `root_cause`, `affected_service`, `recommendation` | Submit final answer (ends episode) |

## Observation Space

```python
class SentinelObservation(Observation):
    incident_summary: str        # Alert description
    tool_output: str             # Result from last tool call
    available_tools: list[str]   # Available tool names
    step_number: int             # Current step (0-indexed)
    max_steps: int               # Episode limit (20)
    cumulative_reward: float     # Running reward total
    last_action_error: str       # Error message if action was invalid
    done: bool                   # Episode finished?
    reward: float | None         # Per-step reward
```

## Tasks

### Task 1 — The Smoking Gun (Easy)

**Alert:** payment-api returning HTTP 500 errors.
Straightforward single-service crash with a clear root cause in logs and deploy history. Optimal: 2–3 tool calls.

### Task 2 — The Upstream Culprit (Medium)

**Alert:** checkout-service p99 latency > 5 seconds.
Requires tracing a dependency chain to find the real culprit (inventory-service OOM). Optimal: 4–6 tool calls.

### Task 3 — The Cascading Failure (Hard)

**Alert:** Multiple services degraded simultaneously.
A long-running analytics query exhausts the PostgreSQL connection pool, cascading through auth, user-profile, and notification services. Includes red herrings. Optimal: 6–10 tool calls.

## Scoring

Each task is scored 0.0–1.0 using deterministic keyword-based grading:

- **Root cause identification** (weighted by task)
- **Correct affected service** identification
- **Actionable recommendation**
- **Efficiency bonus** (fewer steps = higher score)
- **Destructive penalty** (recommending harmful actions = score deduction)

Per-step rewards provide a partial-credit signal:

- Relevant tool call: +0.12
- Irrelevant tool call: −0.02
- Repeated call: −0.05
- Invalid action: −0.03
- Step cost: −0.01

## Running Inference

`inference.py` uses `OpenAI(base_url=...)`, so it is compatible with HF Inference, OpenAI, and any OpenAI-compatible API.
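As a rough illustration of how a script could turn the environment variables documented in this section into arguments for `OpenAI(base_url=..., api_key=...)`, here is a minimal sketch. The helper name `resolve_llm_config` and the key precedence (`API_KEY`, then `HF_TOKEN`, then `OPENAI_API_KEY`) are assumptions for illustration; the actual `inference.py` may resolve them differently.

```python
import os

# Defaults as documented in this README.
DEFAULT_API_BASE_URL = "https://router.huggingface.co/v1"
DEFAULT_MODEL_NAME = "openai/gpt-oss-120b:novita"


def resolve_llm_config(env=None):
    """Hypothetical helper: read LLM settings from a mapping of env vars."""
    env = os.environ if env is None else env
    # Assumed precedence: API_KEY, then HF_TOKEN, then OPENAI_API_KEY.
    api_key = env.get("API_KEY") or env.get("HF_TOKEN") or env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set API_KEY, HF_TOKEN, or OPENAI_API_KEY")
    return {
        "base_url": env.get("API_BASE_URL") or DEFAULT_API_BASE_URL,
        "model": env.get("MODEL_NAME") or DEFAULT_MODEL_NAME,
        "api_key": api_key,
    }


if __name__ == "__main__":
    config = resolve_llm_config({"HF_TOKEN": "hf_example"})
    print(config["base_url"], config["model"])
    # With the openai package installed, a client could then be built as:
    # client = OpenAI(base_url=config["base_url"], api_key=config["api_key"])
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test without touching the real environment.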
```bash
# Environment server URL
export ENV_URL=http://localhost:8000

# LLM config (defaults to HF router)
export API_BASE_URL=https://router.huggingface.co/v1  # default, can omit
export MODEL_NAME=openai/gpt-oss-120b:novita          # default, can omit
export API_KEY=your-key                               # or HF_TOKEN or OPENAI_API_KEY

pip install openai websockets
python inference.py
```

Output:

```
Task 1: 0.85
Task 2: 0.65
Task 3: 0.40
Average: 0.63
```

## Baseline Scores

| Task | GPT-4o (expected) | Open LLM (expected) |
|---|---|---|
| Task 1 (Easy) | 0.80–0.95 | 0.60–0.80 |
| Task 2 (Medium) | 0.60–0.80 | 0.40–0.60 |
| Task 3 (Hard) | 0.30–0.60 | 0.15–0.35 |

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Reset environment (`{"task_id": 1\|2\|3}`) |
| `/step` | POST | Execute action (`{"action": {...}}`) |
| `/state` | GET | Get current state |
| `/schema` | GET | JSON schemas for action/observation/state |
| `/ws` | WebSocket | Persistent session |
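
The request bodies for `/reset` and `/step` can be sketched with the standard library alone. This is an illustration under assumptions, not the project's client: the helper names are hypothetical, the action object is assumed to mirror the `SentinelAction` fields (`tool_name`, `parameters`), and the parameter values are made up. The README does not say whether state persists across plain HTTP calls; `/ws` is the endpoint listed as a persistent session.

```python
import json
import urllib.request

ENV_URL = "http://localhost:8000"  # same default as the ENV_URL export


def _json_post(url, payload):
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_reset_request(task_id, env_url=ENV_URL):
    """Hypothetical helper: body is {"task_id": 1|2|3} per the endpoint table."""
    if task_id not in (1, 2, 3):
        raise ValueError("task_id must be 1, 2, or 3")
    return _json_post(f"{env_url}/reset", {"task_id": task_id})


def build_step_request(tool_name, parameters=None, env_url=ENV_URL):
    """Hypothetical helper: wraps the action as {"action": {...}}."""
    # Assumption: the action mirrors SentinelAction's two fields.
    action = {"tool_name": tool_name, "parameters": parameters or {}}
    return _json_post(f"{env_url}/step", {"action": action})


if __name__ == "__main__":
    req = build_step_request("get_service_status", {"service": "payment-api"})
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # With the server running, send it with: urllib.request.urlopen(req)
```

Building the request separately from sending it makes the payload shape easy to inspect before pointing it at a live server.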