---
title: OpenEnv-Sentinel
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
base_path: /web
---

# OpenEnv-Sentinel: SRE Incident Triage Environment

An OpenEnv environment that simulates SRE incident triage. An AI agent receives a degraded system state and must use diagnostic tools to identify the root cause and recommend a fix.

## Quick Start

```bash
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

Or with Docker:

```bash
docker build -t sentinel-env -f server/Dockerfile .
docker run -p 8000:8000 sentinel-env
```

## Action Space

```python
class SentinelAction(Action):
    tool_name: str    # Tool to invoke
    parameters: dict  # Tool-specific parameters
```

### Available Tools

| Tool | Parameters | Description |
|---|---|---|
| `query_logs` | `service`, `query`, `severity` | Search service logs |
| `query_metrics` | `service`, `metric` | Get time-series metrics (cpu/memory/error_rate/latency/connections) |
| `get_service_status` | `service` | Service health, uptime, errors |
| `get_dependency_map` | `service` (optional) | Service dependency graph |
| `consult_runbook` | `topic` | SOP/runbook lookup |
| `check_recent_changes` | `service` (optional) | Recent deployments/config changes |
| `submit_resolution` | `root_cause`, `affected_service`, `recommendation` | Submit final answer (ends episode) |

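
As an illustration, the tool table can be turned into a small client-side check that runs before an action is sent. This is a sketch rather than part of the environment: `TOOL_PARAMS` and `build_action` are hypothetical names, and treating every parameter not marked optional as required is an assumption.

```python
# Hypothetical client-side helper; not part of the OpenEnv-Sentinel package.
# Each tool maps to (required, optional) parameter names from the tool table.
# Assumption: parameters not marked "(optional)" are required.
TOOL_PARAMS = {
    "query_logs": ({"service", "query", "severity"}, set()),
    "query_metrics": ({"service", "metric"}, set()),
    "get_service_status": ({"service"}, set()),
    "get_dependency_map": (set(), {"service"}),
    "consult_runbook": ({"topic"}, set()),
    "check_recent_changes": (set(), {"service"}),
    "submit_resolution": (
        {"root_cause", "affected_service", "recommendation"},
        set(),
    ),
}


def build_action(tool_name: str, **parameters) -> dict:
    """Return a dict shaped like SentinelAction, or raise ValueError."""
    if tool_name not in TOOL_PARAMS:
        raise ValueError(f"unknown tool: {tool_name}")
    required, optional = TOOL_PARAMS[tool_name]
    given = set(parameters)
    missing = required - given
    unexpected = given - required - optional
    if missing or unexpected:
        raise ValueError(
            f"missing={sorted(missing)} unexpected={sorted(unexpected)}"
        )
    return {"tool_name": tool_name, "parameters": parameters}
```

For example, `build_action("get_service_status", service="payment-api")` yields a payload with both `SentinelAction` fields filled in, while a misspelled tool name fails locally instead of costing a step.
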
## Observation Space

```python
class SentinelObservation(Observation):
    incident_summary: str       # Alert description
    tool_output: str            # Result from last tool call
    available_tools: list[str]  # Available tool names
    step_number: int            # Current step (0-indexed)
    max_steps: int              # Episode limit (20)
    cumulative_reward: float    # Running reward total
    last_action_error: str      # Error message if action was invalid
    done: bool                  # Episode finished?
    reward: float | None        # Per-step reward
```

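
On the client side, an agent loop mostly needs to know when to stop and whether the last action was rejected. The sketch below mirrors the observation fields in a plain dataclass. `ObservationView` is a hypothetical name, and it assumes the observation arrives as a JSON object whose keys match the field names above, which this README does not confirm.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class ObservationView:
    """Hypothetical client-side mirror of the SentinelObservation fields."""

    incident_summary: str = ""
    tool_output: str = ""
    available_tools: list = field(default_factory=list)
    step_number: int = 0
    max_steps: int = 20
    cumulative_reward: float = 0.0
    last_action_error: str = ""
    done: bool = False
    reward: Optional[float] = None

    @classmethod
    def from_dict(cls, payload: dict) -> "ObservationView":
        # Keep only the keys this view knows about; ignore everything else.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})

    @property
    def should_stop(self) -> bool:
        # Assumption: the episode is over once `done` is set or the
        # step counter has reached the step limit.
        return self.done or self.step_number >= self.max_steps
```
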
## Tasks

### Task 1: The Smoking Gun (Easy)

**Alert:** payment-api returning HTTP 500 errors. Straightforward single-service crash with a clear root cause in logs and deploy history. Optimal: 2–3 tool calls.

### Task 2: The Upstream Culprit (Medium)

**Alert:** checkout-service p99 latency > 5 seconds. Requires tracing a dependency chain to find the real culprit (inventory-service OOM). Optimal: 4–6 tool calls.

### Task 3: The Cascading Failure (Hard)

**Alert:** Multiple services degraded simultaneously. A long-running analytics query exhausts the PostgreSQL connection pool, cascading through auth, user-profile, and notification services. Includes red herrings. Optimal: 6–10 tool calls.

## Scoring

Each task is scored 0.0–1.0 using deterministic keyword-based grading:

- **Root cause identification** (weighted by task)
- **Correct affected service** identification
- **Actionable recommendation**
- **Efficiency bonus** (fewer steps = higher score)
- **Destructive penalty** (recommending harmful actions = score deduction)

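
To make the five components concrete, here is one way a keyword-based grader of this kind could be put together. Everything specific in it is invented for illustration: the weights, the penalty size, the rubric keys, and the efficiency formula are assumptions, not the environment's actual grader.

```python
# Illustrative only: weights, penalty, and formula are made up and do not
# come from the environment's grader.
WEIGHTS = {"root_cause": 0.5, "service": 0.2, "recommendation": 0.2, "efficiency": 0.1}
DESTRUCTIVE_PENALTY = 0.3


def keyword_fraction(text: str, keywords: list[str]) -> float:
    """Fraction of keywords that appear in text (case-insensitive)."""
    lowered = text.lower()
    return sum(k.lower() in lowered for k in keywords) / len(keywords)


def grade(resolution: dict, rubric: dict, steps_used: int, max_steps: int = 20) -> float:
    """Combine the five components into a score clamped to [0.0, 1.0]."""
    score = WEIGHTS["root_cause"] * keyword_fraction(
        resolution["root_cause"], rubric["root_cause_keywords"]
    )
    if resolution["affected_service"] == rubric["affected_service"]:
        score += WEIGHTS["service"]
    score += WEIGHTS["recommendation"] * keyword_fraction(
        resolution["recommendation"], rubric["fix_keywords"]
    )
    # Fewer steps leave a larger share of the efficiency weight.
    score += WEIGHTS["efficiency"] * max(0.0, 1 - steps_used / max_steps)
    recommendation = resolution["recommendation"].lower()
    if any(bad.lower() in recommendation for bad in rubric["destructive_keywords"]):
        score -= DESTRUCTIVE_PENALTY
    return min(1.0, max(0.0, score))
```

Because the grader only looks for substrings, the same submission always gets the same score, which is what makes the grading deterministic.
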
Per-step rewards provide a partial-credit signal:

- Relevant tool call: +0.12
- Irrelevant tool call: −0.02
- Repeated call: −0.05
- Invalid action: −0.03
- Step cost: −0.01

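
A short worked example shows how these values could add up over an episode. It assumes the step cost is charged on every step in addition to the outcome-specific reward; the list above does not say whether that is how the environment combines them, and the outcome labels are hypothetical.

```python
# Per-step values copied from the list above.
STEP_REWARDS = {
    "relevant": 0.12,
    "irrelevant": -0.02,
    "repeated": -0.05,
    "invalid": -0.03,
}
STEP_COST = -0.01  # Assumption: applied to every step on top of the outcome reward.


def cumulative_reward(outcomes: list[str]) -> float:
    """Sum outcome reward plus step cost for each step, rounded to 2 decimals."""
    return round(sum(STEP_REWARDS[o] + STEP_COST for o in outcomes), 2)


# Two useful calls followed by one repeated call:
# (0.12 - 0.01) + (0.12 - 0.01) + (-0.05 - 0.01) = 0.16
print(cumulative_reward(["relevant", "relevant", "repeated"]))
```
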
## Running Inference

`inference.py` uses `OpenAI(base_url=...)`, so it works with HF Inference, OpenAI, and any OpenAI-compatible API.

```bash
# Environment server URL
export ENV_URL=http://localhost:8000

# LLM config (defaults to HF router)
export API_BASE_URL=https://router.huggingface.co/v1  # default, can omit
export MODEL_NAME=openai/gpt-oss-120b:novita          # default, can omit
export API_KEY=your-key                               # or HF_TOKEN or OPENAI_API_KEY

pip install openai websockets
python inference.py
```

Output:

```
Task 1: 0.85
Task 2: 0.65
Task 3: 0.40
Average: 0.63
```

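
The sample average is consistent with an unweighted mean of the three task scores: (0.85 + 0.65 + 0.40) / 3 ≈ 0.63. The sketch below reproduces that report format. `format_report` is a hypothetical helper, not a function from `inference.py`.

```python
def format_report(scores: dict[int, float]) -> str:
    """Format per-task scores plus their unweighted mean, two decimals each."""
    lines = [f"Task {task_id}: {score:.2f}" for task_id, score in sorted(scores.items())]
    average = sum(scores.values()) / len(scores)
    lines.append(f"Average: {average:.2f}")
    return "\n".join(lines)


print(format_report({1: 0.85, 2: 0.65, 3: 0.40}))
```
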
## Baseline Scores

| Task | GPT-4o (expected) | Open LLM (expected) |
|---|---|---|
| Task 1 (Easy) | 0.80–0.95 | 0.60–0.80 |
| Task 2 (Medium) | 0.60–0.80 | 0.40–0.60 |
| Task 3 (Hard) | 0.30–0.60 | 0.15–0.35 |

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Reset environment (`{"task_id": 1\|2\|3}`) |
| `/step` | POST | Execute action (`{"action": {...}}`) |
| `/state` | GET | Get current state |
| `/schema` | GET | JSON schemas for action/observation/state |
| `/ws` | WebSocket | Persistent session |

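
A minimal sketch of calling the POST endpoints with only the Python standard library follows. The request bodies use the shapes shown in the table, and the `action` object is assumed to carry the `tool_name` and `parameters` fields of `SentinelAction`. The response format is not documented here, so `send` simply returns the parsed JSON. `post_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def post_request(base_url: str, path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the environment server."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


BASE_URL = "http://localhost:8000"  # matches the Quick Start port
reset = post_request(BASE_URL, "/reset", {"task_id": 1})
step = post_request(
    BASE_URL,
    "/step",
    # Assumption: the action body mirrors the SentinelAction fields.
    {"action": {"tool_name": "get_service_status",
                "parameters": {"service": "payment-api"}}},
)

# With the server from Quick Start running:
# print(send(reset))
# print(send(step))
```

Whether episode state carries across separate HTTP requests is not stated in this README; the `/ws` endpoint is the one described as a persistent session, so a multi-step agent may need to use it instead.
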