title: OpenEnv-Sentinel
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
app_port: 8000
tags:
- openenv
base_path: /web
OpenEnv-Sentinel: SRE Incident Triage Environment
An OpenEnv environment that simulates SRE incident triage. An AI agent receives a degraded system state and must use diagnostic tools to identify the root cause and recommend a fix.
Quick Start
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
Or with Docker:
docker build -t sentinel-env -f server/Dockerfile .
docker run -p 8000:8000 sentinel-env
Action Space
class SentinelAction(Action):
    tool_name: str    # Tool to invoke
    parameters: dict  # Tool-specific parameters
Available Tools
| Tool | Parameters | Description |
|---|---|---|
| `query_logs` | `service`, `query`, `severity` | Search service logs |
| `query_metrics` | `service`, `metric` | Get time-series metrics (cpu/memory/error_rate/latency/connections) |
| `get_service_status` | `service` | Service health, uptime, errors |
| `get_dependency_map` | `service` (optional) | Service dependency graph |
| `consult_runbook` | `topic` | SOP/runbook lookup |
| `check_recent_changes` | `service` (optional) | Recent deployments/config changes |
| `submit_resolution` | `root_cause`, `affected_service`, `recommendation` | Submit final answer (ends episode) |
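As a rough illustration of how the tools above map onto the two action fields, the sketch below builds plain dictionaries with `tool_name` and `parameters` keys. The helper `make_action` and the sample values (for example the `"error"` severity) are assumptions made for illustration; the accepted values are defined by the environment, not by this README.

```python
from typing import Any

# Parameter names per tool, copied from the table above.
# "service" is optional for get_dependency_map and check_recent_changes.
TOOL_PARAMETERS: dict[str, tuple[str, ...]] = {
    "query_logs": ("service", "query", "severity"),
    "query_metrics": ("service", "metric"),
    "get_service_status": ("service",),
    "get_dependency_map": ("service",),
    "consult_runbook": ("topic",),
    "check_recent_changes": ("service",),
    "submit_resolution": ("root_cause", "affected_service", "recommendation"),
}


def make_action(tool_name: str, **parameters: Any) -> dict[str, Any]:
    """Build an action dict, rejecting unknown tools or parameter names.

    Hypothetical helper: it does not check which parameters are required.
    """
    if tool_name not in TOOL_PARAMETERS:
        raise ValueError(f"unknown tool: {tool_name}")
    unexpected = set(parameters) - set(TOOL_PARAMETERS[tool_name])
    if unexpected:
        raise ValueError(f"unexpected parameters for {tool_name}: {sorted(unexpected)}")
    return {"tool_name": tool_name, "parameters": parameters}


logs_action = make_action("query_logs", service="payment-api", query="error", severity="error")
deps_action = make_action("get_dependency_map")  # optional "service" omitted
```

Validating names on the client side is a design choice here: an invalid action costs reward, so catching typos before sending is cheap insurance.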
Observation Space
class SentinelObservation(Observation):
    incident_summary: str        # Alert description
    tool_output: str             # Result from last tool call
    available_tools: list[str]   # Available tool names
    step_number: int             # Current step (0-indexed)
    max_steps: int               # Episode limit (20)
    cumulative_reward: float     # Running reward total
    last_action_error: str       # Error message if action was invalid
    done: bool                   # Episode finished?
    reward: float | None         # Per-step reward
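To show how these fields might be consumed on the client side, here is a minimal sketch that mirrors them in a standalone dataclass. `ObservationView` and `from_payload` are hypothetical names, and the assumption that the observation arrives as a flat JSON object with exactly these keys is not confirmed by this README.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class ObservationView:
    """Client-side mirror of the observation fields (hypothetical helper)."""

    incident_summary: str = ""
    tool_output: str = ""
    available_tools: list[str] = field(default_factory=list)
    step_number: int = 0
    max_steps: int = 20
    cumulative_reward: float = 0.0
    last_action_error: str = ""
    done: bool = False
    reward: Optional[float] = None

    @classmethod
    def from_payload(cls, payload: dict[str, Any]) -> "ObservationView":
        """Keep only known keys so unexpected fields in the payload are ignored."""
        known = {f.name for f in fields(cls)}
        return cls(**{key: value for key, value in payload.items() if key in known})


obs = ObservationView.from_payload(
    {
        "incident_summary": "payment-api returning HTTP 500 errors",
        "step_number": 0,
        "done": False,
        "unknown_field": "ignored",  # dropped by from_payload
    }
)
```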
Tasks
Task 1: The Smoking Gun (Easy)
Alert: payment-api returning HTTP 500 errors. Straightforward single-service crash with a clear root cause in logs and deploy history. Optimal: 2–3 tool calls.
Task 2: The Upstream Culprit (Medium)
Alert: checkout-service p99 latency > 5 seconds. Requires tracing a dependency chain to find the real culprit (inventory-service OOM). Optimal: 4–6 tool calls.
Task 3: The Cascading Failure (Hard)
Alert: Multiple services degraded simultaneously. A long-running analytics query exhausts the PostgreSQL connection pool, cascading through auth, user-profile, and notification services. Includes red herrings. Optimal: 6–10 tool calls.
Scoring
Each task is scored from 0.0 to 1.0 using deterministic keyword-based grading:
- Root cause identification (weighted by task)
- Correct affected service identification
- Actionable recommendation
- Efficiency bonus (fewer steps = higher score)
- Destructive penalty (recommending harmful actions = score deduction)
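The grading components listed above could be combined along the following lines. This is only a sketch of keyword-based grading, not the environment's actual grader: the weights, keyword lists, penalty size, and the linear efficiency term are all invented for illustration, and the function names are hypothetical.

```python
def keyword_score(text: str, keywords: list[str]) -> float:
    """Fraction of keywords found in the text (case-insensitive substring match)."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    return sum(keyword.lower() in lowered for keyword in keywords) / len(keywords)


def grade(resolution: dict, rubric: dict, steps_used: int, max_steps: int = 20) -> float:
    """Combine the five components into a score clamped to [0.0, 1.0]."""
    score = (
        rubric["root_cause_weight"]
        * keyword_score(resolution["root_cause"], rubric["root_cause_keywords"])
        + rubric["service_weight"]
        * float(resolution["affected_service"] == rubric["affected_service"])
        + rubric["recommendation_weight"]
        * keyword_score(resolution["recommendation"], rubric["recommendation_keywords"])
        # Assumed efficiency term: unused steps earn a linear bonus.
        + rubric["efficiency_weight"] * (1 - steps_used / max_steps)
    )
    recommendation = resolution["recommendation"].lower()
    if any(bad in recommendation for bad in rubric["destructive_keywords"]):
        score -= rubric["destructive_penalty"]
    return max(0.0, min(1.0, score))


# Hypothetical rubric loosely modeled on Task 2 (inventory-service OOM).
example_rubric = {
    "root_cause_weight": 0.4,
    "root_cause_keywords": ["oom", "memory"],
    "service_weight": 0.2,
    "affected_service": "inventory-service",
    "recommendation_weight": 0.2,
    "recommendation_keywords": ["restart", "memory limit"],
    "efficiency_weight": 0.2,
    "destructive_keywords": ["drop database"],
    "destructive_penalty": 0.3,
}
example_resolution = {
    "root_cause": "inventory-service hit OOM after a memory leak",
    "affected_service": "inventory-service",
    "recommendation": "Restart inventory-service and raise its memory limit",
}
example_score = grade(example_resolution, example_rubric, steps_used=5)
```

Because every component is a pure function of the submitted strings and the step count, the same submission always yields the same score, which is what makes this style of grading deterministic.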
Per-step rewards provide partial credit signal:
- Relevant tool call: +0.12
- Irrelevant tool call: -0.02
- Repeated call: -0.05
- Invalid action: -0.03
- Step cost: -0.01
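A short worked example of how these values might accumulate over an episode follows. It assumes that the step cost is charged on every step in addition to the outcome reward; the README does not say whether that is the case, and the outcome labels and function name are hypothetical.

```python
# Per-outcome rewards taken from the list above.
STEP_REWARDS = {
    "relevant": 0.12,
    "irrelevant": -0.02,
    "repeated": -0.05,
    "invalid": -0.03,
}
STEP_COST = -0.01  # assumed to apply to every step on top of the outcome reward


def cumulative_reward(outcomes: list[str]) -> float:
    """Sum outcome rewards plus the per-step cost, rounded to two decimals."""
    return round(sum(STEP_REWARDS[outcome] + STEP_COST for outcome in outcomes), 2)


# Two relevant calls, one irrelevant call, one repeated call:
# (0.12 - 0.01) * 2 + (-0.02 - 0.01) + (-0.05 - 0.01) = 0.13
total = cumulative_reward(["relevant", "relevant", "irrelevant", "repeated"])
```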
Running Inference
The inference script uses OpenAI(base_url=...), so it works with HF Inference, OpenAI, and any other OpenAI-compatible API.
# Environment server URL
export ENV_URL=http://localhost:8000
# LLM config (defaults to HF router)
export API_BASE_URL=https://router.huggingface.co/v1 # default, can omit
export MODEL_NAME=openai/gpt-oss-120b:novita # default, can omit
export API_KEY=your-key # or HF_TOKEN or OPENAI_API_KEY
pip install openai websockets
python inference.py
Output:
Task 1: 0.85
Task 2: 0.65
Task 3: 0.40
Average: 0.63
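The environment variables shown earlier could be resolved roughly as follows. This is a sketch, not the code in `inference.py`: the function name `resolve_config`, the fallback value for `ENV_URL`, and the precedence order among `API_KEY`, `HF_TOKEN`, and `OPENAI_API_KEY` are assumptions.

```python
import os
from typing import Mapping, Optional

# Defaults as stated in the comments of the shell snippet.
DEFAULT_API_BASE_URL = "https://router.huggingface.co/v1"
DEFAULT_MODEL_NAME = "openai/gpt-oss-120b:novita"
# Assumed fallback; the README only shows this value as an example export.
DEFAULT_ENV_URL = "http://localhost:8000"


def resolve_config(env: Mapping[str, str] = os.environ) -> dict[str, Optional[str]]:
    """Read settings from an environment mapping, applying defaults where given."""
    # Assumed precedence: API_KEY first, then HF_TOKEN, then OPENAI_API_KEY.
    api_key = env.get("API_KEY") or env.get("HF_TOKEN") or env.get("OPENAI_API_KEY")
    return {
        "env_url": env.get("ENV_URL", DEFAULT_ENV_URL),
        "api_base_url": env.get("API_BASE_URL", DEFAULT_API_BASE_URL),
        "model_name": env.get("MODEL_NAME", DEFAULT_MODEL_NAME),
        "api_key": api_key,
    }


config = resolve_config({"HF_TOKEN": "hf_example_token"})
```

The resulting `api_base_url` and `api_key` would then be passed to `OpenAI(base_url=..., api_key=...)`.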
Baseline Scores
| Task | GPT-4o (expected) | Open LLM (expected) |
|---|---|---|
| Task 1 (Easy) | 0.80–0.95 | 0.60–0.80 |
| Task 2 (Medium) | 0.60–0.80 | 0.40–0.60 |
| Task 3 (Hard) | 0.30–0.60 | 0.15–0.35 |
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Reset environment (`{"task_id": N}`, where N is 1, 2, or 3) |
| `/step` | POST | Execute action (`{"action": {...}}`) |
| `/state` | GET | Get current state |
| `/schema` | GET | JSON schemas for action/observation/state |
| `/ws` | WebSocket | Persistent session |
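For a quick manual test of the HTTP endpoints, the requests can be built with the Python standard library alone. The sketch below assumes the server is reachable at `http://localhost:8000` and that the `action` object carries the `tool_name` and `parameters` fields from the action space; the helpers `build_post` and `send` are hypothetical, and the shape of the responses is not specified in this README.

```python
import json
import urllib.request


def build_post(base_url: str, path: str, payload: dict) -> urllib.request.Request:
    """Create a JSON POST request without sending it."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send a request and decode the JSON response body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


reset_request = build_post("http://localhost:8000", "/reset", {"task_id": 1})
step_request = build_post(
    "http://localhost:8000",
    "/step",
    {"action": {"tool_name": "get_service_status", "parameters": {"service": "payment-api"}}},
)

# With the server running, the calls would look like:
#   first_observation = send(reset_request)
#   next_observation = send(step_request)
```

Plain HTTP calls are convenient for one-off checks; for a full episode, the `/ws` endpoint is described as providing a persistent session.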