	---
	title: OpenEnv-Sentinel
	emoji: 🚨
	colorFrom: red
	colorTo: yellow
	sdk: docker
	pinned: false
	app_port: 8000
	tags:
	- openenv
	base_path: /web
	---

	# OpenEnv-Sentinel: SRE Incident Triage Environment

	An OpenEnv environment that simulates SRE incident triage. An AI agent receives a degraded system state and must use diagnostic tools to identify the root cause and recommend a fix.

	## Quick Start

	```bash
	pip install -e .
	uvicorn server.app:app --host 0.0.0.0 --port 8000
	```

	Or with Docker:

	```bash
	docker build -t sentinel-env -f server/Dockerfile .
	docker run -p 8000:8000 sentinel-env
	```

	## Action Space

	```python
	class SentinelAction(Action):
	    tool_name: str    # Tool to invoke
	    parameters: dict  # Tool-specific parameters
	```

	### Available Tools

	| Tool | Parameters | Description |
	|---|---|---|
	| `query_logs` | `service`, `query`, `severity` | Search service logs |
	| `query_metrics` | `service`, `metric` | Get time-series metrics (cpu/memory/error_rate/latency/connections) |
	| `get_service_status` | `service` | Service health, uptime, errors |
	| `get_dependency_map` | `service` (optional) | Service dependency graph |
	| `consult_runbook` | `topic` | SOP/runbook lookup |
	| `check_recent_changes` | `service` (optional) | Recent deployments/config changes |
	| `submit_resolution` | `root_cause`, `affected_service`, `recommendation` | Submit final answer (ends episode) |
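As a rough illustration of how a client might assemble actions for these tools, the sketch below builds an action dict and rejects tool names or parameter names that the table does not list. The helper `make_action` and the `ALLOWED_PARAMS` mapping are hypothetical and not part of the environment's code, and the server may validate differently.

```python
# Parameter names per tool, transcribed from the table above.
# ALLOWED_PARAMS and make_action are illustrative helpers, not the env's API.
ALLOWED_PARAMS = {
    "query_logs": {"service", "query", "severity"},
    "query_metrics": {"service", "metric"},
    "get_service_status": {"service"},
    "get_dependency_map": {"service"},
    "consult_runbook": {"topic"},
    "check_recent_changes": {"service"},
    "submit_resolution": {"root_cause", "affected_service", "recommendation"},
}


def make_action(tool_name: str, **parameters) -> dict:
    """Return an action dict, raising ValueError on unknown tools or parameters."""
    if tool_name not in ALLOWED_PARAMS:
        raise ValueError(f"unknown tool: {tool_name}")
    unknown = set(parameters) - ALLOWED_PARAMS[tool_name]
    if unknown:
        raise ValueError(f"unknown parameters for {tool_name}: {sorted(unknown)}")
    return {"tool_name": tool_name, "parameters": parameters}


action = make_action("query_logs", service="payment-api", severity="error")
```

Checking names on the client side lets an agent avoid spending a step on an action that the environment would reject as invalid.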

	## Observation Space

	```python
	class SentinelObservation(Observation):
	    incident_summary: str       # Alert description
	    tool_output: str            # Result from last tool call
	    available_tools: list[str]  # Available tool names
	    step_number: int            # Current step (0-indexed)
	    max_steps: int              # Episode limit (20)
	    cumulative_reward: float    # Running reward total
	    last_action_error: str      # Error message if action was invalid
	    done: bool                  # Episode finished?
	    reward: float | None        # Per-step reward
	```
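One way an agent loop might consume these fields is to render each observation as a short text prompt for the LLM. The sketch below assumes the observation arrives as a plain dict keyed by the field names above. The function `format_observation` is a hypothetical helper and is not taken from `inference.py`.

```python
def format_observation(obs: dict) -> str:
    """Render an observation dict as a compact text prompt for an LLM agent."""
    lines = [
        f"Incident: {obs['incident_summary']}",
        # step_number is 0-indexed, so add 1 for a human-readable counter.
        f"Step {obs['step_number'] + 1} of {obs['max_steps']}",
        f"Cumulative reward: {obs['cumulative_reward']:.2f}",
        f"Available tools: {', '.join(obs['available_tools'])}",
    ]
    # Only surface the error and tool output when they are non-empty.
    if obs.get("last_action_error"):
        lines.append(f"Last action error: {obs['last_action_error']}")
    if obs.get("tool_output"):
        lines.append("Tool output:\n" + obs["tool_output"])
    return "\n".join(lines)


example_obs = {
    "incident_summary": "payment-api returning HTTP 500 errors",
    "tool_output": "",
    "available_tools": ["query_logs", "submit_resolution"],
    "step_number": 0,
    "max_steps": 20,
    "cumulative_reward": 0.0,
    "last_action_error": "",
    "done": False,
    "reward": None,
}
prompt = format_observation(example_obs)
```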

	## Tasks

	### Task 1 — The Smoking Gun (Easy)
	Alert: payment-api returning HTTP 500 errors. Straightforward single-service crash with a clear root cause in logs and deploy history. Optimal: 2–3 tool calls.

	### Task 2 — The Upstream Culprit (Medium)
	Alert: checkout-service p99 latency > 5 seconds. Requires tracing a dependency chain to find the real culprit (inventory-service OOM). Optimal: 4–6 tool calls.

	### Task 3 — The Cascading Failure (Hard)
	Alert: Multiple services degraded simultaneously. A long-running analytics query exhausts the PostgreSQL connection pool, cascading through auth, user-profile, and notification services. Includes red herrings. Optimal: 6–10 tool calls.

	## Scoring

	Each task is scored 0.0–1.0 using deterministic keyword-based grading:
	- Root cause identification (weighted by task)
	- Correct affected service identification
	- Actionable recommendation
	- Efficiency bonus (fewer steps = higher score)
	- Destructive penalty (recommending harmful actions = score deduction)
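
The components above could be combined roughly as in the following sketch of a deterministic keyword grader. The weights, keywords, penalty size, and the linear efficiency bonus are all illustrative assumptions, and the environment's actual rubric is not documented here.

```python
def keyword_fraction(text: str, keywords: list[str]) -> float:
    """Fraction of keywords that appear (case-insensitively) in text."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    return sum(1 for kw in keywords if kw.lower() in lowered) / len(keywords)


def grade(resolution: dict, rubric: dict, steps_used: int, max_steps: int = 20) -> float:
    """Score a submitted resolution in [0.0, 1.0] against a keyword rubric."""
    score = rubric["root_cause_weight"] * keyword_fraction(
        resolution["root_cause"], rubric["root_cause_keywords"])
    if resolution["affected_service"].strip().lower() == rubric["affected_service"]:
        score += rubric["service_weight"]
    score += rubric["recommendation_weight"] * keyword_fraction(
        resolution["recommendation"], rubric["recommendation_keywords"])
    # Assumed efficiency bonus: shrinks linearly as more steps are used.
    score += rubric["efficiency_weight"] * max(0.0, 1 - steps_used / max_steps)
    # Assumed flat deduction if the recommendation mentions a destructive action.
    recommendation = resolution["recommendation"].lower()
    if any(bad in recommendation for bad in rubric["destructive_keywords"]):
        score -= rubric["destructive_penalty"]
    return min(1.0, max(0.0, score))


# Hypothetical rubric loosely modeled on Task 2 (inventory-service OOM).
EXAMPLE_RUBRIC = {
    "root_cause_weight": 0.4, "root_cause_keywords": ["oom", "memory"],
    "service_weight": 0.2, "affected_service": "inventory-service",
    "recommendation_weight": 0.2, "recommendation_keywords": ["restart", "memory limit"],
    "efficiency_weight": 0.2,
    "destructive_keywords": ["drop database", "rm -rf", "delete all"],
    "destructive_penalty": 0.3,
}
```

Because the grader relies only on substring matching, the same submission always receives the same score.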

	Per-step rewards provide a partial-credit signal during the episode:
	- Relevant tool call: +0.12
	- Irrelevant tool call: −0.02
	- Repeated call: −0.05
	- Invalid action: −0.03
	- Step cost: −0.01
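
To make the arithmetic concrete, the sketch below tallies a running reward from a sequence of step outcomes. It assumes the step cost is charged on every step on top of the outcome reward. That combination rule is an assumption, and the outcome labels are hypothetical names.

```python
# Reward values are taken from the list above; the labels are illustrative.
STEP_REWARDS = {
    "relevant": 0.12,
    "irrelevant": -0.02,
    "repeated": -0.05,
    "invalid": -0.03,
}
STEP_COST = -0.01  # assumed to apply to every step in addition to the outcome


def cumulative_reward(outcomes: list[str]) -> float:
    """Sum outcome rewards plus the per-step cost, rounded to two decimals."""
    total = sum(STEP_REWARDS[outcome] + STEP_COST for outcome in outcomes)
    return round(total, 2)


# Three relevant calls and one repeated call:
# 3 * (0.12 - 0.01) + (-0.05 - 0.01) = 0.27
episode_total = cumulative_reward(["relevant", "relevant", "repeated", "relevant"])
```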

	## Running Inference

	`inference.py` uses `OpenAI(base_url=...)`, so it works with HF Inference, OpenAI, and any
	OpenAI-compatible API.

	```bash
	# Environment server URL
	export ENV_URL=http://localhost:8000

	# LLM config (defaults to HF router)
	export API_BASE_URL=https://router.huggingface.co/v1 # default, can omit
	export MODEL_NAME=openai/gpt-oss-120b:novita # default, can omit
	export API_KEY=your-key # or HF_TOKEN or OPENAI_API_KEY

	pip install openai websockets
	python inference.py
	```

	Example output:
	```
	Task 1: 0.85
	Task 2: 0.65
	Task 3: 0.40
	Average: 0.63
	```
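
The environment variables above might be resolved along the lines of the sketch below. The lookup order for the key (`API_KEY`, then `HF_TOKEN`, then `OPENAI_API_KEY`) and the fallback for `ENV_URL` are assumptions. The function `load_llm_config` is a hypothetical helper, not the code in `inference.py`.

```python
import os

# Defaults as stated in the shell snippet above.
DEFAULT_API_BASE_URL = "https://router.huggingface.co/v1"
DEFAULT_MODEL_NAME = "openai/gpt-oss-120b:novita"


def load_llm_config(env=None) -> dict:
    """Resolve server URL, LLM endpoint, model, and key from environment variables."""
    env = os.environ if env is None else env
    # Assumed precedence: API_KEY first, then HF_TOKEN, then OPENAI_API_KEY.
    api_key = env.get("API_KEY") or env.get("HF_TOKEN") or env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("set API_KEY, HF_TOKEN, or OPENAI_API_KEY")
    return {
        # Falling back to localhost is an assumption, not documented behavior.
        "env_url": env.get("ENV_URL", "http://localhost:8000"),
        "api_base_url": env.get("API_BASE_URL", DEFAULT_API_BASE_URL),
        "model_name": env.get("MODEL_NAME", DEFAULT_MODEL_NAME),
        "api_key": api_key,
    }
```

The resulting values can then be passed to the client, for example `OpenAI(base_url=cfg["api_base_url"], api_key=cfg["api_key"])`.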

	## Baseline Scores

	| Task | GPT-4o (expected) | Open LLM (expected) |
	|---|---|---|
	| Task 1 (Easy) | 0.80–0.95 | 0.60–0.80 |
	| Task 2 (Medium) | 0.60–0.80 | 0.40–0.60 |
	| Task 3 (Hard) | 0.30–0.60 | 0.15–0.35 |

	## API Endpoints

	| Endpoint | Method | Description |
	|---|---|---|
	| `/health` | GET | Health check |
	| `/reset` | POST | Reset environment (`{"task_id": 1\|2\|3}`) |
	| `/step` | POST | Execute action (`{"action": {...}}`) |
	| `/state` | GET | Get current state |
	| `/schema` | GET | JSON schemas for action/observation/state |
	| `/ws` | WebSocket | Persistent session |
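
A minimal sketch of calling the HTTP endpoints with only the standard library follows. The request bodies use the shapes shown in the table. The response format is not documented here, so the helpers return the decoded JSON as-is. `build_request`, `call`, and `run_smoke_test` are hypothetical names.

```python
import json
import urllib.request


def build_request(base_url: str, path: str, payload=None) -> urllib.request.Request:
    """Build a GET request, or a JSON POST request when a payload is given."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="GET" if payload is None else "POST",
    )


def call(base_url: str, path: str, payload=None, timeout: float = 10.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(base_url, path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_smoke_test(base_url: str = "http://localhost:8000"):
    """Check health, reset to Task 1, and take one step (needs a running server)."""
    call(base_url, "/health")
    call(base_url, "/reset", {"task_id": 1})
    action = {"tool_name": "query_logs", "parameters": {"service": "payment-api"}}
    return call(base_url, "/step", {"action": action})
```

With the server from Quick Start running, `run_smoke_test()` exercises `/health`, `/reset`, and `/step` in sequence.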