Spaces:
Sleeping
title: ReleaseOps-Env
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
tags:
- openenv
- reinforcement-learning
- sre
- release-management
- benchmark
ReleaseOps-Env
A production-grade OpenEnv benchmark for evaluating whether AI agents can safely approve, canary, pause, or roll back risky software changes under incomplete information.
Agents act as SRE reviewers: investigate a proposed change, gather evidence, and submit a final decision. The environment rewards thorough investigation and correct decisions, and penalizes wasted steps and missed risks.
Setup
pip install -e ".[dev]"
# Seed the real incident database (requires GitHub PAT with public_repo scope)
GITHUB_TOKEN=<your_token> python3 scripts/seed_db.py
# Or run without a token β uses the 12 curated SRE incidents bundled in the repo
python3 scripts/seed_db.py
The incident database (data/incidents.db) is pre-seeded with 100+ real incidents from
GitHub Issues (prometheus/prometheus, kubernetes/kubernetes) and curated post-mortems
from companies including Cloudflare, Stripe, AWS, PagerDuty, and Discord. The
search_incidents tool queries this real SQLite database, not static JSON.
Running Locally
# Start the server
uvicorn server.app:app --port 7860
# In another terminal β run inference (requires MODEL_NAME + API key)
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.3-70B-Instruct"
export HF_TOKEN="hf_..."
export ENV_URL="http://localhost:7860"
python3 inference.py
# Or test locally without a server (no API key needed)
python3 local.py all --trace
Quick Start (API)
# Reset to a task
curl -X POST http://localhost:7860/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "easy_001"}'
# Take a step
curl -X POST http://localhost:7860/step \
-H "Content-Type: application/json" \
-d '{"action": {"action_type": "inspect_change", "section": "diff"}}'
# List tasks and schemas
curl http://localhost:7860/tasks
# Run deterministic baseline (no API key needed)
curl -X POST http://localhost:7860/baseline
Tasks
| Task | Difficulty | Optimal Decision | Description |
|---|---|---|---|
easy_001 |
Easy | request_changes |
Synchronous audit logging on payment hot path β obvious latency risk |
easy_002 |
Easy | request_changes |
Connection pool increase risks DB exhaustion β missing DBA approval |
medium_001 |
Medium | approve |
Backward-compatible DB index migration β all approvals in place |
medium_002 |
Medium | approve |
JWT HS256βRS256 migration β backward-compatible, all checks pass |
hard_001 |
Hard | request_changes |
Multi-service retry/concurrency change β requires live telemetry to detect payments-service degradation |
hard_002 |
Hard | block |
Rate limit removal from API gateway β requires telemetry to confirm traffic surge risk |
Action Space
| Action | Parameters | Description |
|---|---|---|
inspect_change |
section: diff|tests|approvals|files_changed |
Read the proposed change |
inspect_services |
service: name |
Check service health and SLA metrics |
inspect_dependencies |
β | View blast radius and dependency graph |
search_incidents |
keywords: list |
Search historical incident database |
check_policy |
β | Evaluate current rollout policy rules |
query_telemetry |
metric, service, window |
Query live metrics per rollout phase |
request_artifact |
artifact_type |
Fetch load tests, rollback plans, approvals |
control_rollout |
decision: start_canary|promote|pause|rollback |
Advance the rollout state machine |
submit_decision |
final_decision, reason_codes |
End the episode with a final verdict |
Observation Space
| Field | Type | Description |
|---|---|---|
task_id |
str | Current task identifier |
change_summary |
str | One-line description of the proposed change |
known_risk_signals |
list[RiskSignal] | Risks discovered so far (signal_id, severity, summary) |
last_tool_result |
ToolResult | Result of the last action taken |
allowed_actions |
list[str] | Actions valid in the current rollout phase |
rollout_phase |
str | precheck β canary β promoted | rolled_back |
time_remaining |
int | Steps remaining before timeout |
cumulative_reward |
float | Running reward total |
final_score |
float|null | Grader score strictly between 0 and 1 (set on terminal step) |
Grading Formula
score = 0.35 * evidence_coverage
+ 0.25 * risk_signal_discovery
+ 0.30 * decision_correctness
+ 0.10 * efficiency
- 0.30 * forbidden_penalty
Scores normalized to strict bounds (0, 1), i.e. [0.001, 0.999]. Fully deterministic β no LLM judge.
- evidence_coverage: fraction of required evidence sources the agent inspected
- risk_signal_discovery: fraction of required risk signals the environment emitted during the episode (objective β measures what the agent actually observed, not what strings it typed)
- decision_correctness: 1.0 for optimal decision, 0.5 for acceptable, 0.0 for wrong
- efficiency: peaks at 1.0 for 30β70% step usage, degrades toward 0 at extremes
Hard tasks require query_telemetry to discover critical pre-deployment anomalies. A rule-based
agent that skips telemetry inspection will score ~0.77 on hard tasks, while an agent that
queries live metrics across all affected services scores ~0.98. Easy/medium tasks are solvable
without telemetry.
Baseline Scores (Heuristic Agent)
| Task | Score | Decision |
|---|---|---|
| easy_001 | 0.983 | request_changes |
| easy_002 | 0.983 | request_changes |
| medium_001 | 0.983 | approve |
| medium_002 | 0.983 | approve |
| hard_001 | 0.773 | request_changes |
| hard_002 | 0.760 | block |
| Average | 0.911 |
The gap between easy (0.983) and hard (0.767) scores reflects genuine difficulty: hard tasks
require query_telemetry on multiple services to surface pre-deployment metric anomalies that
static diff/test inspection cannot reveal.
Heuristic baseline runs via curl -X POST http://localhost:7860/baseline β no LLM required.
Validator Parity Checks
openenv validate
python3 scripts/validator_parity_check.py
pytest -q
CI runs the same checks in .github/workflows/validator-parity.yml on every push/PR.
Rollout State Machine
precheck --start_canary--> canary --promote--> promoted [terminal]
|
rollback --> rolled_back [terminal]
submit_decision ends the episode from any phase.
Running Inference Script
export API_BASE_URL="https://openrouter.ai/api/v1" # or any OpenAI-compatible endpoint
export MODEL_NAME="meta-llama/llama-3.3-70b-instruct"
export OPENAI_API_KEY="sk-..." # or HF_TOKEN
export ENV_URL="https://your-space.hf.space"
python3 inference.py