---
title: ReleaseOps-Env
emoji: 🚀
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
tags:
  - openenv
  - reinforcement-learning
  - sre
  - release-management
  - benchmark
---

# ReleaseOps-Env

A production-grade OpenEnv benchmark for evaluating whether AI agents can safely approve, canary, pause, or roll back risky software changes under incomplete information.

Agents act as SRE reviewers: investigate a proposed change, gather evidence, and submit a final decision. The environment rewards thorough investigation and correct decisions, and penalizes wasted steps and missed risks.

## Setup

```bash
pip install -e ".[dev]"

# Seed the real incident database (requires GitHub PAT with public_repo scope)
GITHUB_TOKEN= python3 scripts/seed_db.py

# Or run without a token: uses the 12 curated SRE incidents bundled in the repo
python3 scripts/seed_db.py
```

The incident database (`data/incidents.db`) is pre-seeded with 100+ real incidents from GitHub Issues (prometheus/prometheus, kubernetes/kubernetes) and curated post-mortems from companies including Cloudflare, Stripe, AWS, PagerDuty, and Discord. The `search_incidents` tool queries this real SQLite database, not static JSON.

## Running Locally

```bash
# Start the server
uvicorn server.app:app --port 7860

# In another terminal: run inference (requires MODEL_NAME + API key)
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.3-70B-Instruct"
export HF_TOKEN="hf_..."
export ENV_URL="http://localhost:7860"
python3 inference.py

# Or test locally without a server (no API key needed)
python3 local.py all --trace
```

## Quick Start (API)

```bash
# Reset to a task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "easy_001"}'

# Take a step
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "inspect_change", "section": "diff"}}'

# List tasks and schemas
curl http://localhost:7860/tasks

# Run deterministic baseline (no API key needed)
curl -X POST http://localhost:7860/baseline
```

## Tasks

| Task | Difficulty | Optimal Decision | Description |
|------|-----------|-----------------|-------------|
| `easy_001` | Easy | `request_changes` | Synchronous audit logging on payment hot path; obvious latency risk |
| `easy_002` | Easy | `request_changes` | Connection pool increase risks DB exhaustion; missing DBA approval |
| `medium_001` | Medium | `approve` | Backward-compatible DB index migration; all approvals in place |
| `medium_002` | Medium | `approve` | JWT HS256→RS256 migration; backward-compatible, all checks pass |
| `hard_001` | Hard | `request_changes` | Multi-service retry/concurrency change; requires live telemetry to detect payments-service degradation |
| `hard_002` | Hard | `block` | Rate limit removal from API gateway; requires telemetry to confirm traffic surge risk |

## Action Space

| Action | Parameters | Description |
|--------|-----------|-------------|
| `inspect_change` | `section`: diff\|tests\|approvals\|files_changed | Read the proposed change |
| `inspect_services` | `service`: name | Check service health and SLA metrics |
| `inspect_dependencies` | none | View blast radius and dependency graph |
| `search_incidents` | `keywords`: list | Search historical incident database |
| `check_policy` | none | Evaluate current rollout policy rules |
| `query_telemetry` | `metric`, `service`, `window` | Query live metrics per rollout phase |
| `request_artifact` | `artifact_type` | Fetch load tests, rollback plans, approvals |
| `control_rollout` | `decision`: start_canary\|promote\|pause\|rollback | Advance the rollout state machine |
| `submit_decision` | `final_decision`, `reason_codes` | End the episode with a final verdict |

## Observation Space

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | str | Current task identifier |
| `change_summary` | str | One-line description of the proposed change |
| `known_risk_signals` | list[RiskSignal] | Risks discovered so far (signal_id, severity, summary) |
| `last_tool_result` | ToolResult | Result of the last action taken |
| `allowed_actions` | list[str] | Actions valid in the current rollout phase |
| `rollout_phase` | str | precheck → canary → promoted \| rolled_back |
| `time_remaining` | int | Steps remaining before timeout |
| `cumulative_reward` | float | Running reward total |
| `final_score` | float\|null | Grader score strictly between 0 and 1 (set on terminal step) |

## Grading Formula

```
score = 0.35 * evidence_coverage
      + 0.25 * risk_signal_discovery
      + 0.30 * decision_correctness
      + 0.10 * efficiency
      - 0.30 * forbidden_penalty
```

Scores are normalized to lie strictly inside (0, 1), i.e. within [0.001, 0.999]. Grading is fully deterministic; there is no LLM judge.

- **evidence_coverage**: fraction of required evidence sources the agent inspected
- **risk_signal_discovery**: fraction of required risk signals the environment emitted during the episode (objective: measures what the agent actually observed, not what strings it typed)
- **decision_correctness**: 1.0 for optimal decision, 0.5 for acceptable, 0.0 for wrong
- **efficiency**: peaks at 1.0 for 30–70% step usage, degrades toward 0 at extremes

Hard tasks require `query_telemetry` to discover critical pre-deployment anomalies. A rule-based agent that skips telemetry inspection will score ~0.77 on hard tasks, while an agent that queries live metrics across all affected services scores ~0.98. Easy/medium tasks are solvable without telemetry.

## Baseline Scores (Heuristic Agent)

| Task | Score | Decision |
|------|-------|----------|
| easy_001 | 0.983 | request_changes |
| easy_002 | 0.983 | request_changes |
| medium_001 | 0.983 | approve |
| medium_002 | 0.983 | approve |
| hard_001 | 0.773 | request_changes |
| hard_002 | 0.760 | block |
| **Average** | **0.911** | |

The gap between the easy (0.983) and hard (0.767 on average) scores reflects genuine difficulty: hard tasks require `query_telemetry` on multiple services to surface pre-deployment metric anomalies that static diff/test inspection cannot reveal.

The heuristic baseline runs via `curl -X POST http://localhost:7860/baseline`; no LLM is required.

## Validator Parity Checks

```bash
openenv validate
python3 scripts/validator_parity_check.py
pytest -q
```

CI runs the same checks in `.github/workflows/validator-parity.yml` on every push/PR.

## Rollout State Machine

```
precheck --start_canary--> canary --promote--> promoted [terminal]
                             |
                          rollback --> rolled_back [terminal]

submit_decision ends the episode from any phase.
```

## Running Inference Script

```bash
export API_BASE_URL="https://openrouter.ai/api/v1"  # or any OpenAI-compatible endpoint
export MODEL_NAME="meta-llama/llama-3.3-70b-instruct"
export OPENAI_API_KEY="sk-..."  # or HF_TOKEN
export ENV_URL="https://your-space.hf.space"
python3 inference.py
```
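
To make the weighting in the Grading Formula section concrete, here is a minimal Python sketch of how the five components might combine. It is an illustration, not the repository's grader: the function names `efficiency_from_steps`, `clamp_score`, and `grade` are hypothetical, the piecewise-linear efficiency curve is only one shape consistent with "peaks at 1.0 for 30–70% step usage, degrades toward 0 at extremes", and clamping is assumed as the way scores are kept within [0.001, 0.999].

```python
def efficiency_from_steps(steps_used: int, max_steps: int) -> float:
    """Hypothetical efficiency curve: 1.0 inside 30-70% step usage,
    falling linearly to 0.0 at 0% and 100% usage (assumed shape)."""
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    usage = steps_used / max_steps
    if usage < 0.3:
        value = usage / 0.3
    elif usage > 0.7:
        value = (1.0 - usage) / 0.3
    else:
        value = 1.0
    return max(0.0, min(1.0, value))


def clamp_score(raw: float, lo: float = 0.001, hi: float = 0.999) -> float:
    """Keep the score strictly inside (0, 1). Clamping is an assumption;
    the README only states the resulting bounds."""
    return max(lo, min(hi, raw))


def grade(
    evidence_coverage: float,
    risk_signal_discovery: float,
    decision_correctness: float,
    efficiency: float,
    forbidden_penalty: float = 0.0,
) -> float:
    """Weighted sum using the weights from the Grading Formula section."""
    raw = (
        0.35 * evidence_coverage
        + 0.25 * risk_signal_discovery
        + 0.30 * decision_correctness
        + 0.10 * efficiency
        - 0.30 * forbidden_penalty
    )
    return clamp_score(raw)


if __name__ == "__main__":
    # A perfect episode that used half of its step budget.
    eff = efficiency_from_steps(steps_used=10, max_steps=20)
    print(grade(1.0, 1.0, 1.0, eff))
    # An acceptable (0.5) decision that missed half of the required risk signals.
    print(grade(1.0, 0.5, 0.5, eff))
```

Because decision correctness carries only 30% of the weight, an agent that reaches the right verdict without gathering evidence or surfacing risk signals still scores low under this weighting, which matches the benchmark's emphasis on investigation.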