| title: ReleaseOps-Env | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: indigo | |
| sdk: docker | |
| app_port: 7860 | |
| tags: | |
| - openenv | |
| - reinforcement-learning | |
| - sre | |
| - release-management | |
| - benchmark | |
| # ReleaseOps-Env | |
| A production-grade OpenEnv benchmark for evaluating whether AI agents can safely approve, canary, pause, or roll back risky software changes under incomplete information. | |
| Agents act as SRE reviewers: investigate a proposed change, gather evidence, and submit a final decision. The environment rewards thorough investigation and correct decisions, and penalizes wasted steps and missed risks. | |
| ## Setup | |
| ```bash | |
| pip install -e ".[dev]" | |
| # Seed the real incident database (requires GitHub PAT with public_repo scope) | |
| GITHUB_TOKEN=<your_token> python3 scripts/seed_db.py | |
| # Or run without a token: uses the 12 curated SRE incidents bundled in the repo | |
| python3 scripts/seed_db.py | |
| ``` | |
| The incident database (`data/incidents.db`) is pre-seeded with 100+ real incidents from | |
| GitHub Issues (prometheus/prometheus, kubernetes/kubernetes) and curated post-mortems | |
| from companies including Cloudflare, Stripe, AWS, PagerDuty, and Discord. The | |
| `search_incidents` tool queries this real SQLite database, not static JSON. | |
| ## Running Locally | |
| ```bash | |
| # Start the server | |
| uvicorn server.app:app --port 7860 | |
| # In another terminal: run inference (requires MODEL_NAME + API key) | |
| export API_BASE_URL="https://router.huggingface.co/v1" | |
| export MODEL_NAME="meta-llama/Llama-3.3-70B-Instruct" | |
| export HF_TOKEN="hf_..." | |
| export ENV_URL="http://localhost:7860" | |
| python3 inference.py | |
| # Or test locally without a server (no API key needed) | |
| python3 local.py all --trace | |
| ``` | |
| ## Quick Start (API) | |
| ```bash | |
| # Reset to a task | |
| curl -X POST http://localhost:7860/reset \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"task_id": "easy_001"}' | |
| # Take a step | |
| curl -X POST http://localhost:7860/step \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"action": {"action_type": "inspect_change", "section": "diff"}}' | |
| # List tasks and schemas | |
| curl http://localhost:7860/tasks | |
| # Run deterministic baseline (no API key needed) | |
| curl -X POST http://localhost:7860/baseline | |
| ``` | |
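The same calls can be made from Python. The sketch below mirrors the curl commands above using only the standard library. The helper names `build_request` and `send` are hypothetical, and nothing is assumed about the response beyond it being JSON.

```python
import json
import urllib.request

ENV_URL = "http://localhost:7860"  # same local server as the curl examples


def build_request(path, payload=None, method=None, base_url=ENV_URL):
    """Build (but do not send) a request for an environment endpoint.

    With a payload the request is a JSON POST; without one it defaults to GET.
    Pass method="POST" for body-less POST endpoints such as /baseline.
    """
    url = base_url.rstrip("/") + path
    if payload is None:
        return urllib.request.Request(url, method=method or "GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method=method or "POST",
    )


def send(request, timeout=30):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Requests equivalent to the four curl calls above.
reset_req = build_request("/reset", {"task_id": "easy_001"})
step_req = build_request(
    "/step", {"action": {"action_type": "inspect_change", "section": "diff"}}
)
tasks_req = build_request("/tasks")
baseline_req = build_request("/baseline", method="POST")
```

With the server from "Running Locally" up, `send(reset_req)` followed by `send(step_req)` would perform the reset and the first step.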
| ## Tasks | |
| | Task | Difficulty | Optimal Decision | Description | | |
| |------|-----------|-----------------|-------------| | |
| | `easy_001` | Easy | `request_changes` | Synchronous audit logging on payment hot path: obvious latency risk | |
| | `easy_002` | Easy | `request_changes` | Connection pool increase risks DB exhaustion: missing DBA approval | |
| | `medium_001` | Medium | `approve` | Backward-compatible DB index migration: all approvals in place | |
| | `medium_002` | Medium | `approve` | JWT HS256→RS256 migration: backward-compatible, all checks pass | |
| | `hard_001` | Hard | `request_changes` | Multi-service retry/concurrency change: requires live telemetry to detect payments-service degradation | |
| | `hard_002` | Hard | `block` | Rate limit removal from API gateway: requires telemetry to confirm traffic surge risk | |
| ## Action Space | |
| | Action | Parameters | Description | | |
| |--------|-----------|-------------| | |
| | `inspect_change` | `section`: diff\|tests\|approvals\|files_changed | Read the proposed change | | |
| | `inspect_services` | `service`: name | Check service health and SLA metrics | | |
| | `inspect_dependencies` | (none) | View blast radius and dependency graph | |
| | `search_incidents` | `keywords`: list | Search historical incident database | |
| | `check_policy` | (none) | Evaluate current rollout policy rules | |
| | `query_telemetry` | `metric`, `service`, `window` | Query live metrics per rollout phase | | |
| | `request_artifact` | `artifact_type` | Fetch load tests, rollback plans, approvals | | |
| | `control_rollout` | `decision`: start_canary\|promote\|pause\|rollback | Advance the rollout state machine | | |
| | `submit_decision` | `final_decision`, `reason_codes` | End the episode with a final verdict | | |
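As a rough illustration of the table above, the sketch below builds `/step` payloads and rejects action names, parameter names, or enumerated values that the table does not list. The helper `build_action` is hypothetical. Whether each parameter is required or optional is not stated here, so the sketch only checks for unknown names, and the payload shape is taken from the Quick Start example.

```python
# Parameter names per action, transcribed from the Action Space table.
ACTION_PARAMS = {
    "inspect_change": ("section",),
    "inspect_services": ("service",),
    "inspect_dependencies": (),
    "search_incidents": ("keywords",),
    "check_policy": (),
    "query_telemetry": ("metric", "service", "window"),
    "request_artifact": ("artifact_type",),
    "control_rollout": ("decision",),
    "submit_decision": ("final_decision", "reason_codes"),
}

# Enumerated values that the table spells out explicitly.
ALLOWED_VALUES = {
    ("inspect_change", "section"): {"diff", "tests", "approvals", "files_changed"},
    ("control_rollout", "decision"): {"start_canary", "promote", "pause", "rollback"},
}


def build_action(action_type, **params):
    """Return a /step payload, raising ValueError on unknown names or values."""
    if action_type not in ACTION_PARAMS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    unexpected = set(params) - set(ACTION_PARAMS[action_type])
    if unexpected:
        raise ValueError(f"unexpected parameters: {sorted(unexpected)}")
    for name, value in params.items():
        allowed = ALLOWED_VALUES.get((action_type, name))
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} not in {sorted(allowed)}")
    return {"action": {"action_type": action_type, **params}}


payload = build_action("inspect_change", section="diff")
```

Catching a misspelled action locally saves a step, which matters because wasted steps are penalized.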
| ## Observation Space | |
| | Field | Type | Description | | |
| |-------|------|-------------| | |
| | `task_id` | str | Current task identifier | | |
| | `change_summary` | str | One-line description of the proposed change | | |
| | `known_risk_signals` | list[RiskSignal] | Risks discovered so far (signal_id, severity, summary) | | |
| | `last_tool_result` | ToolResult | Result of the last action taken | | |
| | `allowed_actions` | list[str] | Actions valid in the current rollout phase | | |
| | `rollout_phase` | str | precheck → canary → promoted \| rolled_back | |
| | `time_remaining` | int | Steps remaining before timeout | | |
| | `cumulative_reward` | float | Running reward total | | |
| | `final_score` | float\|null | Grader score strictly between 0 and 1 (set on terminal step) | | |
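A client might mirror these fields with typed containers. The sketch below is one possible shape, not the environment's own model classes. It assumes the observation arrives as a flat JSON object keyed by the field names above, that `severity` is a string, and it keeps `last_tool_result` as raw data because the `ToolResult` layout is not described here.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class RiskSignal:
    signal_id: str
    severity: str  # assumption: severity arrives as a string label
    summary: str


@dataclass
class Observation:
    task_id: str
    change_summary: str
    rollout_phase: str
    time_remaining: int
    cumulative_reward: float
    known_risk_signals: List[RiskSignal] = field(default_factory=list)
    allowed_actions: List[str] = field(default_factory=list)
    last_tool_result: Any = None  # ToolResult layout is not specified here
    final_score: Optional[float] = None

    @classmethod
    def from_dict(cls, raw: dict) -> "Observation":
        """Build an Observation from a decoded JSON object."""
        signals = [
            RiskSignal(s["signal_id"], s["severity"], s["summary"])
            for s in raw.get("known_risk_signals", [])
        ]
        return cls(
            task_id=raw["task_id"],
            change_summary=raw["change_summary"],
            rollout_phase=raw["rollout_phase"],
            time_remaining=int(raw["time_remaining"]),
            cumulative_reward=float(raw["cumulative_reward"]),
            known_risk_signals=signals,
            allowed_actions=list(raw.get("allowed_actions", [])),
            last_tool_result=raw.get("last_tool_result"),
            final_score=raw.get("final_score"),
        )

    @property
    def is_terminal(self) -> bool:
        # final_score is only set on the terminal step.
        return self.final_score is not None
```

An agent loop can then stop on `obs.is_terminal` and restrict its next choice to `obs.allowed_actions`.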
| ## Grading Formula | |
| ``` | |
| score = 0.35 * evidence_coverage | |
| + 0.25 * risk_signal_discovery | |
| + 0.30 * decision_correctness | |
| + 0.10 * efficiency | |
| - 0.30 * forbidden_penalty | |
| ``` | |
| Scores are normalized into the range [0.001, 0.999], so they always lie strictly between 0 and 1. Grading is fully deterministic, with no LLM judge. | |
| - **evidence_coverage**: fraction of required evidence sources the agent inspected | |
| - **risk_signal_discovery**: fraction of required risk signals the environment emitted during the episode (objective: measures what the agent actually observed, not what strings it typed) | |
| - **decision_correctness**: 1.0 for optimal decision, 0.5 for acceptable, 0.0 for wrong | |
| - **efficiency**: peaks at 1.0 for 30–70% step usage, degrades toward 0 at extremes | |
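The weighted sum can be written out directly. The sketch below takes the five components as inputs already in [0, 1] and applies the weights from the formula above. The function name `grade` is hypothetical, and treating the final normalization as a simple clamp to [0.001, 0.999] is an assumption; the actual grader may map scores differently.

```python
SCORE_MIN, SCORE_MAX = 0.001, 0.999


def grade(
    evidence_coverage,
    risk_signal_discovery,
    decision_correctness,
    efficiency,
    forbidden_penalty=0.0,
):
    """Combine grading components using the documented weights."""
    raw = (
        0.35 * evidence_coverage
        + 0.25 * risk_signal_discovery
        + 0.30 * decision_correctness
        + 0.10 * efficiency
        - 0.30 * forbidden_penalty
    )
    # Assumption: normalization is a clamp into the documented bounds.
    return min(SCORE_MAX, max(SCORE_MIN, raw))


perfect = grade(1.0, 1.0, 1.0, 1.0)            # raw 1.0, capped at the upper bound
acceptable = grade(1.0, 1.0, 0.5, 1.0)         # an "acceptable" decision costs 0.15
worst = grade(0.0, 0.0, 0.0, 0.0, 1.0)         # raw -0.30, raised to the lower bound
```

Because the positive weights sum to 1.0, an agent with full evidence and signal coverage still loses 0.30 for a wrong decision.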
| Hard tasks require `query_telemetry` to discover critical pre-deployment anomalies. A rule-based | |
| agent that skips telemetry inspection will score ~0.77 on hard tasks, while an agent that | |
| queries live metrics across all affected services scores ~0.98. Easy/medium tasks are solvable | |
| without telemetry. | |
| ## Baseline Scores (Heuristic Agent) | |
| | Task | Score | Decision | | |
| |------|-------|----------| | |
| | easy_001 | 0.983 | request_changes | | |
| | easy_002 | 0.983 | request_changes | | |
| | medium_001 | 0.983 | approve | | |
| | medium_002 | 0.983 | approve | | |
| | hard_001 | 0.773 | request_changes | | |
| | hard_002 | 0.760 | block | | |
| | **Average** | **0.911** | | | |
| The gap between easy (0.983) and hard (0.767) scores reflects genuine difficulty: hard tasks | |
| require `query_telemetry` on multiple services to surface pre-deployment metric anomalies that | |
| static diff/test inspection cannot reveal. | |
| The heuristic baseline runs via `curl -X POST http://localhost:7860/baseline` and needs no LLM. | |
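The averages quoted above follow from the per-task scores in the table. This short check recomputes them; the helper `tier_mean` is hypothetical, and the scores are copied from the table rather than produced by running the environment.

```python
from statistics import mean

# Per-task heuristic baseline scores, copied from the table above.
BASELINE = {
    "easy_001": 0.983,
    "easy_002": 0.983,
    "medium_001": 0.983,
    "medium_002": 0.983,
    "hard_001": 0.773,
    "hard_002": 0.760,
}


def tier_mean(prefix):
    """Average score over tasks whose id starts with the given tier prefix."""
    return mean(s for task, s in BASELINE.items() if task.startswith(prefix))


overall = mean(BASELINE.values())
easy_mean = tier_mean("easy")
hard_mean = tier_mean("hard")
print(f"overall={overall:.3f} easy={easy_mean:.3f} hard={hard_mean:.4f}")
```

The hard-tier mean of 0.773 and 0.760 is 0.7665, which the text reports as 0.767.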
| ## Validator Parity Checks | |
| ```bash | |
| openenv validate | |
| python3 scripts/validator_parity_check.py | |
| pytest -q | |
| ``` | |
| CI runs the same checks in `.github/workflows/validator-parity.yml` on every push/PR. | |
| ## Rollout State Machine | |
| ``` | |
| precheck --start_canary--> canary --promote--> promoted [terminal] | |
| | | |
| rollback --> rolled_back [terminal] | |
| submit_decision ends the episode from any phase. | |
| ``` | |
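The diagram can be read as a small transition table. The sketch below is one interpretation, not the environment's code. It assumes `rollback` is taken from the `canary` phase, since the diagram does not make the source phase explicit. It also leaves out `pause`, which the Action Space lists but the diagram does not show. The function `next_phase` is a hypothetical name.

```python
TERMINAL_PHASES = {"promoted", "rolled_back"}

# (current phase, control_rollout decision) -> next phase
TRANSITIONS = {
    ("precheck", "start_canary"): "canary",
    ("canary", "promote"): "promoted",
    ("canary", "rollback"): "rolled_back",  # assumption: rollback leaves canary
}


def next_phase(phase, decision):
    """Return the phase reached by a control_rollout decision.

    Raises ValueError for terminal phases and for transitions that the
    diagram does not show.
    """
    if phase in TERMINAL_PHASES:
        raise ValueError(f"{phase!r} is terminal")
    try:
        return TRANSITIONS[(phase, decision)]
    except KeyError:
        raise ValueError(f"no {decision!r} transition from {phase!r}") from None


# Walk the happy path: precheck -> canary -> promoted.
path = ["precheck"]
for decision in ("start_canary", "promote"):
    path.append(next_phase(path[-1], decision))
```

`submit_decision` is deliberately not in the table: it ends the episode from any phase instead of moving between phases.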
| ## Running the Inference Script | |
| ```bash | |
| export API_BASE_URL="https://openrouter.ai/api/v1" # or any OpenAI-compatible endpoint | |
| export MODEL_NAME="meta-llama/llama-3.3-70b-instruct" | |
| export OPENAI_API_KEY="sk-..." # or HF_TOKEN | |
| export ENV_URL="https://your-space.hf.space" | |
| python3 inference.py | |
| ``` | |