title: ReleaseOps-Env
emoji: 🚀
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
tags:
  - openenv
  - reinforcement-learning
  - sre
  - release-management
  - benchmark

ReleaseOps-Env

A production-grade OpenEnv benchmark for evaluating whether AI agents can safely approve, canary, pause, or roll back risky software changes under incomplete information.

Agents act as SRE reviewers: investigate a proposed change, gather evidence, and submit a final decision. The environment rewards thorough investigation and correct decisions, and penalizes wasted steps and missed risks.

Setup

pip install -e ".[dev]"

# Seed the real incident database (requires GitHub PAT with public_repo scope)
GITHUB_TOKEN=<your_token> python3 scripts/seed_db.py

# Or run without a token: uses the 12 curated SRE incidents bundled in the repo
python3 scripts/seed_db.py

The incident database (data/incidents.db) is pre-seeded with 100+ real incidents from GitHub Issues (prometheus/prometheus, kubernetes/kubernetes) and curated post-mortems from companies including Cloudflare, Stripe, AWS, PagerDuty, and Discord. The search_incidents tool queries this real SQLite database, not static JSON.
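As a rough illustration of what a keyword search against a SQLite incident store can look like, the sketch below runs a `LIKE` query over an in-memory database. The table name `incidents`, its columns, and the two placeholder rows are all hypothetical; the real schema of data/incidents.db is not described here and may differ.

```python
import sqlite3


def search_incidents(conn, keywords, limit=5):
    """Return (id, title) rows whose title or summary contains any keyword.

    Hypothetical schema: incidents(id, title, summary). SQLite's LIKE is
    case-insensitive for ASCII text by default.
    """
    if not keywords:
        return []
    clauses = " OR ".join("(title LIKE ? OR summary LIKE ?)" for _ in keywords)
    params = []
    for kw in keywords:
        pattern = f"%{kw}%"
        params.extend([pattern, pattern])
    params.append(limit)
    cursor = conn.execute(
        f"SELECT id, title FROM incidents WHERE {clauses} ORDER BY id LIMIT ?",
        params,
    )
    return cursor.fetchall()


# In-memory stand-in for the real database, filled with placeholder rows
# that exist only to make the example runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidents (id INTEGER PRIMARY KEY, title TEXT, summary TEXT)"
)
conn.executemany(
    "INSERT INTO incidents (title, summary) VALUES (?, ?)",
    [
        ("Connection pool exhaustion", "Database ran out of connections after a config change"),
        ("Retry storm", "Aggressive client retries overloaded a downstream service"),
    ],
)
```

Keywords are passed as bound parameters rather than interpolated into the SQL string, so agent-supplied search terms cannot alter the query.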

Running Locally

# Start the server
uvicorn server.app:app --port 7860

# In another terminal: run inference (requires MODEL_NAME + API key)
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.3-70B-Instruct"
export HF_TOKEN="hf_..."
export ENV_URL="http://localhost:7860"
python3 inference.py

# Or test locally without a server (no API key needed)
python3 local.py all --trace

Quick Start (API)

# Reset to a task
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "easy_001"}'

# Take a step
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "inspect_change", "section": "diff"}}'

# List tasks and schemas
curl http://localhost:7860/tasks

# Run deterministic baseline (no API key needed)
curl -X POST http://localhost:7860/baseline
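The same /reset and /step calls can be made from Python. The sketch below uses only the standard library and mirrors the curl payloads above; the helper names `build_post`, `call`, and `demo` are illustrative, and the shape of the JSON responses is not assumed beyond being valid JSON.

```python
import json
import urllib.request

ENV_URL = "http://localhost:7860"  # same address the curl examples use


def build_post(path, payload=None, base_url=ENV_URL):
    """Build a JSON POST request for an environment endpoint."""
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request, timeout=30):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def demo():
    """Reset to easy_001, then inspect the diff (needs a running server)."""
    observation = call(build_post("/reset", {"task_id": "easy_001"}))
    result = call(
        build_post(
            "/step",
            {"action": {"action_type": "inspect_change", "section": "diff"}},
        )
    )
    return observation, result
```

Call `demo()` with the server started as described under Running Locally; nothing is sent over the network until then.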

Tasks

| Task | Difficulty | Optimal Decision | Description |
|---|---|---|---|
| easy_001 | Easy | request_changes | Synchronous audit logging on payment hot path: obvious latency risk |
| easy_002 | Easy | request_changes | Connection pool increase risks DB exhaustion: missing DBA approval |
| medium_001 | Medium | approve | Backward-compatible DB index migration: all approvals in place |
| medium_002 | Medium | approve | JWT HS256→RS256 migration: backward-compatible, all checks pass |
| hard_001 | Hard | request_changes | Multi-service retry/concurrency change: requires live telemetry to detect payments-service degradation |
| hard_002 | Hard | block | Rate limit removal from API gateway: requires telemetry to confirm traffic surge risk |

Action Space

| Action | Parameters | Description |
|---|---|---|
| inspect_change | section: diff, tests, approvals, or files_changed | Read the proposed change |
| inspect_services | service: name | Check service health and SLA metrics |
| inspect_dependencies | (none) | View blast radius and dependency graph |
| search_incidents | keywords: list | Search historical incident database |
| check_policy | (none) | Evaluate current rollout policy rules |
| query_telemetry | metric, service, window | Query live metrics per rollout phase |
| request_artifact | artifact_type | Fetch load tests, rollback plans, approvals |
| control_rollout | decision: start_canary, promote, pause, or rollback | Advance the rollout state machine |
| submit_decision | final_decision, reason_codes | End the episode with a final verdict |
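A small helper can keep /step payloads consistent with the action table. The sketch below wraps each action in the `{"action": {...}}` envelope used in the Quick Start example. Treating every listed parameter as required is an assumption made here for illustration; the server's own schema (available from /tasks) is the authority.

```python
# Parameter names per action, copied from the Action Space table.
ACTION_PARAMS = {
    "inspect_change": ("section",),
    "inspect_services": ("service",),
    "inspect_dependencies": (),
    "search_incidents": ("keywords",),
    "check_policy": (),
    "query_telemetry": ("metric", "service", "window"),
    "request_artifact": ("artifact_type",),
    "control_rollout": ("decision",),
    "submit_decision": ("final_decision", "reason_codes"),
}


def make_action(action_type, **params):
    """Build a /step request body, rejecting unknown actions or parameters.

    Assumption: every parameter listed for an action must be supplied.
    """
    if action_type not in ACTION_PARAMS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    expected = set(ACTION_PARAMS[action_type])
    given = set(params)
    if given != expected:
        raise ValueError(
            f"{action_type} expects parameters {sorted(expected)}, got {sorted(given)}"
        )
    return {"action": {"action_type": action_type, **params}}
```

For example, `make_action("inspect_change", section="diff")` reproduces the body of the curl /step call shown in Quick Start.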

Observation Space

| Field | Type | Description |
|---|---|---|
| task_id | str | Current task identifier |
| change_summary | str | One-line description of the proposed change |
| known_risk_signals | list[RiskSignal] | Risks discovered so far (signal_id, severity, summary) |
| last_tool_result | ToolResult | Result of the last action taken |
| allowed_actions | list[str] | Actions valid in the current rollout phase |
| rollout_phase | str | precheck → canary → promoted or rolled_back |
| time_remaining | int | Steps remaining before timeout |
| cumulative_reward | float | Running reward total |
| final_score | float or null | Grader score strictly between 0 and 1 (set on terminal step) |
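On the client side, the observation fields can be mirrored in typed containers. The sketch below is one possible mapping: the field names come from the table, while the `str` type for each RiskSignal field and the opaque handling of ToolResult are assumptions, since their exact schemas are not given here.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class RiskSignal:
    signal_id: str
    severity: str  # assumed to be a string label
    summary: str


@dataclass
class Observation:
    task_id: str
    change_summary: str
    rollout_phase: str
    time_remaining: int
    cumulative_reward: float
    known_risk_signals: List[RiskSignal] = field(default_factory=list)
    allowed_actions: List[str] = field(default_factory=list)
    last_tool_result: Any = None  # ToolResult kept opaque (schema unknown)
    final_score: Optional[float] = None  # set only on the terminal step


def parse_observation(raw):
    """Convert a decoded JSON observation (a dict) into an Observation."""
    signals = [
        RiskSignal(s["signal_id"], s["severity"], s["summary"])
        for s in raw.get("known_risk_signals", [])
    ]
    return Observation(
        task_id=raw["task_id"],
        change_summary=raw["change_summary"],
        rollout_phase=raw["rollout_phase"],
        time_remaining=int(raw["time_remaining"]),
        cumulative_reward=float(raw["cumulative_reward"]),
        known_risk_signals=signals,
        allowed_actions=list(raw.get("allowed_actions", [])),
        last_tool_result=raw.get("last_tool_result"),
        final_score=raw.get("final_score"),
    )
```

An agent loop would typically check `final_score is not None` to detect the terminal step.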

Grading Formula

score = 0.35 * evidence_coverage
      + 0.25 * risk_signal_discovery
      + 0.30 * decision_correctness
      + 0.10 * efficiency
      - 0.30 * forbidden_penalty

Scores are normalized to lie strictly inside (0, 1), specifically within [0.001, 0.999]. Grading is fully deterministic, with no LLM judge.

  • evidence_coverage: fraction of required evidence sources the agent inspected
  • risk_signal_discovery: fraction of required risk signals the environment emitted during the episode (objective: it measures what the agent actually observed, not what strings it typed)
  • decision_correctness: 1.0 for optimal decision, 0.5 for acceptable, 0.0 for wrong
  • efficiency: peaks at 1.0 for 30–70% step usage, degrades toward 0 at extremes
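The formula and bounds above can be sketched in a few lines of Python. The weights and the [0.001, 0.999] range come from this section; implementing the normalization as a simple clamp, and the function name `grade`, are assumptions rather than a description of the actual grader.

```python
def grade(
    evidence_coverage,
    risk_signal_discovery,
    decision_correctness,
    efficiency,
    forbidden_penalty=0.0,
    lower=0.001,
    upper=0.999,
):
    """Weighted score from the grading formula, kept inside [lower, upper].

    Assumption: normalization is modeled as a clamp on the raw weighted sum.
    """
    raw = (
        0.35 * evidence_coverage
        + 0.25 * risk_signal_discovery
        + 0.30 * decision_correctness
        + 0.10 * efficiency
        - 0.30 * forbidden_penalty
    )
    return min(upper, max(lower, raw))


# Full evidence, half the risk signals, optimal decision, ideal efficiency:
# 0.35 + 0.125 + 0.30 + 0.10 = 0.875
example = grade(1.0, 0.5, 1.0, 1.0)
```

Under this reading, a perfect episode has a raw sum of 1.0 and is reported as 0.999, while a heavily penalized episode with a negative raw sum is reported as 0.001.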

Hard tasks require query_telemetry to discover critical pre-deployment anomalies. A rule-based agent that skips telemetry inspection will score ~0.77 on hard tasks, while an agent that queries live metrics across all affected services scores ~0.98. Easy/medium tasks are solvable without telemetry.

Baseline Scores (Heuristic Agent)

| Task | Score | Decision |
|---|---|---|
| easy_001 | 0.983 | request_changes |
| easy_002 | 0.983 | request_changes |
| medium_001 | 0.983 | approve |
| medium_002 | 0.983 | approve |
| hard_001 | 0.773 | request_changes |
| hard_002 | 0.760 | block |
| Average | 0.911 | |

The gap between easy (0.983) and hard (0.767 on average) scores reflects genuine difficulty: hard tasks require query_telemetry on multiple services to surface pre-deployment metric anomalies that static diff/test inspection cannot reveal.
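The averages quoted here follow directly from the per-task baseline scores. The sketch below recomputes them with `decimal` so that rounding to three places is exact; the helper name `mean3` is illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-task heuristic baseline scores from the table above.
SCORES = {
    "easy_001": Decimal("0.983"),
    "easy_002": Decimal("0.983"),
    "medium_001": Decimal("0.983"),
    "medium_002": Decimal("0.983"),
    "hard_001": Decimal("0.773"),
    "hard_002": Decimal("0.760"),
}


def mean3(values):
    """Arithmetic mean rounded half-up to three decimal places."""
    values = list(values)
    return (sum(values) / len(values)).quantize(
        Decimal("0.001"), rounding=ROUND_HALF_UP
    )


overall = mean3(SCORES.values())
hard = mean3(v for k, v in SCORES.items() if k.startswith("hard_"))
easy = mean3(v for k, v in SCORES.items() if k.startswith("easy_"))
```

Here `overall` is 5.465 / 6 rounded to 0.911, and `hard` is 1.533 / 2 = 0.7665 rounded half-up to 0.767.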

Heuristic baseline runs via curl -X POST http://localhost:7860/baseline, with no LLM required.

Validator Parity Checks

openenv validate
python3 scripts/validator_parity_check.py
pytest -q

CI runs the same checks in .github/workflows/validator-parity.yml on every push/PR.

Rollout State Machine

precheck --start_canary--> canary --promote--> promoted  [terminal]
                               |
                            rollback --> rolled_back      [terminal]
submit_decision ends the episode from any phase.
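The diagram can be read as a small transition table. The sketch below encodes only the three edges drawn above, reading the rollback edge as leaving the canary phase (an interpretation of the diagram's layout). The pause decision is left out because the diagram does not show which transition it causes, and `next_phase` is an illustrative name.

```python
# (current phase, control_rollout decision) -> next phase
TRANSITIONS = {
    ("precheck", "start_canary"): "canary",
    ("canary", "promote"): "promoted",
    ("canary", "rollback"): "rolled_back",  # assumed to originate from canary
}

TERMINAL_PHASES = {"promoted", "rolled_back"}


def next_phase(phase, decision):
    """Return the phase reached by applying a decision, or raise ValueError."""
    if phase in TERMINAL_PHASES:
        raise ValueError(f"{phase} is terminal; no further rollout decisions")
    try:
        return TRANSITIONS[(phase, decision)]
    except KeyError:
        raise ValueError(f"{decision!r} is not valid in phase {phase!r}") from None


def allowed_decisions(phase):
    """List the decisions that have an outgoing edge from the given phase."""
    return sorted(d for (p, d) in TRANSITIONS if p == phase)
```

A client can use `allowed_decisions` as a local sanity check, though the observation's allowed_actions field remains the source of truth for what the server accepts.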

Running Inference Script

export API_BASE_URL="https://openrouter.ai/api/v1"   # or any OpenAI-compatible endpoint
export MODEL_NAME="meta-llama/llama-3.3-70b-instruct"
export OPENAI_API_KEY="sk-..."                         # or HF_TOKEN
export ENV_URL="https://your-space.hf.space"
python3 inference.py
