title: SRE Gym
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0
sre-gym: Fault-injecting SRE training env for OpenEnv
Most SRE agent skills are runbooks and good intentions. sre-gym is the other half: a fault-injecting environment with deterministic grading where an agent diagnoses a production-style incident, chooses a safe remediation, verifies recovery, and declares the incident resolved. Identical runs always receive identical scores.
- Spec-compliant OpenEnv environment (typed Pydantic action / observation / state, reset/step/state, openenv validate green).
- 3 curriculum scenarios (easy, medium, hard) with decoy services and causal dependencies.
- 11 bounded actions. Honest state transitions. No hidden oracles.
- 21 tests passing.
- Ships a Claude Code skill + verified-runbook loop: successful solves write markdown runbooks that the next run reads back.
30-second demo
./demo/run_demo.sh
Starts the env, solves each scenario cold, writes a runbook for each, re-solves to prove the loop. Full transcript takes ~10 seconds.
Curriculum
| Difficulty | Scenario | Story | Decoy | Correct path |
|---|---|---|---|---|
| easy | worker_deploy_cascade | Bad worker deploy → DB crash-loop → login 502s | none | rollback worker → restart db → verify → resolve |
| medium | db_config_rollout | DB config push shrank connection pool from 80 → 12 | recent worker deploy | rollback db → restart db → verify → resolve |
| hard | gateway_auth_rollout | Gateway auth-middleware rollout rejects valid logins | recent worker deploy | rollback gateway → verify → resolve (no restart) |
Rolling back the wrong service returns a negative reward and failure_type="wrong_remediation_target". Restarting before the cause is removed re-inherits the bad state. declare_resolved is rejected until the scenario's resolution check passes against the actual world model.
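The three rules above can be illustrated with a toy state machine. This is a minimal sketch, not the environment's actual code: the class ToyIncident, its method names, and the reward magnitudes are hypothetical, and only the scenario names, the correct paths, and the failure_type string come from this README.

```python
from dataclasses import dataclass

# Root-cause service and required restart per scenario, read off the
# curriculum table. The dict layout itself is an illustrative assumption.
SCENARIOS = {
    "worker_deploy_cascade": {"root_cause": "worker", "restart": "db"},
    "db_config_rollout": {"root_cause": "db", "restart": "db"},
    "gateway_auth_rollout": {"root_cause": "gateway", "restart": None},
}


@dataclass
class ToyIncident:
    """Hypothetical stand-in for the env's world model."""
    scenario: str
    cause_removed: bool = False
    restarted: bool = False
    verified: bool = False

    def rollback(self, service: str) -> dict:
        if service != SCENARIOS[self.scenario]["root_cause"]:
            # Wrong target: negative reward plus the documented failure type.
            # The value -0.1 is made up for illustration.
            return {"reward": -0.1, "failure_type": "wrong_remediation_target"}
        self.cause_removed = True
        return {"reward": 0.1, "failure_type": None}

    def restart(self, service: str) -> dict:
        # A restart only "sticks" once the cause is gone; otherwise the
        # service comes back up in the same bad state.
        if service == SCENARIOS[self.scenario]["restart"] and self.cause_removed:
            self.restarted = True
        return {"reward": 0.0, "failure_type": None}

    def healthy(self) -> bool:
        needs_restart = SCENARIOS[self.scenario]["restart"] is not None
        return self.cause_removed and (self.restarted or not needs_restart)

    def verify(self) -> bool:
        self.verified = self.healthy()
        return self.verified

    def declare_resolved(self) -> bool:
        # Accepted only when the check passes against the current state.
        return self.verified and self.healthy()
```

In this sketch, restarting db before rolling it back leaves the incident unhealthy, so a later declare_resolved() returns False until the rollback, restart, and verify steps run in the documented order.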
Install
# 1. Create a venv and install
python3 -m venv .venv && source .venv/bin/activate
pip install -e '.[dev]'
# 2. Start the env
uvicorn server.app:app --host 127.0.0.1 --port 8000
# 3. Run the baseline inference against it
export HF_TOKEN="…"; export ENV_BASE_URL=http://127.0.0.1:8000
python inference.py
Install the Claude Code skill
ln -s "$PWD/skill" "$HOME/.claude/skills/sre-gym"
Then, in Claude Code, ask: "Solve the db_config_rollout scenario in sre-gym." The skill will drive the env via skill/tools/sre_gym_client.py, load any existing runbook from skill/verified-runbooks/, and append a fresh runbook on any clean solve (score > 0.85).
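The load-then-append behavior described above could look roughly like the following. This is a sketch under stated assumptions: the helper names load_runbook and record_runbook, the one-file-per-scenario naming, and the markdown layout are all hypothetical; only the 0.85 threshold and the verified-runbooks/ directory come from this README.

```python
from pathlib import Path
from typing import List, Optional

CLEAN_SOLVE_THRESHOLD = 0.85  # threshold stated in this README


def load_runbook(runbook_dir: Path, scenario: str) -> Optional[str]:
    """Return the stored runbook text for a scenario, or None if absent.

    Assumes one markdown file per scenario, named <scenario>.md.
    """
    path = runbook_dir / f"{scenario}.md"
    return path.read_text(encoding="utf-8") if path.exists() else None


def record_runbook(runbook_dir: Path, scenario: str, score: float,
                   steps: List[str]) -> bool:
    """Append a runbook entry only when the solve was clean (score > 0.85)."""
    if score <= CLEAN_SOLVE_THRESHOLD:
        return False
    runbook_dir.mkdir(parents=True, exist_ok=True)
    lines = [f"## {scenario} (score {score:.2f})", ""]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, 1)]
    # Append rather than overwrite, so earlier verified solves are kept.
    with (runbook_dir / f"{scenario}.md").open("a", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n\n")
    return True
```

Appending keeps a history of clean solves per scenario; whether the real skill appends to one file or writes a new file per solve is not specified here.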
Architecture
┌────────────────────┐     HTTP / WS       ┌──────────────────────┐
│ Claude Code        │ ──────────────────▶ │ OpenEnv server       │
│ (with sre-gym      │ ◀────────────────── │ (FastAPI, uvicorn)   │
│  skill loaded)     │    obs, reward      │ unified_incident_env │
└────────────────────┘                     └──────────────────────┘
         │                                            ▲
         ▼ on clean solve (score > 0.85)              │
┌────────────────────┐                                │
│ verified-runbooks/ │ ────── loaded at skill load ───┘
│ *.md               │
└────────────────────┘
Scoring
Deterministic, 5 dimensions, sums to a public score in [0.01, 0.99]:
- Recovery (0–0.4): critical-path services healthy
- Containment (0–0.3): root cause removed or offending service isolated
- Verification (0–0.35): database_recovery + end_to_end checks passed
- Impact (0–0.15): user-impact reduced
- Efficiency (0–0.10): budget preserved, no wasteful repeats
Target > 0.85 for "clean solve." That's also the runbook-record threshold.
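As a rough illustration of how the five dimensions might combine, the sketch below sums per-dimension values and bounds the result. The per-dimension maxima are the ones listed above; they add up to 1.30, and this README does not say how the sum is mapped into [0.01, 0.99], so the clamping step here is an assumption, as are the function names public_score and is_clean_solve.

```python
from typing import Dict

# Per-dimension maxima as listed in the Scoring section.
MAX_BY_DIMENSION = {
    "recovery": 0.40,
    "containment": 0.30,
    "verification": 0.35,
    "impact": 0.15,
    "efficiency": 0.10,
}
SCORE_FLOOR, SCORE_CEIL = 0.01, 0.99  # public score range from the README


def public_score(parts: Dict[str, float]) -> float:
    """Sum the dimension scores, then bound the total.

    ASSUMPTION: the total is clamped into [0.01, 0.99]. The real grader
    may normalize or weight the dimensions differently.
    """
    total = 0.0
    for name, cap in MAX_BY_DIMENSION.items():
        value = parts.get(name, 0.0)
        total += min(max(value, 0.0), cap)  # keep each part within its range
    return min(max(total, SCORE_FLOOR), SCORE_CEIL)


def is_clean_solve(score: float) -> bool:
    """Clean solve threshold from the README: strictly greater than 0.85."""
    return score > 0.85
```

Because the function has no randomness and no hidden state, the same dimension values always yield the same public score, which is the property the deterministic grading claim relies on.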
Repo layout
unified_incident_env/ # env core: models, environment, grader, challenge, tests
server/ # OpenEnv entrypoint wrapper
skill/ # Claude Code skill: SKILL.md, tools/, verified-runbooks/
demo/ # run_demo.sh + pitch.md
inference.py # OpenAI-client baseline for OpenEnv hackathon submission
openenv.yaml # OpenEnv manifest
Dockerfile # HF Space deployment
Verify
pytest unified_incident_env/tests -q # 21 tests
python -m openenv.cli validate . # OpenEnv manifest check
docker build -t sre-engineer-llm:v2 . # HF Space image
Roadmap: v2
Distill the accumulated verified-runbooks/ corpus into a local 3B reviewer via OpenClaw-RL's async GRPO-on-next-state loop. Same reward contract (run_check passes / failure_type absent), same grader, but a compact policy that runs without a frontier API.
License
Apache 2.0