title: SRE Gym
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0

sre-gym: Fault-injecting SRE training env for OpenEnv

Most SRE agent skills are runbooks and good intentions. sre-gym is the other half: a fault-injecting environment with deterministic grading where an agent diagnoses a production-style incident, chooses a safe remediation, verifies recovery, and declares the incident resolved. Scoring the same run twice yields the same result.

  • Spec-compliant OpenEnv environment (typed Pydantic action / observation / state, reset / step / state, openenv validate green).
  • 3 curriculum scenarios (easy, medium, hard) with decoy services and causal dependencies.
  • 11 bounded actions. Honest state transitions. No hidden oracles.
  • 21 tests passing.
  • Ships a Claude Code skill + verified-runbook loop: successful solves write markdown runbooks that the next run reads back.

30-second demo

./demo/run_demo.sh

Starts the env, solves each scenario cold, writes a runbook for each, re-solves to prove the loop. Full transcript takes ~10 seconds.

Curriculum

| Difficulty | Scenario | Story | Decoy | Correct path |
|---|---|---|---|---|
| easy | worker_deploy_cascade | Bad worker deploy → DB crash-loop → login 502s | none | rollback worker → restart db → verify → resolve |
| medium | db_config_rollout | DB config push shrank connection pool from 80→12 | recent worker deploy | rollback db → restart db → verify → resolve |
| hard | gateway_auth_rollout | Gateway auth-middleware rollout rejects valid logins | recent worker deploy | rollback gateway → verify → resolve (no restart) |

Rolling back the wrong service returns a negative reward and failure_type="wrong_remediation_target". Restarting before the cause is removed re-inherits the bad state. declare_resolved is rejected until the scenario's resolution check passes against the actual world model.
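
The three rules above can be illustrated with a toy world model. This is a minimal sketch, not the real environment: the `ToyIncident` class, its method names, and the reward magnitudes are all hypothetical, chosen only to mirror the easy scenario's correct path. Only the `failure_type` string comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class ToyIncident:
    """Hypothetical stand-in for worker_deploy_cascade; not the real env."""

    root_cause: str = "worker"   # service whose deploy caused the incident
    cause_removed: bool = False
    db_healthy: bool = False
    verified: bool = False

    def rollback(self, service: str) -> dict:
        if service != self.root_cause:
            # Wrong target: negative reward, world state unchanged.
            return {"reward": -0.1, "failure_type": "wrong_remediation_target"}
        self.cause_removed = True
        return {"reward": 0.1, "failure_type": None}

    def restart_db(self) -> dict:
        # A restart only sticks once the cause is gone; otherwise the
        # database re-inherits the bad state and stays unhealthy.
        self.db_healthy = self.cause_removed
        return {"reward": 0.1 if self.db_healthy else 0.0, "failure_type": None}

    def verify(self) -> bool:
        # The check reads the actual state, not the agent's claims.
        self.verified = self.cause_removed and self.db_healthy
        return self.verified

    def declare_resolved(self) -> bool:
        # Rejected until the resolution check has passed.
        return self.verified


env = ToyIncident()
print(env.rollback("gateway")["failure_type"])  # wrong_remediation_target
env.restart_db()                                # too early: db stays unhealthy
print(env.declare_resolved())                   # False
env.rollback("worker")
env.restart_db()
env.verify()
print(env.declare_resolved())                   # True
```

The point of the sketch is ordering: the same `restart_db` call fails before the rollback and succeeds after it, which is why the correct path lists rollback first.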

Install

# 1. Create a venv and install
python3 -m venv .venv && source .venv/bin/activate
pip install -e '.[dev]'

# 2. Start the env
uvicorn server.app:app --host 127.0.0.1 --port 8000

# 3. Run the baseline inference against it
export HF_TOKEN="…"; export ENV_BASE_URL=http://127.0.0.1:8000
python inference.py

Install the Claude Code skill

mkdir -p "$HOME/.claude/skills"   # make sure the skills directory exists
ln -s "$PWD/skill" "$HOME/.claude/skills/sre-gym"

Then, in Claude Code, ask: "Solve the db_config_rollout scenario in sre-gym." The skill will drive the env via skill/tools/sre_gym_client.py, load any existing runbook from skill/verified-runbooks/, and append a fresh runbook on any clean solve (score > 0.85).
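
The read-then-append loop described above might look roughly like the following. This is a sketch under stated assumptions: the helper names `load_runbook` and `record_runbook`, the one-file-per-scenario naming, and the markdown layout are hypothetical and are not taken from skill/tools/sre_gym_client.py. Only the 0.85 threshold comes from the text.

```python
import tempfile
from pathlib import Path
from typing import List, Optional

CLEAN_SOLVE_THRESHOLD = 0.85  # a clean solve means score > 0.85


def load_runbook(runbook_dir: Path, scenario: str) -> Optional[str]:
    """Return the existing runbook text for a scenario, or None if absent."""
    path = runbook_dir / f"{scenario}.md"  # hypothetical naming scheme
    return path.read_text(encoding="utf-8") if path.exists() else None


def record_runbook(runbook_dir: Path, scenario: str, score: float,
                   steps: List[str]) -> bool:
    """Append a runbook entry only when the solve was clean."""
    if score <= CLEAN_SOLVE_THRESHOLD:
        return False
    runbook_dir.mkdir(parents=True, exist_ok=True)
    lines = [f"## {scenario} (score {score:.2f})", ""]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, 1)]
    with (runbook_dir / f"{scenario}.md").open("a", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n\n")
    return True


# Usage: a cold run finds no runbook, a clean solve writes one,
# and the next run reads it back.
with tempfile.TemporaryDirectory() as tmp:
    runbooks = Path(tmp) / "verified-runbooks"
    print(load_runbook(runbooks, "db_config_rollout"))  # None
    record_runbook(runbooks, "db_config_rollout", 0.91,
                   ["rollback db", "restart db", "verify", "resolve"])
    print(load_runbook(runbooks, "db_config_rollout"))
```

Appending rather than overwriting keeps earlier clean solves available to later runs, at the cost of a file that grows over time.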

Architecture

┌────────────────────┐      HTTP / WS      ┌───────────────────────┐
│  Claude Code       │ ──────────────────▶ │  OpenEnv server       │
│  (with sre-gym     │ ◀────────────────── │  (FastAPI, uvicorn)   │
│   skill loaded)    │    obs, reward      │  unified_incident_env │
└────────────────────┘                     └───────────────────────┘
        │                                            ▲
        ▼ on clean solve (score > 0.85)              │
┌────────────────────┐                               │
│ verified-runbooks/ │ ────── loaded at skill load ──┘
│   *.md             │
└────────────────────┘

Scoring

Deterministic, 5 dimensions, sums to a public score in [0.01, 0.99]:

  • Recovery (0–0.4): critical-path services healthy
  • Containment (0–0.3): root cause removed or offending service isolated
  • Verification (0–0.35): database_recovery + end_to_end checks passed
  • Impact (0–0.15): user-impact reduced
  • Efficiency (0–0.10): budget preserved, no wasteful repeats

Target > 0.85 for "clean solve." That's also the runbook-record threshold.
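
As a rough illustration of how the dimensions could combine, the sketch below caps each dimension at its listed maximum, sums them, and clamps the total into [0.01, 0.99]. The clamp is an assumption: the listed maxima add up to 1.30, so a plain sum alone would not stay inside the stated range, and the actual grader may combine them differently. The names `MAX_BY_DIMENSION`, `public_score`, and `is_clean_solve` are hypothetical.

```python
# Per-dimension maxima as listed above.
MAX_BY_DIMENSION = {
    "recovery": 0.40,
    "containment": 0.30,
    "verification": 0.35,
    "impact": 0.15,
    "efficiency": 0.10,
}


def public_score(dimensions: dict) -> float:
    """Cap each dimension, sum, then clamp to [0.01, 0.99] (assumed)."""
    total = 0.0
    for name, cap in MAX_BY_DIMENSION.items():
        value = dimensions.get(name, 0.0)
        total += min(max(value, 0.0), cap)
    return min(max(total, 0.01), 0.99)


def is_clean_solve(score: float) -> bool:
    """Clean solve, and runbook-record threshold, is score > 0.85."""
    return score > 0.85


run = {"recovery": 0.4, "containment": 0.3, "verification": 0.1,
       "impact": 0.05, "efficiency": 0.05}
score = public_score(run)
print(round(score, 2), is_clean_solve(score))  # 0.9 True
```

Because the function has no randomness and no hidden state, the same dimension values always produce the same score, which is the property the grader advertises.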

Repo layout

unified_incident_env/    # env core: models, environment, grader, challenge, tests
server/                  # OpenEnv entrypoint wrapper
skill/                   # Claude Code skill: SKILL.md, tools/, verified-runbooks/
demo/                    # run_demo.sh + pitch.md
inference.py             # OpenAI-client baseline for OpenEnv hackathon submission
openenv.yaml             # OpenEnv manifest
Dockerfile               # HF Space deployment

Verify

pytest unified_incident_env/tests -q          # 21 tests
python -m openenv.cli validate .              # OpenEnv manifest check
docker build -t sre-engineer-llm:v2 .         # HF Space image

Roadmap: v2

Distill the accumulated verified-runbooks/ corpus into a local 3B reviewer via OpenClaw-RL's async GRPO-on-next-state loop. Same reward contract (run_check passes / failure_type absent), same grader, but a compact policy that runs without a frontier API.

License

Apache 2.0