Execution runbook: sre-gym
Operator guide. From clone → live env → training → submission. Updated after the hackathon training run; reflects the current state of the codebase.
Current state
| Item | Status |
|---|---|
| Triage env (12 templates × 6 entries = 72 scenarios) | ✅ runnable end-to-end |
| Strategy orchestrator (chains Triage episodes) | ✅ runnable as Python orchestrator |
| Operations graph state-machine simulator (22 nodes, 11 chaos patterns) | ✅ runnable in Python |
| Operations docker-compose stack (ghcr.io/sre-gym/* images) | 🟡 design-spec, images not published |
| Strategy 28-action universe (in YAML) | 🟡 design-spec, runner uses Triage 11 |
| Gradio UI mounted at / of the FastAPI server | ✅ live |
| MCP JSON-RPC 2.0 dual-route at /mcp | ✅ live + parity-tested |
| Coliseum parallel-rollout pool server | ✅ live |
| Pytest suite | ✅ green at HEAD |
| openenv validate . | ✅ green |
| End-to-end SFT → GRPO run (Qwen2.5-7B) | ✅ executed |
| Eval comparison run (5 policies × 36 episodes) | ✅ executed |
| Trained-model row in baselines table | ✅ measured (mean=0.379, see §7) |
The honest framing: the env is the project, the rubric is the engineering crown jewel, and the training run is below the heuristic plateau because the corpus + step budget that fit inside a hackathon weekend aren't enough to break it. Pretending otherwise is the original sin of every other SRE-agent demo. We don't.
Table of contents
- Prerequisites
- Local setup
- First-run smoke test
- Tier-aware operation
- Scenario authoring quickstart
- Training pipeline (Triage SFT → GRPO)
- Eval comparison sweep
- HF Space deployment
- Coliseum: parallel-rollout pool server
- Claude Code skill setup
- Troubleshooting
- Submission checklist
- Operator FAQ
- Materials
1. Prerequisites
Local development (env serving + tests):
- Python 3.10+ (3.11 / 3.12 / 3.14 verified)
- pip 24+ or uv
- Git
- Docker (only required for HF Space build; not required for normal env serving)
- 4 GB free RAM, 2 GB free disk
Training (Triage SFT → GRPO on Qwen2.5-7B):
- 1×A100 80GB (HF Pro Spaces, Colab A100, or rented)
- HF account + token (HF_TOKEN) with write scope for adapter push
- ~$5-8 of HF compute credits for one ~2-3h end-to-end run
- Optional: Anthropic / Fireworks / Groq key for richer comparison rows
2. Local setup
git clone https://github.com/Madhav-GPT/SystemTruth.git
cd SystemTruth
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -e '.[dev]'
Verify:
make test # green
python -m openenv.cli validate . # green
3. First-run smoke test
Boot the combined Gradio + FastAPI server:
uvicorn app:app --host 127.0.0.1 --port 7860
Then in a second shell:
curl -s http://127.0.0.1:7860/health | jq
# {"status": "ok", "environment": "unified_incident_env", ...}
curl -s http://127.0.0.1:7860/tasks | jq '.scenarios | length'
# 72
curl -s http://127.0.0.1:7860/mcp/tools | jq '.tools | length'
# 11
Hit a scenario via /reset + /step:
curl -s -X POST http://127.0.0.1:7860/reset \
-H 'Content-Type: application/json' \
-d '{"scenario_id":"memory_leak_oom"}' | jq '.observation.workflow_stage'
# "triage"
curl -s -X POST http://127.0.0.1:7860/step \
-H 'Content-Type: application/json' \
-d '{"action":{"action_type":"query_logs","service":"worker"}}' | jq '.observation.tool_output'
# "Worker logs: 'process killed (OOM)' every ~90s..."
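The same reset-then-step round trip can be scripted without curl. The following is a minimal sketch using only the Python standard library; it assumes the request bodies and response fields are exactly those shown in the curl calls above, and the helper names (`build_reset_payload`, `build_step_payload`, `post_json`, `smoke`) are illustrative rather than part of the repo.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:7860"  # address used by the uvicorn command above


def build_reset_payload(scenario_id: str) -> dict:
    """Body for POST /reset, matching the curl example."""
    return {"scenario_id": scenario_id}


def build_step_payload(action_type: str, **fields: str) -> dict:
    """Body for POST /step: the action is nested under an "action" key."""
    return {"action": {"action_type": action_type, **fields}}


def post_json(path: str, payload: dict, base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def smoke(scenario_id: str = "memory_leak_oom") -> None:
    """Reset one scenario, take one step, print the fields the curl calls inspect."""
    reset = post_json("/reset", build_reset_payload(scenario_id))
    print(reset["observation"]["workflow_stage"])
    step = post_json("/step", build_step_payload("query_logs", service="worker"))
    print(step["observation"]["tool_output"])
```

With the server from this section running, calling `smoke()` should print the same two fields the jq filters extract. Nothing runs at import time, so the module is safe to load without a live server.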
Run the scripted-baseline smoke against all 12 templates:
make baseline
# scripted-optimal mean across all 12 templates: ~0.94
# 12 / 12 resolved
4. Tier-aware operation
make tier-info # prints per-tier metadata
Programmatic API:
from sre_gym import SREGym, Tier
# Triage: live FastAPI env
env = SREGym(tier=Tier.TRIAGE)
obs = env.reset(scenario_id="memory_leak_oom__p02")
obs = env.step({"action_type": "rollback_deploy", "service": "worker"})
result = env.run("memory_leak_oom__p02", seed=42)
# Strategy: chained Triage episodes with horizon state
env = SREGym(tier=Tier.STRATEGY)
result = env.run("cascading_release_train", seed=1)
# Operations: Python state-machine simulator
env = SREGym(tier=Tier.OPERATIONS)
obs = env.reset(family_id="ecommerce_vibecoded_saas", chaos="rls_silent_leak", seed=1)
obs = env.step({"action_type": "rollback_deploy", "service": "postgres-primary"})
CLI:
python -m sre_gym.strategy list
python -m sre_gym.strategy run cascading_release_train --seed 1
python -m sre_gym.operations list-chaos
python -m sre_gym.operations run ecommerce_vibecoded_saas --chaos rls_silent_leak
5. Scenario authoring quickstart
5.1 Add a 13th Triage template
- Append the template dict to EXTRA_TEMPLATES in unified_incident_env/server/basic_templates_extra.py.
- Append a baseline-action lambda to extra_baselines().
- Append the new RootCauseType value to unified_incident_env/models.py.
- Append the template_id to ROUND2_TEMPLATES in tests/test_round2_templates.py.
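Before running the test suite, a quick text search can confirm that the registration points above were all touched. This is a rough sketch, not a repo tool: `missing_registrations` is a hypothetical helper, it only checks that the identifier strings appear somewhere in the files named in the checklist, and it does not check the `extra_baselines()` entry because the source does not say which file holds it.

```python
from pathlib import Path

# Files named in the checklist above, relative to the repo root.
TEMPLATE_FILES = (
    "unified_incident_env/server/basic_templates_extra.py",
    "tests/test_round2_templates.py",
)
ROOT_CAUSE_FILE = "unified_incident_env/models.py"


def missing_registrations(repo_root: Path, template_id: str, root_cause: str) -> list[str]:
    """Return the files that do not yet mention the new identifiers."""
    expected = {path: template_id for path in TEMPLATE_FILES}
    expected[ROOT_CAUSE_FILE] = root_cause
    missing = []
    for relative_path, needle in expected.items():
        file_path = repo_root / relative_path
        # A missing file counts as "not registered" rather than raising.
        text = file_path.read_text(encoding="utf-8") if file_path.is_file() else ""
        if needle not in text:
            missing.append(relative_path)
    return missing
```

An empty list only means the strings are present; `make test` remains the real check.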
make test exercises all of the above automatically. Procgen variants generate at module-import time.
5.2 Add a Strategy reference scenario
Drop a new YAML in sre_gym/strategy/scenarios/. Include the DESIGN-SPEC HEADER the existing scenarios carry, and call out which subset of allowed_actions: is implemented vs design-spec. The runner falls back to the Triage 11 actions for anything else.
5.3 Add an Operations chaos pattern
Triplet of YAMLs:
- sre_gym/operations/families/<id>.yaml: family-level spec
- sre_gym/operations/chaos/<id>_chaos_library.yaml: composable chaos patterns
- sre_gym/operations/compose/<id>.yaml: docker-compose stack (mark as design-spec if images aren't published)
Then add the chaos descriptors to CHAOS_PATTERN_DEFAULTS in sre_gym/operations/runner.py so the simulator can run them.
See docs/SCENARIO_AUTHORING.md for the full schema.
6. Training pipeline (Triage SFT → GRPO)
6.1 What ships
notebooks/01_triage_train_grpo_qwen25_7b.ipynb is the canonical, end-to-end training notebook. Cells:
| # | Cell | What it does |
|---|---|---|
| 0 | Bootstrap | uv + Unsloth pinned-version install. Idempotent. |
| 1 | GPU verify | nvidia-smi + torch.cuda.is_available() |
| 2 | Build corpus | train/build_corpus.py → 120-episode trajectory corpus, 60/20/20 quality split |
| 3 | Sanity-check corpus | template coverage, score distribution, tier counts |
| 4 | Build SFT dataset | ChatML formatting, 999 step-pairs |
| 5 | Load Qwen2.5-7B (4-bit) + LoRA r=32 | Unsloth FastLanguageModel |
| 6 | SFT cold-start | 50 steps × batch 16, lr=5e-5, eval perplexity gate |
| 7 | Build GRPO prompts | 120 prompts pre-rendered with the same chat template |
| 8 | Reward function | composite + first-action bonus + key-alias normalisation |
| 9 | GRPO online | 40 steps × K=2 rollouts, beta=0.1, temperature=0.9 |
| 10 | Eval comparison sweep | 5 policies × 12 scenarios × 3 seeds |
| 11 | Summary table + plots | hero bar chart + per-template chart |
| 12 | Push to HF Hub | adapter upload |
| 13 | Package artifacts | tar.gz the outputs/ dir |
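The step budgets in the table translate into a small amount of training data seen. The arithmetic below is a sketch under two assumptions that the table does not confirm: that "batch 16" is the effective batch size (no gradient accumulation on top), and that each GRPO step processes at least one prompt with K rollouts.

```python
SFT_STEPS = 50          # cell 6
SFT_BATCH_SIZE = 16     # cell 6, assumed to be the effective batch size
SFT_STEP_PAIRS = 999    # cell 4
GRPO_STEPS = 40         # cell 9
GRPO_ROLLOUTS = 2       # cell 9, K

sft_examples_seen = SFT_STEPS * SFT_BATCH_SIZE
sft_epochs = sft_examples_seen / SFT_STEP_PAIRS
# Lower bound only: the per-step prompt batch for GRPO is not stated in the table.
grpo_min_completions = GRPO_STEPS * GRPO_ROLLOUTS

print(f"SFT examples seen: {sft_examples_seen} ({sft_epochs:.2f} epochs of {SFT_STEP_PAIRS})")
print(f"GRPO completions (at least): {grpo_min_completions}")
```

Under those assumptions the SFT stage sees 800 of the 999 step-pairs, which is less than one full pass over the dataset.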
6.2 Run it
In Colab / HF Space (recommended, A100 80GB):
- Open the notebook
- Set runtime to A100 80GB
- Set HF_TOKEN in Colab Secrets (or paste in cell 12 directly)
- Run All, top to bottom (~2-3h end-to-end)
Resume points: if outputs/qwen25_7b_sft_final/ exists, cell 6 skips. If outputs/qwen25_7b_grpo_final/ exists, cell 9 skips. If eval/results/qwen25_7b_comparison_raw.csv exists, cell 10 skips. Delete the artifact to force a re-run.
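The resume rule is "skip the cell if its output artifact exists". A minimal sketch of that check follows; the three paths come from the paragraph above, while `RESUME_ARTIFACTS` and `cells_to_skip` are illustrative names and not the notebook's own code.

```python
from pathlib import Path

# Artifact that gates each resumable cell, as listed above.
RESUME_ARTIFACTS = {
    6: Path("outputs/qwen25_7b_sft_final"),
    9: Path("outputs/qwen25_7b_grpo_final"),
    10: Path("eval/results/qwen25_7b_comparison_raw.csv"),
}


def cells_to_skip(repo_root: Path) -> list[int]:
    """Return the cell numbers whose output artifact already exists."""
    return [
        cell
        for cell, artifact in RESUME_ARTIFACTS.items()
        if (repo_root / artifact).exists()
    ]
```

Calling `cells_to_skip(Path("."))` from the repo root before Run All shows which stages will be reused, which helps avoid evaluating a stale adapter by accident.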
6.3 Stages (measured)
| Stage | Steps | Wall-clock on A100 80GB | Output |
|---|---|---|---|
| Build SFT corpus from teacher trajectories | one-time | ~30s | train/data/seed_v2_120.jsonl (120 episodes) |
| SFT cold-start (50 steps × batch 16) | 50 | ~7 min | outputs/qwen25_7b_sft_final/ |
| GRPO online (40 steps × K=2) | 40 | ~50 min (transformers fallback) / ~15 min (vLLM path) | outputs/qwen25_7b_grpo_final/ |
| Eval sweep (5 policies × 12 × 3) | 180 episodes | ~25 min | eval/results/qwen25_7b_comparison_*.csv + *.png |
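The 180-episode figure is the cross product of policies, scenarios, and seeds. The sketch below enumerates that grid; the policy names are the ones reported in the results table, while the scenario ids and seed values are placeholders, since only their counts (12 and 3) are given here.

```python
from collections import Counter
from itertools import product

POLICIES = ["random", "qwen25-7b-sft-only", "qwen25-7b-grpo", "heuristic", "scripted-optimal"]
# Placeholder ids: the real sweep uses the 12 Triage template ids.
SCENARIOS = [f"template_{index:02d}" for index in range(12)]
SEEDS = [0, 1, 2]  # assumed seed values; only the count of 3 comes from the table

# One (policy, scenario, seed) tuple per episode in the sweep.
episodes = list(product(POLICIES, SCENARIOS, SEEDS))
per_policy = Counter(policy for policy, _, _ in episodes)

print(len(episodes), dict(per_policy))
```

Each policy gets 12 × 3 = 36 episodes, which is the denominator used in the resolved_rate column.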
7. Eval comparison sweep
The notebook's cell 10 runs the full 5-policy comparison and writes:
- eval/results/qwen25_7b_comparison_raw.csv: every per-episode row
- eval/results/qwen25_7b_comparison_summary.csv: per-policy aggregates
- eval/results/qwen25_7b_comparison_hero.png: single-axis bar chart with whiskers
- eval/results/qwen25_7b_comparison_per_template.png: per-template grouped bars
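The summary file can be re-derived from the raw per-episode rows. The following is a sketch of that aggregation with the standard library only. The column names `policy` and `score` are assumptions about the raw CSV header, and the quartile method (`inclusive`) is a guess, so small differences from the shipped summary are possible.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarise(raw_csv, policy_col="policy", score_col="score"):
    """Group per-episode scores by policy and compute the summary columns.

    `raw_csv` is any iterable of CSV lines, e.g. an open file handle.
    Each policy needs at least two rows for the quartiles to be defined.
    """
    scores = defaultdict(list)
    for row in csv.DictReader(raw_csv):
        scores[row[policy_col]].append(float(row[score_col]))
    summary = {}
    for policy, values in scores.items():
        p25, median, p75 = statistics.quantiles(values, n=4, method="inclusive")
        summary[policy] = {
            "mean": statistics.fmean(values),
            "median": median,
            "p25": p25,
            "p75": p75,
        }
    return summary


# Tiny in-memory example so the sketch runs without the real file.
example = io.StringIO("policy,score\ndemo,1\ndemo,2\ndemo,3\ndemo,4\ndemo,5\n")
print(summarise(example))
```

To run it on the real output, pass `open("eval/results/qwen25_7b_comparison_raw.csv", newline="")` and adjust the column names to match the header.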
Latest measured numbers
| policy | mean | median | p25 | p75 | resolved_rate |
|---|---|---|---|---|---|
| random | 0.342 | 0.378 | 0.340 | 0.380 | 0/36 |
| qwen25-7b-sft-only | 0.379 | 0.380 | 0.378 | 0.380 | 0/36 |
| qwen25-7b-grpo | 0.379 | 0.380 | 0.378 | 0.380 | 0/36 |
| heuristic | 0.704 | 0.705 | 0.703 | 0.705 | 0/36 |
| scripted-optimal | 0.938 | 0.939 | 0.937 | 0.940 | 36/36 |
SFT lifted the model about 11% above random in relative terms (0.379 vs 0.342). GRPO added nothing at K=2 / 40 steps. Both remain below the heuristic plateau at 0.704. The training-time bottleneck is corpus size + step budget, not the env; see README.md §"Training & datasets" for the framing.
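The comparisons in the paragraph above follow directly from the mean column of the table. A short sketch of the arithmetic, using only the published means:

```python
MEANS = {
    "random": 0.342,
    "qwen25-7b-sft-only": 0.379,
    "qwen25-7b-grpo": 0.379,
    "heuristic": 0.704,
    "scripted-optimal": 0.938,
}

# Relative improvement of SFT over the random policy.
sft_lift = (MEANS["qwen25-7b-sft-only"] - MEANS["random"]) / MEANS["random"]
# Absolute change from adding the GRPO stage on top of SFT.
grpo_gain = MEANS["qwen25-7b-grpo"] - MEANS["qwen25-7b-sft-only"]
# Absolute distance still separating the trained model from the heuristic.
heuristic_gap = MEANS["heuristic"] - MEANS["qwen25-7b-grpo"]

print(f"SFT lift over random: {sft_lift:.1%}")
print(f"GRPO gain over SFT: {grpo_gain:+.3f}")
print(f"Gap to heuristic: {heuristic_gap:.3f}")
```

The lift is a relative figure (about 10.8%, rounded to 11%), while the gap to the heuristic is an absolute difference in mean score.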
8. HF Space deployment
The repo is configured as an HF Space (Docker SDK):
# top of README.md: HF Space frontmatter
sdk: docker
app_port: 7860
Dockerfile builds the FastAPI + Gradio app. The Space rebuilds automatically on push to main. To push:
# One-time: add the HF Space as a git remote
git remote add hf https://huggingface.co/spaces/Madhav189/SystemTruth
# Push (HF prompts for token if not cached)
git push hf main
The Space runs on CPU-basic by default; no GPU is required for the Triage env. If the user provides an HF Inference Router model in the UI, calls go to that model; otherwise the run is gated until a token + model are pasted.
9. Coliseum: parallel-rollout pool server
coliseum/ wraps the Triage env in a lease-based HTTP contract so a GRPO trainer can drive 8 concurrent rollouts on a single process via per-lease asyncio.Lock.
# Boot the pool server
uvicorn coliseum.server:app --host 0.0.0.0 --port 8100
# Drive it from a trainer
export COLISEUM_BASE_URL=http://127.0.0.1:8100
Endpoints: /allocate, /reset, /exec_tool, /evaluate, /close, /healthz. The ArenaClient in coliseum/client.py drives them with retry/backoff per route. See coliseum/README.md for the full env-var table.
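Retry with backoff per route can be sketched as a small wrapper around each HTTP call. The code below is a generic illustration, not the ArenaClient implementation: `call_with_backoff`, the exception types retried, and the `ROUTE_ATTEMPTS` budgets are all assumptions, and the real values live in coliseum/client.py.

```python
import random
import time

# Hypothetical per-route retry budgets, keyed by the endpoints listed above.
ROUTE_ATTEMPTS = {
    "/allocate": 5,
    "/reset": 3,
    "/exec_tool": 3,
    "/evaluate": 3,
    "/close": 2,
    "/healthz": 1,
}


def call_with_backoff(operation, attempts=4, base_delay=0.5, max_delay=8.0,
                      retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` is exhausted."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

A caller would look up the budget for the route it is about to hit, for example `call_with_backoff(do_reset, attempts=ROUTE_ATTEMPTS["/reset"])`, where `do_reset` is whatever function performs the request.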
10. Claude Code skill setup
Path A (zero training): the env packages cleanly as a Claude Code skill.
# Install the skill globally (one-time)
ln -s "$PWD/skill" "$HOME/.claude/skills/sre-gym"
# Or run the end-to-end demo
bash demo/run_demo.sh
12 verified-runbook drafts ship in skill/verified-runbooks/ β one per Triage template. The skill validates them by re-running the env after each solve.
11. Troubleshooting
make test fails on import error: usually means pip install -e '.[dev]' skipped a dep. Run pip install pytest pyyaml httpx and retry.
make baseline reports mean > 0.80: the rubric is leaking. The CI invariant test_heuristic_ceiling_is_in_band should have caught this; check unified_incident_env/server/grader.py weights.
uvicorn app:app crashes with ImportError: openenv. Run pip install 'openenv-core>=0.2.1' (quoted so the shell does not treat >= as a redirect; the package name is openenv-core but the import is openenv.core).
Cell 9 of the training notebook errors with 'Qwen2ForCausalLM' object has no attribute 'vllm_engine': Cell 5 didn't pass fast_inference=True when loading the model. The notebook's preflight check now detects this and falls back to the transformers model.generate path automatically. ~3× slower but always works.
Cell 9 errors with EADDRINUSE on port 12345: a previous failed init_process_group left the port bound. Restart the kernel and re-run from Cell 0. The current cell defensively calls dist.destroy_process_group() before any new init.
reward_std = 0 in early GRPO steps: the model emits the same JSON shape on every one of the K rollouts (entropy collapse). Bump temperature=0.9 → 1.1 in cell 9's _build_grpo_args.
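Why reward_std = 0 stalls training: GRPO scores each rollout relative to the other rollouts for the same prompt, so identical rewards give every rollout an advantage of zero. The sketch below shows the group standardisation in plain Python. It assumes the common form (reward minus group mean, divided by group standard deviation plus a small epsilon); the exact epsilon and whether the trainer uses population or sample standard deviation are assumptions.

```python
import statistics


def group_advantages(rewards, eps=1e-4):
    """Standardise rewards within one prompt's group of K rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; the trainer may use sample std
    return [(reward - mean) / (std + eps) for reward in rewards]


collapsed = group_advantages([0.38, 0.38])   # both rollouts scored the same
diverse = group_advantages([0.38, 0.70])     # rollouts disagree

print("collapsed:", collapsed)
print("diverse:", diverse)
```

With K=2 there is only one pair per prompt, so a single repeated output is enough to zero the signal for that prompt; raising the sampling temperature is one way to get the two rollouts to differ.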
12. Submission checklist
- Repo public on GitHub: https://github.com/Madhav-GPT/SystemTruth
- HF Space live: https://huggingface.co/spaces/Madhav189/SystemTruth
- BLOG.md at repo root
- 6 blog assets in docs/blog/
- Training notebook executed end-to-end, results in eval/results/
- README links blog + Space + GitHub + notebook + license
- HF Space README links blog + GitHub + notebook
- openenv validate . green
- Pytest suite green
- Hackathon submission form: paste the HF Space URL as the canonical entry point
13. Operator FAQ
Q: Why does Random outperform the Heuristic on some templates? The heuristic commits to a fixed wrong sequence on a few templates while Random sometimes stumbles into useful evidence-gathering and earns shaped per-tick reward. Documented rather than buried.
Q: Why do all 11 chaos patterns name the failing service in the Operations incident summary? Because the simulator is a fault-injection harness, not a hidden-information puzzle. A real-cluster Operations tier would use raw Loki / Tempo signals; the Python sim doesn't claim to.
Q: Why is supabase_rls_silent_leak approximated by payment_webhook_misconfig + migration_lock + worker_deploy_cascade in the Strategy runner?
Because there's no Supabase-RLS Triage template; the Strategy runner approximates higher-tier scenarios via the closest-shaped Triage templates. Documented as approximation, not fidelity.
Q: Why did GRPO not beat SFT on the 7B run? K=2 rollouts × 40 steps on a 7B model with a 120-episode corpus is too small a budget to break the heuristic plateau at 0.704. The env, the rubric, and Coliseum are sized for a much bigger run; the corpus and step budget are what need to scale next.
14. Materials
- README.md: repo overview, the README judges read first
- BLOG.md: the hackathon blog with the 6 assets in docs/blog/
- openenv.yaml: declares the three tiers, runnable kinds, scenario counts
- docs/: architecture, per-tier deep dives, reward design, scenario authoring
- notebooks/01_triage_train_grpo_qwen25_7b.ipynb: the canonical training notebook
- notebooks/02_triage_eval_compare_all.ipynb: multi-policy eval comparison
- notebooks/03_strategy_blueprint_walkthrough.ipynb: Strategy tier walkthrough
- notebooks/04_operations_demo_chaos.ipynb: Operations tier walkthrough
- coliseum/README.md: parallel-rollout pool server
- skill/SKILL.md: Claude Code skill (Path A)
- eval/results/: eval CSVs + plots
- train/data/: teacher trajectory corpora (120 + extras)