Spaces:

Madhav189
/

SystemTruth

Running

App Files Files Community

SystemTruth / execution.md

Madhav189

SystemTruth rebrand: bigger UI, new diagrams, theme-cohesive HF Space

6583a07 29 days ago

preview code

raw

history blame contribute delete

16.1 kB

Execution runbook — sre-gym

Operator guide. From clone → live env → training → submission. Updated after the hackathon training run; reflects the current state of the codebase.

Current state

Item	Status
Triage env (12 templates × 6 entries = 72 scenarios)	✅ runnable end-to-end
Strategy orchestrator (chains Triage episodes)	✅ runnable as Python orchestrator
Operations graph state-machine simulator (22 nodes, 11 chaos patterns)	✅ runnable in Python
Operations docker-compose stack (`ghcr.io/sre-gym/*` images)	🟡 design-spec — images not published
Strategy 28-action universe (in YAML)	🟡 design-spec — runner uses Triage 11
Gradio UI mounted at `/` of the FastAPI server	✅ live
MCP JSON-RPC 2.0 dual-route at `/mcp`	✅ live + parity-tested
Coliseum parallel-rollout pool server	✅ live
Pytest suite	✅ green at HEAD
`openenv validate .`	✅ green
End-to-end SFT → GRPO run (Qwen2.5-7B)	✅ executed
Eval comparison run (5 policies × 36 episodes)	✅ executed
Trained-model row in baselines table	✅ measured (`mean=0.379` — see §7)

The honest framing: the env is the project, the rubric is the engineering crown jewel, and the training run is below the heuristic plateau because the corpus + step budget that fit inside a hackathon weekend aren't enough to break it. Pretending otherwise is the original sin of every other SRE-agent demo. We don't.

Prerequisites
Local setup
First-run smoke test
Tier-aware operation
Scenario authoring quickstart
Training pipeline (Triage SFT → GRPO)
Eval comparison sweep
HF Space deployment
Coliseum — parallel-rollout pool server
Claude Code skill setup
Troubleshooting
Submission checklist
Operator FAQ
Materials

1. Prerequisites

Local development (env serving + tests):

Python 3.10+ (3.11 / 3.12 / 3.14 verified)
pip 24+ or uv
Git
Docker (only required for HF Space build; not required for normal env serving)
4 GB free RAM, 2 GB free disk

Training (Triage SFT → GRPO on Qwen2.5-7B):

1×A100 80GB (HF Pro Spaces, Colab A100, or rented)
HF account + token (HF_TOKEN) with write scope for adapter push
~$5–8 of HF compute credits for one ~2-3h end-to-end run
Optional: Anthropic / Fireworks / Groq key for richer comparison rows

2. Local setup

git clone https://github.com/Madhav-GPT/SystemTruth.git
cd sre-env

python3 -m venv .venv
source .venv/bin/activate                    # Windows: .venv\Scripts\activate

pip install --upgrade pip
pip install -e '.[dev]'

Verify:

make test                                     # green
python -m openenv.cli validate .              # green

3. First-run smoke test

Boot the combined Gradio + FastAPI server:

uvicorn app:app --host 127.0.0.1 --port 7860

Then in a second shell:

curl -s http://127.0.0.1:7860/health | jq
# {"status": "ok", "environment": "unified_incident_env", ...}

curl -s http://127.0.0.1:7860/tasks | jq '.scenarios | length'
# 72

curl -s http://127.0.0.1:7860/mcp/tools | jq '.tools | length'
# 11

Hit a scenario via /reset + /step:

curl -s -X POST http://127.0.0.1:7860/reset \
  -H 'Content-Type: application/json' \
  -d '{"scenario_id":"memory_leak_oom"}' | jq '.observation.workflow_stage'
# "triage"

curl -s -X POST http://127.0.0.1:7860/step \
  -H 'Content-Type: application/json' \
  -d '{"action":{"action_type":"query_logs","service":"worker"}}' | jq '.observation.tool_output'
# "Worker logs: 'process killed (OOM)' every ~90s..."

Run the scripted-baseline smoke against all 12 templates:

make baseline
# scripted-optimal mean across all 12 templates: ~0.94
# 12 / 12 resolved

4. Tier-aware operation

make tier-info                                # prints per-tier metadata

Programmatic API:

from sre_gym import SREGym, Tier

# Triage — live FastAPI env
env = SREGym(tier=Tier.TRIAGE)
obs = env.reset(scenario_id="memory_leak_oom__p02")
obs = env.step({"action_type": "rollback_deploy", "service": "worker"})
result = env.run("memory_leak_oom__p02", seed=42)

# Strategy — chained Triage episodes with horizon state
env = SREGym(tier=Tier.STRATEGY)
result = env.run("cascading_release_train", seed=1)

# Operations — Python state-machine simulator
env = SREGym(tier=Tier.OPERATIONS)
obs = env.reset(family_id="ecommerce_vibecoded_saas", chaos="rls_silent_leak", seed=1)
obs = env.step({"action_type": "rollback_deploy", "service": "postgres-primary"})

CLI:

python -m sre_gym.strategy list
python -m sre_gym.strategy run cascading_release_train --seed 1

python -m sre_gym.operations list-chaos
python -m sre_gym.operations run ecommerce_vibecoded_saas --chaos rls_silent_leak

5. Scenario authoring quickstart

5.1 Add a 13th Triage template

Append the template dict to EXTRA_TEMPLATES in unified_incident_env/server/basic_templates_extra.py.
Append a baseline-action lambda to extra_baselines().
Append the new RootCauseType value to unified_incident_env/models.py.
Append the template_id to ROUND2_TEMPLATES in tests/test_round2_templates.py.

make test exercises all of the above automatically. Procgen variants generate at module-import time.

5.2 Add a Strategy reference scenario

Drop a new YAML in sre_gym/strategy/scenarios/. Include the DESIGN-SPEC HEADER the existing scenarios carry — call out which subset of allowed_actions: is implemented vs design-spec. The runner falls back to the Triage 11 actions for anything else.

5.3 Add an Operations chaos pattern

Triplet of YAMLs:

sre_gym/operations/families/<id>.yaml — family-level spec
sre_gym/operations/chaos/<id>_chaos_library.yaml — composable chaos patterns
sre_gym/operations/compose/<id>.yaml — docker-compose stack (mark as design-spec if images aren't published)

Then add the chaos descriptors to CHAOS_PATTERN_DEFAULTS in sre_gym/operations/runner.py so the simulator can run them.

See docs/SCENARIO_AUTHORING.md for the full schema.

6. Training pipeline (Triage SFT → GRPO)

6.1 What ships

notebooks/01_triage_train_grpo_qwen25_7b.ipynb is the canonical, end-to-end training notebook. Cells:

#	Cell	What it does
0	Bootstrap	uv + Unsloth pinned-version install. Idempotent.
1	GPU verify	nvidia-smi + torch.cuda.is_available()
2	Build corpus	`train/build_corpus.py` → 120-episode trajectory corpus, 60/20/20 quality split
3	Sanity-check corpus	template coverage, score distribution, tier counts
4	Build SFT dataset	ChatML formatting, 999 step-pairs
5	Load Qwen2.5-7B (4-bit) + LoRA r=32	Unsloth FastLanguageModel
6	SFT cold-start	50 steps × batch 16, lr=5e-5, eval perplexity gate
7	Build GRPO prompts	120 prompts pre-rendered with the same chat template
8	Reward function	composite + first-action bonus + key-alias normalisation
9	GRPO online	40 steps × K=2 rollouts, beta=0.1, temperature=0.9
10	Eval comparison sweep	5 policies × 12 scenarios × 3 seeds
11	Summary table + plots	hero bar chart + per-template chart
12	Push to HF Hub	adapter upload
13	Package artifacts	tar.gz the outputs/ dir

6.2 Run it

In Colab / HF Space (recommended, A100 80GB):

Open the notebook
Set runtime to A100 80GB
Set HF_TOKEN in Colab Secrets (or paste in cell 12 directly)
Run-All — top to bottom, ~2-3h end-to-end

Resume points: if outputs/qwen25_7b_sft_final/ exists, cell 6 skips. If outputs/qwen25_7b_grpo_final/ exists, cell 9 skips. If eval/results/qwen25_7b_comparison_raw.csv exists, cell 10 skips. Delete the artifact to force a re-run.

6.3 Stages (measured)

Stage	Steps	Wall-clock on A100 80GB	Output
Build SFT corpus from teacher trajectories	one-time	~30s	`train/data/seed_v2_120.jsonl` (120 episodes)
SFT cold-start (50 steps × batch 16)	50	~7 min	`outputs/qwen25_7b_sft_final/`
GRPO online (40 steps × K=2)	40	~50 min (transformers fallback) / ~15 min (vLLM path)	`outputs/qwen25_7b_grpo_final/`
Eval sweep (5 policies × 12 × 3)	180 episodes	~25 min	`eval/results/qwen25_7b_comparison_.csv + .png`

7. Eval comparison sweep

The notebook's cell 10 runs the full 5-policy comparison and writes:

eval/results/qwen25_7b_comparison_raw.csv — every per-episode row
eval/results/qwen25_7b_comparison_summary.csv — per-policy aggregates
eval/results/qwen25_7b_comparison_hero.png — single-axis bar chart with whiskers
eval/results/qwen25_7b_comparison_per_template.png — per-template grouped bars

Latest measured numbers

policy	mean	median	p25	p75	resolved_rate
random	0.342	0.378	0.340	0.380	0/36
qwen25-7b-sft-only	0.379	0.380	0.378	0.380	0/36
qwen25-7b-grpo	0.379	0.380	0.378	0.380	0/36
heuristic	0.704	0.705	0.703	0.705	0/36
scripted-optimal	0.938	0.939	0.937	0.940	36/36

SFT lifted the model 11% above random. GRPO added zero on K=2 / 40 steps. Both still below the heuristic plateau at 0.704. The training-time bottleneck is corpus size + step budget, not the env — see README.md §"Training & datasets" for the framing.

8. HF Space deployment

The repo is configured as an HF Space (Docker SDK):

# top of README.md — HF Space frontmatter
sdk: docker
app_port: 7860

Dockerfile builds the FastAPI + Gradio app. The Space rebuilds automatically on push to main. To push:

# One-time: add the HF Space as a git remote
git remote add hf https://huggingface.co/spaces/Madhav189/SystemTruth

# Push (HF prompts for token if not cached)
git push hf main

The Space runs on CPU-basic by default — no GPU required for the Triage env. If the user provides an HF Inference Router model in the UI, calls go to that model; otherwise the run is gated until a token + model are pasted.

9. Coliseum — parallel-rollout pool server

coliseum/ wraps the Triage env in a lease-based HTTP contract so a GRPO trainer can drive 8 concurrent rollouts on a single process via per-lease asyncio.Lock.

# Boot the pool server
uvicorn coliseum.server:app --host 0.0.0.0 --port 8100

# Drive it from a trainer
export COLISEUM_BASE_URL=http://127.0.0.1:8100

Endpoints: /allocate, /reset, /exec_tool, /evaluate, /close, /healthz. The ArenaClient in coliseum/client.py drives them with retry/backoff per route. See coliseum/README.md for the full env-var table.

10. Claude Code skill setup

Path A (zero training): the env packages cleanly as a Claude Code skill.

# Install the skill globally (one-time)
ln -s "$PWD/skill" "$HOME/.claude/skills/sre-gym"

# Or run the end-to-end demo
bash demo/run_demo.sh

12 verified-runbook drafts ship in skill/verified-runbooks/ — one per Triage template. The skill validates them by re-running the env after each solve.

11. Troubleshooting

make test fails on import error — usually means pip install -e '.[dev]' skipped a dep. pip install pytest pyyaml httpx and retry.

make baseline reports mean > 0.80 — the rubric is leaking. The CI invariant test_heuristic_ceiling_is_in_band should have caught this; check unified_incident_env/server/grader.py weights.

uvicorn app:app crashes with ImportError: openenv — pip install openenv-core>=0.2.1 (the package name is openenv-core but the import is openenv.core).

Cell 9 of the training notebook errors with 'Qwen2ForCausalLM' object has no attribute 'vllm_engine' — Cell 5 didn't pass fast_inference=True when loading the model. The notebook's preflight check now detects this and falls back to the transformers model.generate path automatically. ~3× slower but always works.

Cell 9 errors with EADDRINUSE on port 12345 — a previous failed init_process_group left the port bound. Restart the kernel and re-run from Cell 0. The current cell defensively calls dist.destroy_process_group() before any new init.

reward_std = 0 in early GRPO steps — model emits the same JSON shape on every K rollout (entropy collapse). Bump temperature=0.9 → 1.1 in cell 9's _build_grpo_args.

12. Submission checklist

Repo public on GitHub: https://github.com/Madhav-GPT/SystemTruth
HF Space live: https://huggingface.co/spaces/Madhav189/SystemTruth
BLOG.md at repo root
6 blog assets in docs/blog/
Training notebook executed end-to-end, results in eval/results/
README links blog + Space + GitHub + notebook + license
HF Space README links blog + GitHub + notebook
openenv validate . green
Pytest suite green
Hackathon submission form: paste the HF Space URL as the canonical entry point

13. Operator FAQ

Q: Why does Random outperform the Heuristic on some templates? The heuristic commits to a fixed wrong sequence on a few templates while Random sometimes stumbles into useful evidence-gathering and earns shaped per-tick reward. Documented rather than buried.

Q: Why do all 11 chaos patterns name the failing service in the Operations incident summary? Because the simulator is a fault-injection harness, not a hidden-information puzzle. A real-cluster Operations tier would use raw Loki / Tempo signals; the Python sim doesn't claim to.

Q: Why is supabase_rls_silent_leak approximated by payment_webhook_misconfig + migration_lock + worker_deploy_cascade in the Strategy runner? Because there's no Supabase-RLS Triage template; the Strategy runner approximates higher-tier scenarios via the closest-shaped Triage templates. Documented as approximation, not fidelity.

Q: Why did GRPO not beat SFT on the 7B run? K=2 rollouts × 40 steps on a 7B model with a 120-episode corpus is too small a budget to break the heuristic plateau at 0.704. The env, the rubric, and Coliseum are sized for a much bigger run; the corpus and step budget are what need to scale next.

14. Materials

README.md — repo overview, the README judges read first
BLOG.md — the hackathon blog with the 6 assets in docs/blog/
openenv.yaml — declares the three tiers, runnable kinds, scenario counts
docs/ — architecture, per-tier deep dives, reward design, scenario authoring
notebooks/01_triage_train_grpo_qwen25_7b.ipynb — the canonical training notebook
notebooks/02_triage_eval_compare_all.ipynb — multi-policy eval comparison
notebooks/03_strategy_blueprint_walkthrough.ipynb — Strategy tier walkthrough
notebooks/04_operations_demo_chaos.ipynb — Operations tier walkthrough
coliseum/README.md — parallel-rollout pool server
skill/SKILL.md — Claude Code skill (Path A)
eval/results/ — eval CSVs + plots
train/data/ — teacher trajectory corpora (120 + extras)