title: SystemTruth
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
SystemTruth — a tier-escalating SRE training environment
Hackathon submission — OpenEnv-class, India 2026
- 📖 Blog: BLOG.md
- 🚀 Live HF Space: https://huggingface.co/spaces/Madhav189/SystemTruth
- 💻 GitHub: https://github.com/Madhav-GPT/SystemTruth
- 🧪 Training notebook (Colab): same as notebooks/01_triage_train_grpo_qwen25_7b.ipynb
- 📊 Eval results: eval/results/
- 📜 License: Apache 2.0
Each tier escalates a different dimension. Triage escalates compute, Strategy escalates horizon, Operations escalates realism. That single sentence is the load-bearing claim of the project.
What's in the box (the USP — read this first)
SystemTruth is one runnable RL environment with three personas baked into it. The same 11-action contract, the same 5-component reward rubric, the same termination shape — escalated along three orthogonal axes that map to the three real bottlenecks SRE-agent training loops actually hit.
One environment, three tiers, three different bottlenecks
| Tier | Bottleneck | Persona | What it teaches |
|---|---|---|---|
| Triage | Compute | ML student / Kaggle, $30 of HF credits | causal mapping under tight context — pre-digested observations, dense reward shaping, 8K context, 11-action space, 8–13 ticks per episode |
| Strategy | Horizon | Seed-stage startup, $300–500 budget | long-horizon planning across chained incidents — multi-incident chains with persistent state, unresolved alerts and pending deploys carry forward, 60–90 ticks |
| Operations | Realism | Enterprise SRE platform, 8×A100/H100 cluster | authentic tool use against irreversible actions — 22-node service graph, 11 chaos patterns pinned to real production post-mortems, 110–180+ actions per episode |
The escalation axis is the entire pitch. Most RL environments stratify by difficulty (more scenarios, longer episodes, harder rewards). SystemTruth stratifies by the dimension that actually limits the training loop for that persona:
- A junior on-call learning to triage faces a different problem (cognitive efficiency under tight context) than a senior SRE running a multi-incident postmortem (state tracking across long horizons), which faces a different problem from an enterprise platform team operating against an actively chaos-engineered cluster (irreversible actions, partial observability, real wall-clock).
- Their training signals, episode shapes, observation richness, and reward structures should not look the same.
- SystemTruth takes that observation seriously and stratifies its tiers along the dimension that actually limits the persona's training loop.
The shared 11-action contract
Every tier — Triage, Strategy, Operations — speaks the same eleven Pydantic-validated actions. One contract, three escalation envelopes:
query_logs(service) query_metrics(service, metric)
query_dependencies(service) query_deploys(service)
rollback_deploy(service) restart_service(service)
isolate_service(service) run_check(check_name)
submit_hypothesis(hypothesis) escalate
declare_resolved
A successful episode is gather evidence → submit_hypothesis → rollback_deploy → restart_service → both run_check pass → declare_resolved. Wrong rollback target, premature restart, or premature resolution all return negative reward and a typed failure_type. The contract refuses to be gamed.
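A minimal sketch of what validating the eleven-action contract could look like, using only the standard library. The real project validates with Pydantic; `ActionType`, `REQUIRED_ARGS`, and `validate_action` are hypothetical names for illustration, and the required fields are read off the signature listing above.

```python
from enum import Enum


class ActionType(str, Enum):
    """The eleven actions named in the shared contract."""
    QUERY_LOGS = "query_logs"
    QUERY_METRICS = "query_metrics"
    QUERY_DEPENDENCIES = "query_dependencies"
    QUERY_DEPLOYS = "query_deploys"
    ROLLBACK_DEPLOY = "rollback_deploy"
    RESTART_SERVICE = "restart_service"
    ISOLATE_SERVICE = "isolate_service"
    RUN_CHECK = "run_check"
    SUBMIT_HYPOTHESIS = "submit_hypothesis"
    ESCALATE = "escalate"
    DECLARE_RESOLVED = "declare_resolved"


# Required fields per action, taken from the signatures listed above.
REQUIRED_ARGS = {
    ActionType.QUERY_LOGS: ("service",),
    ActionType.QUERY_METRICS: ("service", "metric"),
    ActionType.QUERY_DEPENDENCIES: ("service",),
    ActionType.QUERY_DEPLOYS: ("service",),
    ActionType.ROLLBACK_DEPLOY: ("service",),
    ActionType.RESTART_SERVICE: ("service",),
    ActionType.ISOLATE_SERVICE: ("service",),
    ActionType.RUN_CHECK: ("check_name",),
    ActionType.SUBMIT_HYPOTHESIS: ("hypothesis",),
    ActionType.ESCALATE: (),
    ActionType.DECLARE_RESOLVED: (),
}


def validate_action(action: dict) -> list:
    """Return a list of problems; an empty list means the action is well-formed."""
    try:
        kind = ActionType(action.get("action_type"))
    except ValueError:
        return [f"unknown action_type: {action.get('action_type')!r}"]
    return [
        f"{kind.value} is missing required field {arg!r}"
        for arg in REQUIRED_ARGS[kind]
        if arg not in action
    ]


print(validate_action({"action_type": "rollback_deploy", "service": "worker"}))
print(validate_action({"action_type": "query_metrics", "service": "worker"}))
```

The first call returns no problems; the second reports the missing `metric` field. This only checks shape. Ordering guards such as premature restart or premature resolution live in the environment, not in the schema.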
The episode lifecycle, illustrated
The lifecycle below is the Triage tier in detail; Strategy chains N of them with horizon-decay, Operations runs one of them inside a graph-mutation simulator. The shape is shared across all three tiers — the simulator under it is what changes.
Eleven numbered stages, each producing a measurable signal:
1. reset(scenario_id): env emits the initial observation: tick counter, workflow stage, incident summary, active alerts, noise alerts (decoys), service health (cpu/mem/err/latency), user impact, SLO burn rate, checks, allowed actions.
2. Evidence gathering loop: agent calls query_logs / query_metrics / query_dependencies / query_deploys. After every step the env computes a per-tick shaped reward as a potential difference (Δ critical_service_health × 0.55 + Δ (1 − user_impact) × 0.20 + Δ (1 − slo_burn_rate) × 0.15 + containment_applied × 0.10) minus step_cost, plus bonus, minus penalty.
3. submit_hypothesis(root_cause, affected_services, confidence, recommended_next_action): the world-modelling primitive. Confidence is a float ∈ [0,1] the agent must commit to.
4. Hypothesis correctness check: if the root cause matches truth, the agent gets an in-episode bonus up to ~0.12 (idempotent: a second identical hypothesis scores 0). If wrong, the agent loops back to investigation with a new observation.
5. rollback_deploy(service): the irreversible action. Wrong target = unsafe_action_penalty (0.08 medium / 0.12 hard). Correct target sets cause_removed = True and unblocks restart.
6. restart_service(service): only valid if the scenario requires it. Guard: if the cause is not removed, the premature-restart penalty fires and state re-inherits the bad config.
7. run_check("end_to_end" | "database_recovery"): verification gate. If checks fail, the agent loops back to investigation.
8. declare_resolved: terminal action. Guard: if checks have not passed, premature_resolution_penalty (0.20 / 0.30) fires.
9. Episode terminates: terminal state emitted.
10. Compute composite from terminal state: the 5-component rubric below evaluates outcome / action_validity / format / anticheat / efficiency; the component weights sum to 1.0 and the weighted sum is clamped to [0.01, 0.99].
11. Reference scores anchor the rubric: random 0.417 (0/36 resolved), naive heuristic 0.749 (0/12 resolved), scripted-optimal 0.938 (12/12 resolved). The 0.20 gap from 0.80 → 1.00 is what GRPO trains into.
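The per-tick shaped reward from the evidence-gathering stage can be sketched as below. This is an illustration under assumptions, not the env's code: `Snapshot` and `shaped_reward` are hypothetical names, all three health signals are assumed to be normalised to [0, 1], and the containment term is applied as a flag on the post-step state because the formula writes it without a Δ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical slice of env state used by the shaping term."""
    critical_service_health: float  # assumed 0..1, higher is better
    user_impact: float              # assumed 0..1, lower is better
    slo_burn_rate: float            # assumed 0..1, lower is better
    containment_applied: bool


def shaped_reward(before: Snapshot, after: Snapshot,
                  step_cost: float = 0.0, bonus: float = 0.0,
                  penalty: float = 0.0) -> float:
    """Weighted state improvement minus step cost, plus bonus, minus penalty."""
    delta_health = after.critical_service_health - before.critical_service_health
    delta_impact = (1 - after.user_impact) - (1 - before.user_impact)
    delta_burn = (1 - after.slo_burn_rate) - (1 - before.slo_burn_rate)
    progress = (
        0.55 * delta_health
        + 0.20 * delta_impact
        + 0.15 * delta_burn
        + 0.10 * float(after.containment_applied)  # assumption: flag, not a delta
    )
    return progress - step_cost + bonus - penalty


before = Snapshot(0.4, 0.6, 0.5, False)
after = Snapshot(0.8, 0.3, 0.2, False)
print(round(shaped_reward(before, after, step_cost=0.01), 3))
```

A step that changes nothing earns exactly `-step_cost`, which is what makes idle querying slightly costly.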
Cross-tier extension:
- Strategy chains N Triage episodes, applies a horizon_decay_factor × mean(per-phase composite) to the final reward.
- Operations runs the same lifecycle inside a graph-mutation simulator over a 22-node service topology, same rubric, same horizon-decay weighting.
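The Strategy aggregation above reduces to one line of arithmetic. In this sketch `strategy_reward` is a hypothetical name, and the decay factor is taken as an input because the text does not say how it is derived.

```python
from statistics import fmean


def strategy_reward(phase_composites, horizon_decay_factor):
    """horizon_decay_factor x mean(per-phase composite), as stated above."""
    if not phase_composites:
        raise ValueError("a Strategy episode needs at least one phase")
    return horizon_decay_factor * fmean(phase_composites)


# Illustrative inputs only: three chained phases and an assumed decay of 0.95.
print(round(strategy_reward([0.9, 0.7, 0.8], 0.95), 3))
```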
Reward rubric — the engineering crown jewel
final_reward = 0.45·outcome
+ 0.20·action_validity
+ 0.15·anticheat
+ 0.10·format
+ 0.10·efficiency
────
composite ∈ [0.01, 0.99] (clamped, public score)
outcome root-cause action correct + recovery confirmed
action_validity fraction of step() actions that are well-formed Pydantic
anticheat declare_resolved blocked unless ≥1 query already ran
format submit_hypothesis was called before declare_resolved
efficiency exp(-current_tick / optimal_ticks_for_template)
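The weighted sum and clamp can be sketched as follows. The weights and the efficiency formula come from the block above; `WEIGHTS`, `efficiency`, and `composite` are hypothetical names, and each component is assumed to be a score in [0, 1].

```python
import math

# Component weights copied from the rubric above; they sum to 1.0.
WEIGHTS = {
    "outcome": 0.45,
    "action_validity": 0.20,
    "anticheat": 0.15,
    "format": 0.10,
    "efficiency": 0.10,
}


def efficiency(current_tick: int, optimal_ticks: int) -> float:
    """exp(-current_tick / optimal_ticks_for_template)."""
    return math.exp(-current_tick / optimal_ticks)


def composite(components: dict) -> float:
    """Weighted sum of the five components, clamped to [0.01, 0.99]."""
    raw = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return min(0.99, max(0.01, raw))


# An episode that does everything right except remediation: outcome stays 0.
no_remediation = {
    "outcome": 0.0,
    "action_validity": 1.0,
    "anticheat": 1.0,
    "format": 1.0,
    "efficiency": efficiency(current_tick=10, optimal_ticks=10),
}
print(round(composite(no_remediation), 3))
```

Because `outcome` carries 0.45 of the weight, a policy that never remediates cannot score above 0.55 under this sketch, however clean its formatting.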
Each component defends against a specific cheat:
| Cheat strategy | Blocked by |
|---|---|
| declare_resolved before any query | anticheat (0.15) |
| Skip submit_hypothesis to save a tick | format (0.10) |
| Spam hypotheses to fish for partial credit | hypothesis idempotence + action_validity |
| Send malformed actions | action_validity (0.20) |
| Resolve before checks pass | outcome (0.45) + premature_resolution step penalty |
The cleverest piece is the calibration term inside submit_hypothesis:
hypothesis_reward = 0.04·cause_match
+ 0.03·service_match
+ 0.03·action_quality
+ 0.02·calibration
calibration awards
+1.0 confident-correct
+0.5 hedged-correct
-0.2 hedged-wrong
-1.0 confident-wrong
The confidence ∈ [0,1] field is part of the structured HypothesisPayload Pydantic model the agent emits; the calibration sub-term reads it directly. A model that bluffs high confidence on a wrong root cause is worse than one that hedges. That's the world-modelling primitive — the env grades the agent's belief, not just its prediction.
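A sketch of how the calibration sub-term could turn confidence into reward. The four award values and the four weights are from the text above. Everything else is an assumption: the 0.7 cut-off between "confident" and "hedged" is hypothetical, the match scores are assumed to lie in [0, 1], and correctness is assumed to be decided by the root-cause match alone.

```python
def calibration(correct: bool, confidence: float,
                confident_threshold: float = 0.7) -> float:
    """Map (correctness, confidence) to the four calibration awards.

    confident_threshold is a hypothetical cut-off; the source does not state one.
    """
    confident = confidence >= confident_threshold
    if correct:
        return 1.0 if confident else 0.5
    return -1.0 if confident else -0.2


def hypothesis_reward(cause_match: float, service_match: float,
                      action_quality: float, confidence: float) -> float:
    """0.04*cause + 0.03*service + 0.03*action + 0.02*calibration."""
    correct = cause_match >= 1.0  # assumption: correctness keyed off root cause
    return (
        0.04 * cause_match
        + 0.03 * service_match
        + 0.03 * action_quality
        + 0.02 * calibration(correct, confidence)
    )


print(round(hypothesis_reward(1.0, 1.0, 1.0, confidence=0.9), 3))  # confident-correct
print(round(hypothesis_reward(0.0, 0.0, 0.0, confidence=0.9), 3))  # confident-wrong
print(round(hypothesis_reward(0.0, 0.0, 0.0, confidence=0.3), 3))  # hedged-wrong
```

The confident-correct case reaches the 0.12 maximum quoted for the in-episode bonus, and a confident-wrong hypothesis scores strictly below a hedged-wrong one.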
Two CI invariants pin the rubric in place on every commit:
- Heuristic ceiling [0.65, 0.80]: test_heuristic_ceiling_is_in_band enforces this band on every template. The 0.20 gap from 0.80 → 1.00 is the GRPO training target.
- Scripted-expert floor ≥0.90: test_round2_baseline_resolves enforces ≥0.90 on the round-2 templates.
Tighten either side and the gradient signal collapses; loosen either side and a memorising model can game the rubric. The band is the load-bearing engineering claim.
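The two invariants amount to a band check and a floor check. The sketch below is not the repository's test code; `calibration_violations` is a hypothetical helper, and the example inputs reuse the held-out means reported in this README rather than per-template scores.

```python
HEURISTIC_BAND = (0.65, 0.80)  # naive heuristic must land inside this band
EXPERT_FLOOR = 0.90            # scripted-optimal must score at least this


def calibration_violations(heuristic_scores: dict, expert_scores: dict) -> list:
    """Return human-readable violations; an empty list means the rubric is in band."""
    low, high = HEURISTIC_BAND
    violations = []
    for template, score in heuristic_scores.items():
        if not low <= score <= high:
            violations.append(f"heuristic out of band on {template}: {score:.3f}")
    for template, score in expert_scores.items():
        if score < EXPERT_FLOOR:
            violations.append(f"expert below floor on {template}: {score:.3f}")
    return violations


# Illustrative inputs: aggregate means from this README, keyed by a placeholder name.
print(calibration_violations({"held_out_mean": 0.704}, {"held_out_mean": 0.938}))
```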
Coliseum — parallel-rollout pool server
coliseum/ wraps the Triage env in a lease-based HTTP contract so a GRPO trainer's parallel-rollout side can drive the env without holding an in-process Python instance per worker:
allocate(task_key) -> {ok: true, lease_id}
reset(lease_id, task_meta, run_ctx) -> {ok: true, observation}
exec_tool(lease_id, tool_call) -> {ok: true, observation}
evaluate(lease_id) -> {ok: true, score}
close(lease_id) -> {ok: true}
8-way concurrent rollouts on a single process via per-lease asyncio.Lock; a background reaper evicts idle leases after COLISEUM_LEASE_TTL_S so a crashed worker doesn't leak env instances.
# Boot the pool server
uvicorn coliseum.server:app --host 0.0.0.0 --port 8100
# Drive it from a trainer
export COLISEUM_BASE_URL=http://127.0.0.1:8100
Standard lease-pool pattern — see coliseum/README.md for the full env-var table.
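For readers unfamiliar with the pattern, here is a minimal in-process sketch of a lease pool with a per-lease `asyncio.Lock` and idle eviction. It is not Coliseum's implementation: `LeasePool` and its methods are hypothetical, the HTTP layer is omitted, and the reaper is shown as a function you would call periodically rather than as a background task.

```python
import asyncio
import time
import uuid


class LeasePool:
    """Hypothetical lease pool: one env instance and one lock per lease."""

    def __init__(self, env_factory, ttl_s: float = 300.0):
        self._env_factory = env_factory
        self._ttl_s = ttl_s
        self._leases = {}  # lease_id -> {"env", "lock", "last_used"}

    async def allocate(self) -> str:
        lease_id = uuid.uuid4().hex
        self._leases[lease_id] = {
            "env": self._env_factory(),
            "lock": asyncio.Lock(),  # created inside the running event loop
            "last_used": time.monotonic(),
        }
        return lease_id

    async def exec_tool(self, lease_id: str, tool_call: dict):
        lease = self._leases[lease_id]
        # The per-lease lock serialises calls on one env while other leases proceed.
        async with lease["lock"]:
            lease["last_used"] = time.monotonic()
            return lease["env"].step(tool_call)

    def close(self, lease_id: str) -> None:
        self._leases.pop(lease_id, None)

    def reap_idle(self, now: float = None) -> list:
        """Evict leases idle for longer than ttl_s; return the evicted ids."""
        now = time.monotonic() if now is None else now
        stale = [
            lease_id
            for lease_id, lease in self._leases.items()
            if now - lease["last_used"] > self._ttl_s and not lease["lock"].locked()
        ]
        for lease_id in stale:
            self.close(lease_id)
        return stale
```

Locking per lease rather than per pool is what lets several rollouts share one process: two workers only contend if they hold the same lease. The `env_factory` argument is any zero-argument callable returning an object with a `step(tool_call)` method.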
Training & datasets — the honest weak point
The environment is the project. Training scripts orbit around it. We ran a real end-to-end Triage run on the day of the deadline — pipeline works, results are below the heuristic plateau, the gap is real, and we're saying so.
What we ran
Pipeline lives in notebooks/01_triage_train_grpo_qwen25_7b.ipynb, which can also be opened in Colab via the badge. Target: A100 80GB, ~2-3h end-to-end:
- SFT cold-start: Unsloth + Qwen2.5-7B-Instruct (4-bit) + LoRA r=32, 50 steps × batch 16 on a 999-example diverse trajectory corpus. Eval perplexity 1.755 (healthy [1.5, 3.0] band). Saved to outputs/qwen25_7b_sft_final/.
- GRPO online: TRL's GRPOTrainer, 40 steps × K=2 rollouts. Reward = composite + scenario-aware first-action bonus. Saved to outputs/qwen25_7b_grpo_final/.
- Held-out eval: 12 __p05 scenarios × 3 seeds = 36 episodes per policy, 5 policies compared.
What it produced
| policy | mean | median | p25 | p75 | resolved_rate |
|---|---|---|---|---|---|
| random | 0.342 | 0.378 | 0.340 | 0.380 | 0/36 |
| qwen25-7b-sft-only | 0.379 | 0.380 | 0.378 | 0.380 | 0/36 |
| qwen25-7b-grpo | 0.379 | 0.380 | 0.378 | 0.380 | 0/36 |
| heuristic (queries + correct hypothesis, no remediation) | 0.704 | 0.705 | 0.703 | 0.705 | 0/36 |
| scripted-optimal | 0.938 | 0.939 | 0.937 | 0.940 | 36/36 |
Honest reading:
- SFT lifted the model 11% above random (0.342 → 0.379). Format-learning worked: the trained model produces 100% schema-valid action_type JSON.
- GRPO did not move the mean further on K=2 / 40 steps / 7B. The advantage signal exists but the budget is too small to reach the heuristic plateau at 0.704.
- The env's rubric refuses to give partial credit for "looks right" output. A model that emits a polished hypothesis but never calls rollback_deploy has done zero remediation and is scored that way: the 0.45 outcome slice stays empty. The numbers don't lie about the work the model didn't do, and that's exactly the rubric's job.
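The headline figures in this reading can be re-derived from the results table. The snippet below only does that arithmetic; the variable names are illustrative.

```python
# Means copied from the results table above.
random_mean = 0.342
sft_mean = 0.379
heuristic_mean = 0.704

relative_lift = (sft_mean - random_mean) / random_mean  # SFT vs random
gap_to_heuristic = heuristic_mean - sft_mean            # distance still to close
episodes_per_policy = 12 * 3                            # scenarios x seeds

print(f"SFT lift over random: {relative_lift:.1%}")
print(f"Gap to heuristic plateau: {gap_to_heuristic:.3f}")
print(f"Episodes per policy: {episodes_per_policy}")
```

The lift works out to roughly 10.8%, which the text rounds to 11%, and the trained policies sit about 0.325 below the heuristic mean.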
What's bottlenecking us
The training-time weak spot is data scale and step budget, not the env. The 120-episode corpus + 50-step SFT + 40-step GRPO is what fits inside a hackathon weekend on one A100. The env, the rubric, and Coliseum are all sized for the real run; the trajectory corpus and the GRPO budget are not. Scaling either side (more teacher trajectories, more GRPO steps with K=4 rollouts, longer eval window) is the obvious next move and the bottleneck stops being the env.
The training scripts in train/ are working as written for the dataset we have — build_corpus.py produces a clean 60/20/20 quality-stratified split, eval_sweep.py drives the held-out comparison, run_expert_collection.py harvests teacher trajectories. They don't need rewriting; they need more data flowing through them.
Quickstart
5-minute local demo (no API keys, no server, no GPU)
pip install -e .
ollama pull llama3.2
python -m sre_gym.local triage worker_deploy_cascade
The CLI drives UnifiedIncidentEnvironment directly and prints per-tick reward, the 5-component score breakdown, and a final summary.
Live HF Space (Triage tier, hosted)
Open https://huggingface.co/spaces/Madhav189/SystemTruth. Pick a tier, paste an HF token, click ▶ run eval. Each tick streams the action, env response, reward delta, and the 5-component breakdown.
Local server + Gradio UI
make install
make dev # FastAPI + Gradio on :7860
python -m sre_gym.strategy run cascading_release_train
python -m sre_gym.operations run ecommerce_vibecoded_saas --chaos rls_silent_leak
The FastAPI server speaks the OpenEnv contract (/reset /step /state /tasks /baseline /grader /status /health /metadata /schema) plus an MCP JSON-RPC route at /mcp.
Two-paths agent design
The repo ships two independent paths to a working SRE agent. They share the env contract but trade compute for capability differently.
Path A — verified-runbook skill (zero training)
skill/ packages the env as a Claude Code skill. The agent reads scenario evidence, writes a verified runbook on a clean solve, and reads the runbook on the next attempt. No training, no GPU.
ln -s "$PWD/skill" "$HOME/.claude/skills/sre-gym"
bash demo/run_demo.sh # end-to-end demo
Path B — GRPO-trained adapter (one A100, ~2–3h on 7B)
The training pipeline above. Path A is what an agent ships today; Path B is what raises the floor on the templates it sees over and over.
Tier-aware Python API
from sre_gym import SREGym, Tier
# Triage — per-step (live FastAPI) or end-to-end
env = SREGym(tier=Tier.TRIAGE)
obs = env.reset(scenario_id="memory_leak_oom__p02")
obs = env.step({"action_type": "rollback_deploy", "service": "worker"})
result = env.run("memory_leak_oom__p02", seed=42)
# Strategy — episodic only (chained Triage episodes)
env = SREGym(tier=Tier.STRATEGY)
result = env.run("cascading_release_train", seed=1)
# Operations — per-step (graph mutations) or end-to-end (state-machine simulator)
env = SREGym(tier=Tier.OPERATIONS)
obs = env.reset(family_id="ecommerce_vibecoded_saas", chaos="rls_silent_leak", seed=1)
obs = env.step({"action_type": "rollback_deploy", "service": "postgres-primary"})
Old tier names (Tier.BASIC, Tier.ADVANCED, Tier.MAX) are preserved as Enum aliases so existing callers keep working; importing them emits a DeprecationWarning.
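One way to keep old names working is Python's built-in Enum aliasing, where a second name with the same value resolves to the same member. The sketch below is an assumption about the mechanism, not the package's code: the string values, the old-to-new mapping (BASIC to TRIAGE, ADVANCED to STRATEGY, MAX to OPERATIONS), and the `tier_from_name` helper are all hypothetical, and the warning is raised on name lookup here rather than on import.

```python
import warnings
from enum import Enum


class Tier(Enum):
    TRIAGE = "triage"
    STRATEGY = "strategy"
    OPERATIONS = "operations"
    # Aliases: a repeated value makes the name point at the existing member.
    BASIC = "triage"
    ADVANCED = "strategy"
    MAX = "operations"


_LEGACY_NAMES = {"BASIC", "ADVANCED", "MAX"}


def tier_from_name(name: str) -> Tier:
    """Look up a tier by name, warning when a legacy alias is used."""
    tier = Tier[name]
    if name in _LEGACY_NAMES:
        warnings.warn(
            f"Tier.{name} is deprecated; use Tier.{tier.name}",
            DeprecationWarning,
            stacklevel=2,
        )
    return tier


print(Tier.BASIC is Tier.TRIAGE)
print(tier_from_name("STRATEGY"))
```

Because aliases are the same object as the canonical member, existing comparisons such as `tier is Tier.BASIC` keep working without any migration.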
Tests + lint
make test # green at HEAD
ruff check .
openenv validate . # green
The two CI invariants that keep the rubric calibrated:
- test_heuristic_ceiling_is_in_band: naive heuristic in [0.65, 0.80] on every template.
- test_round2_baseline_resolves: scripted-optimal ≥ 0.90 on the round-2 templates.
Materials
- BLOG.md: the hackathon blog (with all 6 assets in docs/blog/)
- openenv.yaml: declares the three tiers, runnable kinds, scenario counts
- docs/: architecture, per-tier deep dives, reward design, scenario authoring
- docs/blog/: visuals (lifecycle, architecture, hero, topology, rubric donut, chaos timeline, two-paths, baselines bar)
- skill/: Claude Code skill packaging (Path A)
- coliseum/: parallel-rollout pool server
- demo/: run_demo.sh end-to-end demo, pitch.md narrative
- eval/: held-out split definition, results directory with the latest eval CSV + plots
- train/data/: teacher trajectories (Claude Opus + Llama-3.3-70B + scripted baselines + 120-episode v2 corpus)
- notebooks/: Triage SFT→GRPO training (01_triage_train_grpo_qwen25_7b.ipynb), eval comparison (02_triage_eval_compare_all.ipynb), Strategy + Operations walkthroughs
License
Apache 2.0. Built for the OpenEnv-class hackathon, India 2026 — by the Madhav-GPT / dakshdoesdev team.



