Spaces:

Madhav189
/

SystemTruth

Running

App Files Files Community

SystemTruth / README.md

Madhav189

SystemTruth rebrand: bigger UI, new diagrams, theme-cohesive HF Space

6583a07 about 1 month ago

preview code

raw

history blame contribute delete

17.9 kB

	---
	title: SystemTruth
	emoji: 🚨
	colorFrom: red
	colorTo: yellow
	sdk: docker
	app_port: 7860
	pinned: false
	license: apache-2.0
	---

	# SystemTruth — a tier-escalating SRE training environment

	> Hackathon submission — OpenEnv-class, India 2026
	>
	> - 📖 Blog: [BLOG.md](BLOG.md)
	> - 🚀 Live HF Space: https://huggingface.co/spaces/Madhav189/SystemTruth
	> - 💻 GitHub: https://github.com/Madhav-GPT/SystemTruth
	> - 🧪 Training notebook (Colab): [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Re9pwkabEP4Cearc2hCMMqGdEjSSUjGu?usp=sharing) — same as [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb)
	> - 📊 Eval results: [`eval/results/`](eval/results/)
	> - 📜 License: Apache 2.0

	Each tier escalates a different dimension. Triage escalates compute, Strategy escalates horizon, Operations escalates realism. That single sentence is the load-bearing claim of the project.

	---

	## What's in the box (the USP — read this first)

	SystemTruth is one runnable RL environment with three personas baked into it. The same 11-action contract, the same 5-component reward rubric, the same termination shape — escalated along three orthogonal axes that map to the three real bottlenecks SRE-agent training loops actually hit.

	![SystemTruth architecture — three tiers under one shared 11-action interface + 5-component rubric](docs/blog/system_architecture.png)

	### One environment, three tiers, three different bottlenecks

	\| Tier \| Bottleneck \| Persona \| What it teaches \|
	\|---\|---\|---\|---\|
	\| Triage \| Compute \| ML student / Kaggle, $30 of HF credits \| causal mapping under tight context — pre-digested observations, dense reward shaping, 8K context, 11-action space, 8–13 ticks per episode \|
	\| Strategy \| Horizon \| Seed-stage startup, $300–500 budget \| long-horizon planning across chained incidents — multi-incident chains with persistent state, unresolved alerts and pending deploys carry forward, 60–90 ticks \|
	\| Operations \| Realism \| Enterprise SRE platform, 8×A100/H100 cluster \| authentic tool use against irreversible actions — 22-node service graph, 11 chaos patterns pinned to real production post-mortems, 110–180+ actions per episode \|

	The escalation axis is the entire pitch. Most RL environments stratify by difficulty (more scenarios, longer episodes, harder rewards). SystemTruth stratifies by the dimension that actually limits the training loop for that persona:

	- A junior on-call learning to triage faces a different problem (cognitive efficiency under tight context) than a senior SRE running a multi-incident postmortem (state tracking across long horizons), which faces a different problem from an enterprise platform team operating against an actively chaos-engineered cluster (irreversible actions, partial observability, real wall-clock).
	- Their training signals, episode shapes, observation richness, and reward structures should not look the same.
	- SystemTruth takes that observation seriously and stratifies its tiers along the dimension that actually limits the persona's training loop.

	### The shared 11-action contract

	Every tier — Triage, Strategy, Operations — speaks the same eleven Pydantic-validated actions. One contract, three escalation envelopes:

	```
	query_logs(service) query_metrics(service, metric)
	query_dependencies(service) query_deploys(service)
	rollback_deploy(service) restart_service(service)
	isolate_service(service) run_check(check_name)
	submit_hypothesis(hypothesis) escalate
	declare_resolved
	```

	A successful episode is `gather evidence → submit_hypothesis → rollback_deploy → restart_service → both run_check pass → declare_resolved`. Wrong rollback target, premature restart, or premature resolution all return negative reward and a typed `failure_type`. The contract refuses to be gamed.

	### The episode lifecycle, illustrated

	The lifecycle below is the Triage tier in detail; Strategy chains N of them with horizon-decay, Operations runs one of them inside a graph-mutation simulator. The shape is shared across all three tiers — the simulator under it is what changes.

	![SystemTruth episode lifecycle — Triage tier, same shape inherited by Strategy and Operations](docs/blog/episode_lifecycle.png)

	Eleven numbered stages, each producing a measurable signal:

	1. `reset(scenario_id)` — env emits the initial observation: tick counter, workflow stage, incident summary, active alerts, noise alerts (decoys), service health (cpu/mem/err/latency), user impact, SLO burn rate, checks, allowed actions.
	2. Evidence gathering loop — agent calls `query_logs / query_metrics / query_dependencies / query_deploys`. After every step the env computes a per-tick shaped reward as a potential difference (`Δ critical_service_health × 0.55 + Δ (1 − user_impact) × 0.20 + Δ (1 − slo_burn_rate) × 0.15 + containment_applied × 0.10`) minus `step_cost`, plus `bonus`, minus `penalty`.
	3. `submit_hypothesis(root_cause, affected_services, confidence, recommended_next_action)` — the world-modelling primitive. Confidence is a `float ∈ [0,1]` the agent must commit to.
	4. Hypothesis correctness check — if the root cause matches truth, the agent gets an in-episode bonus up to ~0.12 (idempotent — second identical hypothesis scores 0). If wrong, the agent loops back to investigation with a new observation.
	5. `rollback_deploy(service)` — the irreversible action. Wrong target = `unsafe_action_penalty` (0.08 medium / 0.12 hard). Correct target sets `cause_removed = True` and unblocks restart.
	6. `restart_service(service)` — only valid if scenario requires it. Guard: if cause not removed, premature-restart penalty fires and state re-inherits the bad config.
	7. `run_check("end_to_end" \| "database_recovery")` — verification gate. If checks fail, agent loops back to investigation.
	8. `declare_resolved` — terminal action. Guard: if checks not passed, `premature_resolution_penalty` (0.20 / 0.30) fires.
	9. Episode terminates — terminal state emitted.
	10. Compute composite from terminal state — the 5-component rubric below evaluates outcome / action_validity / format / anticheat / efficiency, sums to 1.0 with weighted clamping to `[0.01, 0.99]`.
	11. Reference scores anchor the rubric — random `0.417` (0/36 resolved), naive heuristic `0.749` (0/12 resolved), scripted-optimal `0.938` (12/12 resolved). The 0.20 gap from `0.80 → 1.00` is what GRPO trains into.

	Cross-tier extension:
	- Strategy chains N Triage episodes, applies a `horizon_decay_factor × mean(per-phase composite)` to the final reward.
	- Operations runs the same lifecycle inside a graph-mutation simulator over a 22-node service topology, same rubric, same horizon-decay weighting.

	### Reward rubric — the engineering crown jewel

	```
	final_reward = 0.45·outcome
	+ 0.20·action_validity
	+ 0.15·anticheat
	+ 0.10·format
	+ 0.10·efficiency
	────
	composite ∈ [0.01, 0.99] (clamped, public score)

	outcome root-cause action correct + recovery confirmed
	action_validity fraction of step() actions that are well-formed Pydantic
	anticheat declare_resolved blocked unless ≥1 query already ran
	format submit_hypothesis was called before declare_resolved
	efficiency exp(-current_tick / optimal_ticks_for_template)
	```

	Each component defends against a specific cheat:

	\| Cheat strategy \| Blocked by \|
	\|---\|---\|
	\| `declare_resolved` before any query \| `anticheat` (0.15) \|
	\| Skip `submit_hypothesis` to save a tick \| `format` (0.10) \|
	\| Spam hypotheses to fish for partial credit \| hypothesis idempotence + `action_validity` \|
	\| Send malformed actions \| `action_validity` (0.20) \|
	\| Resolve before checks pass \| `outcome` (0.45) + `premature_resolution` step penalty \|

	The cleverest piece is the calibration term inside `submit_hypothesis`:

	```
	hypothesis_reward = 0.04·cause_match
	+ 0.03·service_match
	+ 0.03·action_quality
	+ 0.02·calibration

	calibration awards
	+1.0 confident-correct
	+0.5 hedged-correct
	-0.2 hedged-wrong
	-1.0 confident-wrong
	```

	The `confidence ∈ [0,1]` field is part of the structured `HypothesisPayload` Pydantic model the agent emits; the calibration sub-term reads it directly. A model that bluffs high confidence on a wrong root cause is worse than one that hedges. That's the world-modelling primitive — the env grades the agent's belief, not just its prediction.

	Two CI invariants pin the rubric in place on every commit:
	- Heuristic ceiling `[0.65, 0.80]` — `test_heuristic_ceiling_is_in_band` enforces this band on every template. The 0.20 gap from 0.80 → 1.00 is the GRPO training target.
	- Scripted-expert floor `≥0.90` — `test_round2_baseline_resolves` enforces ≥0.90 on the round-2 templates.

	Tighten either side and the gradient signal collapses; loosen either side and a memorising model can game the rubric. The band is the load-bearing engineering claim.

	---

	## Coliseum — parallel-rollout pool server

	[`coliseum/`](coliseum/) wraps the Triage env in a lease-based HTTP contract so a GRPO trainer's parallel-rollout side can drive the env without holding an in-process Python instance per worker:

	```
	allocate(task_key) -> {ok: true, lease_id}
	reset(lease_id, task_meta, run_ctx) -> {ok: true, observation}
	exec_tool(lease_id, tool_call) -> {ok: true, observation}
	evaluate(lease_id) -> {ok: true, score}
	close(lease_id) -> {ok: true}
	```

	8-way concurrent rollouts on a single process via per-lease `asyncio.Lock`; a background reaper evicts idle leases after `COLISEUM_LEASE_TTL_S` so a crashed worker doesn't leak env instances.

	```bash
	# Boot the pool server
	uvicorn coliseum.server:app --host 0.0.0.0 --port 8100

	# Drive it from a trainer
	export COLISEUM_BASE_URL=http://127.0.0.1:8100
	```

	Standard lease-pool pattern — see [`coliseum/README.md`](coliseum/README.md) for the full env-var table.

	---

	## Training & datasets — the honest weak point

	The environment is the project. Training scripts orbit around it. We ran a real end-to-end Triage run on the day of the deadline — pipeline works, results are below the heuristic plateau, the gap is real, and we're saying so.

	### What we ran

	Pipeline lives in [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb) — also openable in Colab via the badge: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Re9pwkabEP4Cearc2hCMMqGdEjSSUjGu?usp=sharing). Target A100 80GB, ~2-3h end-to-end:

	1. SFT cold-start — Unsloth + Qwen2.5-7B-Instruct (4-bit) + LoRA r=32, 50 steps × batch 16 on a 999-example diverse trajectory corpus. Eval perplexity 1.755 (healthy `[1.5, 3.0]` band). Saved to `outputs/qwen25_7b_sft_final/`.
	2. GRPO online — TRL's `GRPOTrainer`, 40 steps × K=2 rollouts. Reward = composite + scenario-aware first-action bonus. Saved to `outputs/qwen25_7b_grpo_final/`.
	3. Held-out eval — 12 `__p05` scenarios × 3 seeds = 36 episodes per policy, 5 policies compared.

	### What it produced

	![SystemTruth Triage holdout eval — Qwen2.5-7B, 12 scenarios × 3 seeds](eval/results/qwen25_7b_comparison_hero.png)

	\| policy \| mean \| median \| p25 \| p75 \| resolved_rate \|
	\|---\|---\|---\|---\|---\|---\|
	\| random \| 0.342 \| 0.378 \| 0.340 \| 0.380 \| 0/36 \|
	\| qwen25-7b-sft-only \| 0.379 \| 0.380 \| 0.378 \| 0.380 \| 0/36 \|
	\| qwen25-7b-grpo \| 0.379 \| 0.380 \| 0.378 \| 0.380 \| 0/36 \|
	\| heuristic (queries + correct hypothesis, no remediation) \| 0.704 \| 0.705 \| 0.703 \| 0.705 \| 0/36 \|
	\| scripted-optimal \| 0.938 \| 0.939 \| 0.937 \| 0.940 \| 36/36 \|

	![Per-template mean score by policy](eval/results/qwen25_7b_comparison_per_template.png)

	Honest reading:
	- SFT lifted the model 11% above random (0.342 → 0.379). Format-learning worked: the trained model produces 100% schema-valid `action_type` JSON.
	- GRPO did not move the mean further on K=2 / 40 steps / 7B. The advantage signal exists but the budget is too small to overcome the heuristic plateau at 0.704.
	- The env's rubric refuses to give partial credit for "looks right" output. A model that emits a polished hypothesis but never calls `rollback_deploy` is rewarded for zero remediation, and the 0.45 `outcome` slice stays empty. The numbers don't lie about the work the model didn't do — and that's exactly the rubric's job.

	### What's bottlenecking us

	The training-time weak spot is data scale and step budget, not the env. The 120-episode corpus + 50-step SFT + 40-step GRPO is what fits inside a hackathon weekend on one A100. The env, the rubric, and Coliseum are all sized for the real run; the trajectory corpus and the GRPO budget are not. Scaling either side (more teacher trajectories, more GRPO steps with K=4 rollouts, longer eval window) is the obvious next move and the bottleneck stops being the env.

	The training scripts in [`train/`](train/) are working as written for the dataset we have — `build_corpus.py` produces a clean 60/20/20 quality-stratified split, `eval_sweep.py` drives the held-out comparison, `run_expert_collection.py` harvests teacher trajectories. They don't need rewriting; they need more data flowing through them.

	---

	## Quickstart

	### 5-minute local demo (no API keys, no server, no GPU)

	```bash
	pip install -e .
	ollama pull llama3.2
	python -m sre_gym.local triage worker_deploy_cascade
	```

	The CLI drives `UnifiedIncidentEnvironment` directly and prints per-tick reward, the 5-component score breakdown, and a final summary.

	### Live HF Space (Triage tier, hosted)

	Open https://huggingface.co/spaces/Madhav189/SystemTruth. Pick a tier, paste an HF token, click ▶ run eval. Each tick streams the action, env response, reward delta, and the 5-component breakdown.

	### Local server + Gradio UI

	```bash
	make install
	make dev # FastAPI + Gradio on :7860
	python -m sre_gym.strategy run cascading_release_train
	python -m sre_gym.operations run ecommerce_vibecoded_saas --chaos rls_silent_leak
	```

	The FastAPI server speaks the OpenEnv contract (`/reset /step /state /tasks /baseline /grader /status /health /metadata /schema`) plus an MCP JSON-RPC route at `/mcp`.

	---

	## Two-paths agent design

	The repo ships two independent paths to a working SRE agent. They share the env contract but trade compute for capability differently.

	### Path A — verified-runbook skill (zero training)

	[`skill/`](skill/) packages the env as a Claude Code skill. The agent reads scenario evidence, writes a verified runbook on a clean solve, and reads the runbook on the next attempt. No training, no GPU.

	```bash
	ln -s "$PWD/skill" "$HOME/.claude/skills/sre-gym"
	bash demo/run_demo.sh # end-to-end demo
	```

	### Path B — GRPO-trained adapter (one A100, ~2–3h on 7B)

	The training pipeline above. Path A is what an agent ships today; Path B is what raises the floor on the templates it sees over and over.

	---

	## Tier-aware Python API

	```python
	from sre_gym import SREGym, Tier

	# Triage — per-step (live FastAPI) or end-to-end
	env = SREGym(tier=Tier.TRIAGE)
	obs = env.reset(scenario_id="memory_leak_oom__p02")
	obs = env.step({"action_type": "rollback_deploy", "service": "worker"})
	result = env.run("memory_leak_oom__p02", seed=42)

	# Strategy — episodic only (chained Triage episodes)
	env = SREGym(tier=Tier.STRATEGY)
	result = env.run("cascading_release_train", seed=1)

	# Operations — per-step (graph mutations) or end-to-end (state-machine simulator)
	env = SREGym(tier=Tier.OPERATIONS)
	obs = env.reset(family_id="ecommerce_vibecoded_saas", chaos="rls_silent_leak", seed=1)
	obs = env.step({"action_type": "rollback_deploy", "service": "postgres-primary"})
	```

	Old tier names (`Tier.BASIC`, `Tier.ADVANCED`, `Tier.MAX`) are preserved as Enum aliases so existing callers keep working; importing them emits a `DeprecationWarning`.

	---

	## Tests + lint

	```bash
	make test # green at HEAD
	ruff check .
	openenv validate . # green
	```

	The two CI invariants that keep the rubric calibrated:

	- `test_heuristic_ceiling_is_in_band` — naive heuristic in `[0.65, 0.80]` on every template.
	- `test_round2_baseline_resolves` — scripted-optimal `≥ 0.90` on the round-2 templates.

	---

	## Materials

	- [`BLOG.md`](BLOG.md) — the hackathon blog (with all 6 assets in `docs/blog/`)
	- [`openenv.yaml`](openenv.yaml) — declares the three tiers, runnable kinds, scenario counts
	- [`docs/`](docs/) — architecture, per-tier deep dives, reward design, scenario authoring
	- [`docs/blog/`](docs/blog/) — visuals: lifecycle, architecture, hero, topology, rubric donut, chaos timeline, two-paths, baselines bar
	- [`skill/`](skill/) — Claude Code skill packaging (Path A)
	- [`coliseum/`](coliseum/) — parallel-rollout pool server
	- [`demo/`](demo/) — `run_demo.sh` end-to-end demo, `pitch.md` narrative
	- [`eval/`](eval/) — held-out split definition, results directory with the latest eval CSV + plots
	- [`train/data/`](train/data/) — teacher trajectories (Claude Opus + Llama-3.3-70B + scripted baselines + 120-episode v2 corpus)
	- [`notebooks/`](notebooks/) — Triage SFT→GRPO training (`01_triage_train_grpo_qwen25_7b.ipynb`), eval comparison (`02_triage_eval_compare_all.ipynb`), Strategy + Operations walkthroughs

	---

	## License

	Apache 2.0. Built for the OpenEnv-class hackathon, India 2026 — by the Madhav-GPT / dakshdoesdev team.