SystemTruth / README.md
Madhav189's picture
SystemTruth rebrand: bigger UI, new diagrams, theme-cohesive HF Space
6583a07
---
title: SystemTruth
emoji: 🚨
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
---
# SystemTruth β€” a tier-escalating SRE training environment
> **Hackathon submission β€” OpenEnv-class, India 2026**
>
> - πŸ“– **Blog:** [BLOG.md](BLOG.md)
> - πŸš€ **Live HF Space:** https://huggingface.co/spaces/Madhav189/SystemTruth
> - πŸ’» **GitHub:** https://github.com/Madhav-GPT/SystemTruth
> - πŸ§ͺ **Training notebook (Colab):** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Re9pwkabEP4Cearc2hCMMqGdEjSSUjGu?usp=sharing) β€” same as [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb)
> - πŸ“Š **Eval results:** [`eval/results/`](eval/results/)
> - πŸ“œ **License:** Apache 2.0
**Each tier escalates a different dimension. Triage escalates compute, Strategy escalates horizon, Operations escalates realism.** That single sentence is the load-bearing claim of the project.
---
## What's in the box (the USP β€” read this first)
SystemTruth is **one runnable RL environment with three personas baked into it**. The same 11-action contract, the same 5-component reward rubric, the same termination shape β€” escalated along three orthogonal axes that map to the three real bottlenecks SRE-agent training loops actually hit.
![SystemTruth architecture β€” three tiers under one shared 11-action interface + 5-component rubric](docs/blog/system_architecture.png)
### One environment, three tiers, three different bottlenecks
| Tier | Bottleneck | Persona | What it teaches |
|---|---|---|---|
| **Triage** | **Compute** | ML student / Kaggle, $30 of HF credits | causal mapping under tight context β€” pre-digested observations, dense reward shaping, 8K context, 11-action space, 8–13 ticks per episode |
| **Strategy** | **Horizon** | Seed-stage startup, $300–500 budget | long-horizon planning across chained incidents β€” multi-incident chains with persistent state, unresolved alerts and pending deploys carry forward, 60–90 ticks |
| **Operations** | **Realism** | Enterprise SRE platform, 8Γ—A100/H100 cluster | authentic tool use against irreversible actions β€” 22-node service graph, 11 chaos patterns pinned to real production post-mortems, 110–180+ actions per episode |
The escalation axis is the entire pitch. Most RL environments stratify by *difficulty* (more scenarios, longer episodes, harder rewards). SystemTruth stratifies by **the dimension that actually limits the training loop for that persona**:
- A junior on-call learning to triage faces a different problem (cognitive efficiency under tight context) than a senior SRE running a multi-incident postmortem (state tracking across long horizons), which faces a different problem from an enterprise platform team operating against an actively chaos-engineered cluster (irreversible actions, partial observability, real wall-clock).
- Their training signals, episode shapes, observation richness, and reward structures should not look the same.
- SystemTruth takes that observation seriously and stratifies its tiers along *the dimension that actually limits the persona's training loop*.
### The shared 11-action contract
Every tier β€” Triage, Strategy, Operations β€” speaks the same eleven Pydantic-validated actions. **One contract, three escalation envelopes:**
```
query_logs(service) query_metrics(service, metric)
query_dependencies(service) query_deploys(service)
rollback_deploy(service) restart_service(service)
isolate_service(service) run_check(check_name)
submit_hypothesis(hypothesis) escalate
declare_resolved
```
A successful episode is `gather evidence β†’ submit_hypothesis β†’ rollback_deploy β†’ restart_service β†’ both run_check pass β†’ declare_resolved`. Wrong rollback target, premature restart, or premature resolution all return negative reward and a typed `failure_type`. The contract refuses to be gamed.
### The episode lifecycle, illustrated
The lifecycle below is the Triage tier in detail; Strategy chains N of them with horizon-decay, Operations runs one of them inside a graph-mutation simulator. **The shape is shared across all three tiers** β€” the simulator under it is what changes.
![SystemTruth episode lifecycle β€” Triage tier, same shape inherited by Strategy and Operations](docs/blog/episode_lifecycle.png)
Eleven numbered stages, each producing a measurable signal:
1. **`reset(scenario_id)`** β€” env emits the initial observation: tick counter, workflow stage, incident summary, active alerts, noise alerts (decoys), service health (cpu/mem/err/latency), user impact, SLO burn rate, checks, allowed actions.
2. **Evidence gathering loop** β€” agent calls `query_logs / query_metrics / query_dependencies / query_deploys`. After every step the env computes a per-tick **shaped reward** as a potential difference (`Ξ” critical_service_health Γ— 0.55 + Ξ” (1 βˆ’ user_impact) Γ— 0.20 + Ξ” (1 βˆ’ slo_burn_rate) Γ— 0.15 + containment_applied Γ— 0.10`) minus `step_cost`, plus `bonus`, minus `penalty`.
3. **`submit_hypothesis(root_cause, affected_services, confidence, recommended_next_action)`** β€” the world-modelling primitive. Confidence is a `float ∈ [0,1]` the agent must commit to.
4. **Hypothesis correctness check** β€” if the root cause matches truth, the agent gets an in-episode bonus up to ~0.12 (idempotent β€” second identical hypothesis scores 0). If wrong, the agent loops back to investigation with a new observation.
5. **`rollback_deploy(service)`** β€” the irreversible action. Wrong target = `unsafe_action_penalty` (0.08 medium / 0.12 hard). Correct target sets `cause_removed = True` and unblocks restart.
6. **`restart_service(service)`** β€” only valid if scenario requires it. Guard: if cause not removed, premature-restart penalty fires and state re-inherits the bad config.
7. **`run_check("end_to_end" | "database_recovery")`** β€” verification gate. If checks fail, agent loops back to investigation.
8. **`declare_resolved`** β€” terminal action. Guard: if checks not passed, `premature_resolution_penalty` (0.20 / 0.30) fires.
9. **Episode terminates** β€” terminal state emitted.
10. **Compute composite from terminal state** β€” the 5-component rubric below evaluates outcome / action_validity / format / anticheat / efficiency, sums to 1.0 with weighted clamping to `[0.01, 0.99]`.
11. **Reference scores anchor the rubric** β€” random `0.417` (0/36 resolved), naive heuristic `0.749` (0/12 resolved), scripted-optimal `0.938` (12/12 resolved). The 0.20 gap from `0.80 β†’ 1.00` is what GRPO trains into.
**Cross-tier extension:**
- **Strategy** chains N Triage episodes, applies a `horizon_decay_factor Γ— mean(per-phase composite)` to the final reward.
- **Operations** runs the same lifecycle inside a graph-mutation simulator over a 22-node service topology, same rubric, same horizon-decay weighting.
### Reward rubric β€” the engineering crown jewel
```
final_reward = 0.45Β·outcome
+ 0.20Β·action_validity
+ 0.15Β·anticheat
+ 0.10Β·format
+ 0.10Β·efficiency
────
composite ∈ [0.01, 0.99] (clamped, public score)
outcome root-cause action correct + recovery confirmed
action_validity fraction of step() actions that are well-formed Pydantic
anticheat declare_resolved blocked unless β‰₯1 query already ran
format submit_hypothesis was called before declare_resolved
efficiency exp(-current_tick / optimal_ticks_for_template)
```
Each component defends against a specific cheat:
| Cheat strategy | Blocked by |
|---|---|
| `declare_resolved` before any query | `anticheat` (0.15) |
| Skip `submit_hypothesis` to save a tick | `format` (0.10) |
| Spam hypotheses to fish for partial credit | hypothesis idempotence + `action_validity` |
| Send malformed actions | `action_validity` (0.20) |
| Resolve before checks pass | `outcome` (0.45) + `premature_resolution` step penalty |
The cleverest piece is the **calibration term inside `submit_hypothesis`**:
```
hypothesis_reward = 0.04Β·cause_match
+ 0.03Β·service_match
+ 0.03Β·action_quality
+ 0.02Β·calibration
calibration awards
+1.0 confident-correct
+0.5 hedged-correct
-0.2 hedged-wrong
-1.0 confident-wrong
```
The `confidence ∈ [0,1]` field is part of the structured `HypothesisPayload` Pydantic model the agent emits; the calibration sub-term reads it directly. **A model that bluffs high confidence on a wrong root cause is worse than one that hedges.** That's the world-modelling primitive β€” the env grades the agent's belief, not just its prediction.
Two CI invariants pin the rubric in place on every commit:
- **Heuristic ceiling `[0.65, 0.80]`** β€” `test_heuristic_ceiling_is_in_band` enforces this band on every template. The 0.20 gap from 0.80 β†’ 1.00 is the GRPO training target.
- **Scripted-expert floor `β‰₯0.90`** β€” `test_round2_baseline_resolves` enforces β‰₯0.90 on the round-2 templates.
Tighten either side and the gradient signal collapses; loosen either side and a memorising model can game the rubric. The band is the load-bearing engineering claim.
---
## Coliseum β€” parallel-rollout pool server
[`coliseum/`](coliseum/) wraps the Triage env in a lease-based HTTP contract so a GRPO trainer's parallel-rollout side can drive the env without holding an in-process Python instance per worker:
```
allocate(task_key) -> {ok: true, lease_id}
reset(lease_id, task_meta, run_ctx) -> {ok: true, observation}
exec_tool(lease_id, tool_call) -> {ok: true, observation}
evaluate(lease_id) -> {ok: true, score}
close(lease_id) -> {ok: true}
```
8-way concurrent rollouts on a single process via per-lease `asyncio.Lock`; a background reaper evicts idle leases after `COLISEUM_LEASE_TTL_S` so a crashed worker doesn't leak env instances.
```bash
# Boot the pool server
uvicorn coliseum.server:app --host 0.0.0.0 --port 8100
# Drive it from a trainer
export COLISEUM_BASE_URL=http://127.0.0.1:8100
```
Standard lease-pool pattern β€” see [`coliseum/README.md`](coliseum/README.md) for the full env-var table.
---
## Training & datasets β€” the honest weak point
The environment is the project. Training scripts orbit around it. We ran a real end-to-end Triage run on the day of the deadline β€” pipeline works, results are below the heuristic plateau, the gap is real, and we're saying so.
### What we ran
Pipeline lives in [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb) β€” also openable in Colab via the badge: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Re9pwkabEP4Cearc2hCMMqGdEjSSUjGu?usp=sharing). Target A100 80GB, ~2-3h end-to-end:
1. **SFT cold-start** β€” Unsloth + Qwen2.5-7B-Instruct (4-bit) + LoRA r=32, 50 steps Γ— batch 16 on a 999-example diverse trajectory corpus. Eval perplexity 1.755 (healthy `[1.5, 3.0]` band). Saved to `outputs/qwen25_7b_sft_final/`.
2. **GRPO online** β€” TRL's `GRPOTrainer`, 40 steps Γ— K=2 rollouts. Reward = composite + scenario-aware first-action bonus. Saved to `outputs/qwen25_7b_grpo_final/`.
3. **Held-out eval** β€” 12 `__p05` scenarios Γ— 3 seeds = 36 episodes per policy, 5 policies compared.
### What it produced
![SystemTruth Triage holdout eval β€” Qwen2.5-7B, 12 scenarios Γ— 3 seeds](eval/results/qwen25_7b_comparison_hero.png)
| policy | mean | median | p25 | p75 | resolved_rate |
|---|---|---|---|---|---|
| random | 0.342 | 0.378 | 0.340 | 0.380 | 0/36 |
| **qwen25-7b-sft-only** | **0.379** | 0.380 | 0.378 | 0.380 | 0/36 |
| **qwen25-7b-grpo** | **0.379** | 0.380 | 0.378 | 0.380 | 0/36 |
| heuristic (queries + correct hypothesis, no remediation) | 0.704 | 0.705 | 0.703 | 0.705 | 0/36 |
| scripted-optimal | 0.938 | 0.939 | 0.937 | 0.940 | 36/36 |
![Per-template mean score by policy](eval/results/qwen25_7b_comparison_per_template.png)
Honest reading:
- SFT lifted the model 11% above random (0.342 β†’ 0.379). Format-learning worked: the trained model produces 100% schema-valid `action_type` JSON.
- GRPO did not move the mean further on K=2 / 40 steps / 7B. The advantage signal exists but the budget is too small to overcome the heuristic plateau at 0.704.
- The env's rubric refuses to give partial credit for "looks right" output. A model that emits a polished hypothesis but never calls `rollback_deploy` is rewarded for zero remediation, and the 0.45 `outcome` slice stays empty. **The numbers don't lie about the work the model didn't do** β€” and that's exactly the rubric's job.
### What's bottlenecking us
The training-time weak spot is **data scale and step budget**, not the env. The 120-episode corpus + 50-step SFT + 40-step GRPO is what fits inside a hackathon weekend on one A100. The env, the rubric, and Coliseum are all sized for the real run; the trajectory corpus and the GRPO budget are not. Scaling either side (more teacher trajectories, more GRPO steps with K=4 rollouts, longer eval window) is the obvious next move and the bottleneck stops being the env.
The training scripts in [`train/`](train/) are working as written for the dataset we have β€” `build_corpus.py` produces a clean 60/20/20 quality-stratified split, `eval_sweep.py` drives the held-out comparison, `run_expert_collection.py` harvests teacher trajectories. They don't need rewriting; they need more data flowing through them.
---
## Quickstart
### 5-minute local demo (no API keys, no server, no GPU)
```bash
pip install -e .
ollama pull llama3.2
python -m sre_gym.local triage worker_deploy_cascade
```
The CLI drives `UnifiedIncidentEnvironment` directly and prints per-tick reward, the 5-component score breakdown, and a final summary.
### Live HF Space (Triage tier, hosted)
Open https://huggingface.co/spaces/Madhav189/SystemTruth. Pick a tier, paste an HF token, click **β–Ά run eval**. Each tick streams the action, env response, reward delta, and the 5-component breakdown.
### Local server + Gradio UI
```bash
make install
make dev # FastAPI + Gradio on :7860
python -m sre_gym.strategy run cascading_release_train
python -m sre_gym.operations run ecommerce_vibecoded_saas --chaos rls_silent_leak
```
The FastAPI server speaks the OpenEnv contract (`/reset /step /state /tasks /baseline /grader /status /health /metadata /schema`) plus an MCP JSON-RPC route at `/mcp`.
---
## Two-paths agent design
The repo ships **two independent paths** to a working SRE agent. They share the env contract but trade compute for capability differently.
### Path A β€” verified-runbook skill (zero training)
[`skill/`](skill/) packages the env as a Claude Code skill. The agent reads scenario evidence, writes a verified runbook on a clean solve, and reads the runbook on the next attempt. No training, no GPU.
```bash
ln -s "$PWD/skill" "$HOME/.claude/skills/sre-gym"
bash demo/run_demo.sh # end-to-end demo
```
### Path B β€” GRPO-trained adapter (one A100, ~2–3h on 7B)
The training pipeline above. Path A is what an agent ships *today*; Path B is what raises the floor on the templates it sees over and over.
---
## Tier-aware Python API
```python
from sre_gym import SREGym, Tier
# Triage β€” per-step (live FastAPI) or end-to-end
env = SREGym(tier=Tier.TRIAGE)
obs = env.reset(scenario_id="memory_leak_oom__p02")
obs = env.step({"action_type": "rollback_deploy", "service": "worker"})
result = env.run("memory_leak_oom__p02", seed=42)
# Strategy β€” episodic only (chained Triage episodes)
env = SREGym(tier=Tier.STRATEGY)
result = env.run("cascading_release_train", seed=1)
# Operations β€” per-step (graph mutations) or end-to-end (state-machine simulator)
env = SREGym(tier=Tier.OPERATIONS)
obs = env.reset(family_id="ecommerce_vibecoded_saas", chaos="rls_silent_leak", seed=1)
obs = env.step({"action_type": "rollback_deploy", "service": "postgres-primary"})
```
Old tier names (`Tier.BASIC`, `Tier.ADVANCED`, `Tier.MAX`) are preserved as Enum aliases so existing callers keep working; importing them emits a `DeprecationWarning`.
---
## Tests + lint
```bash
make test # green at HEAD
ruff check .
openenv validate . # green
```
The two CI invariants that keep the rubric calibrated:
- `test_heuristic_ceiling_is_in_band` β€” naive heuristic in `[0.65, 0.80]` on every template.
- `test_round2_baseline_resolves` β€” scripted-optimal `β‰₯ 0.90` on the round-2 templates.
---
## Materials
- [`BLOG.md`](BLOG.md) β€” the hackathon blog (with all 6 assets in `docs/blog/`)
- [`openenv.yaml`](openenv.yaml) β€” declares the three tiers, runnable kinds, scenario counts
- [`docs/`](docs/) β€” architecture, per-tier deep dives, reward design, scenario authoring
- [`docs/blog/`](docs/blog/) β€” visuals: lifecycle, architecture, hero, topology, rubric donut, chaos timeline, two-paths, baselines bar
- [`skill/`](skill/) β€” Claude Code skill packaging (Path A)
- [`coliseum/`](coliseum/) β€” parallel-rollout pool server
- [`demo/`](demo/) β€” `run_demo.sh` end-to-end demo, `pitch.md` narrative
- [`eval/`](eval/) β€” held-out split definition, results directory with the latest eval CSV + plots
- [`train/data/`](train/data/) β€” teacher trajectories (Claude Opus + Llama-3.3-70B + scripted baselines + 120-episode v2 corpus)
- [`notebooks/`](notebooks/) — Triage SFT→GRPO training (`01_triage_train_grpo_qwen25_7b.ipynb`), eval comparison (`02_triage_eval_compare_all.ipynb`), Strategy + Operations walkthroughs
---
## License
Apache 2.0. Built for the OpenEnv-class hackathon, India 2026 β€” by the Madhav-GPT / dakshdoesdev team.