Madhav189 committed 28 days ago

Commit 6583a07 · 1 parent: e8774c9

SystemTruth rebrand: bigger UI, new diagrams, theme-cohesive HF Space

- Rebrand sre-env -> SystemTruth across user-facing surfaces:
* GitHub: github.com/Madhav-GPT/SystemTruth
* HF Space: huggingface.co/spaces/Madhav189/SystemTruth
* Live URL: Madhav189-SystemTruth.hf.space
* HF Space header brand mark + tagline
Internal package name (sre_gym/) is preserved as backwards-compat;
the user-facing surface is fully renamed.

- README rewrite: much bigger "what's in the box" section explaining
the three-tier USP (Triage compute / Strategy horizon / Operations
realism). Embeds the 4 visual assets:
* docs/blog/system_architecture.png (architecture diagram)
* docs/blog/episode_lifecycle.png (lifecycle diagram)
* eval/results/qwen25_7b_comparison_hero.png (hero bar)
* eval/results/qwen25_7b_comparison_per_template.png (per-template)
Adds the Colab badge link to
colab.research.google.com/drive/1Re9pwkabEP4Cearc2hCMMqGdEjSSUjGu
for the training notebook.

- HF Space UI fixes per user critique:
* Inputs lighter (#1f2630, was bg-input #161b22) and bigger
(padding 12px 14px, font-size 13px, min-height 42px, border-radius
4px) for usability.
* Tier cards now use phosphor-green accent (matches header brand mark)
instead of out-of-place blue. New text reads
"Triage / Strategy / Operations" instead of "Basic / Advanced / Max".
* Controls row now stacks vertically: run/stop/reset buttons on top,
metrics + rubric bar BELOW (was side-by-side, cramped).
* Terminal min-height reduced 480 -> 280 so it's visible above the
fold without scrolling on a typical laptop viewport.
* RUN EVAL button uses brand phosphor green (was generic Gradio
success-green); rubric bars + resolved-rate value also use the
brand color for theme cohesion.

- BLOG.md title updated: "sre-gym" -> "SystemTruth", body refs follow.

- execution.md: clone URL points at Madhav-GPT/SystemTruth.

- openenv.yaml: huggingface.space_id and github_repo updated.

- Eval result PNGs (qwen25_7b_comparison_hero + per_template)
added to git via .gitignore exception so the README + BLOG
references resolve without a separate fetch.

Files changed (11)

.gitignore +3 -0
BLOG.md +6 -6
README.md +98 -133
app.py +94 -73
docs/TRIAGE_TIER.md +1 -1
docs/blog/episode_lifecycle.png +3 -0
docs/blog/system_architecture.png +3 -0
eval/results/qwen25_7b_comparison_hero.png +3 -0
eval/results/qwen25_7b_comparison_per_template.png +3 -0
execution.md +4 -4
openenv.yaml +2 -2

.gitignore CHANGED

@@ -29,3 +29,6 @@ eval/results/*.png
 eval/results/*.jsonl
 !eval/results/.gitkeep
 !eval/results/README.md
+# Exception: keep the canonical charts referenced from README/BLOG
+!eval/results/qwen25_7b_comparison_hero.png
+!eval/results/qwen25_7b_comparison_per_template.png
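
The exception works because ignore rules are evaluated in order and the last matching rule wins, so a later `!` pattern re-includes a file that an earlier glob excluded. The sketch below is a simplified model of that behaviour, not git's actual matcher: it uses Python's `fnmatch`, whose `*` also matches `/`, and the helper name `is_ignored` and the path `scratch.png` are hypothetical.

```python
from fnmatch import fnmatch

# Rules in the order they appear around the hunk above.
# A leading "!" marks a negation, i.e. "re-include this path".
RULES = [
    "eval/results/*.png",
    "eval/results/*.jsonl",
    "!eval/results/.gitkeep",
    "!eval/results/README.md",
    "!eval/results/qwen25_7b_comparison_hero.png",
    "!eval/results/qwen25_7b_comparison_per_template.png",
]


def is_ignored(path: str, rules: list[str] = RULES) -> bool:
    """Return True if `path` ends up ignored under last-match-wins semantics."""
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if fnmatch(path, pattern):
            # A later matching rule overrides any earlier decision.
            ignored = not negated
    return ignored


if __name__ == "__main__":
    for candidate in (
        "eval/results/scratch.png",  # hypothetical file, still ignored
        "eval/results/qwen25_7b_comparison_hero.png",  # re-included by "!"
    ):
        print(candidate, "ignored" if is_ignored(candidate) else "tracked")
```

In a real checkout, `git check-ignore -v <path>` reports which rule decided the outcome for a given path.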

BLOG.md CHANGED Viewed

@@ -1,12 +1,12 @@
 ---
-title: "sre-gym — three tiers of SRE incident-response, one rubric that won't let you fake it"
 thumbnail: docs/blog/hero_three_tiers.png
 authors:
   - user: Madhav189
   - user: dakshdoesdev
 ---
-# sre-gym — three tiers of SRE incident-response, one rubric that won't let you fake it
 **TL;DR**
@@ -18,7 +18,7 @@ authors:
 ## Why this matters (read this first)
-Calibrated incident-response is the capability gap. Every general-purpose LLM is bad at it: they hallucinate confident root causes, over-trust the loudest signal, skip verification, and declare incidents resolved before checking anything. Those failure modes are invisible in chat demos and catastrophic in production. **sre-gym makes them legible enough to measure, then small enough to fix** — and exposes the env via the OpenEnv contract so any RL stack can train against it.
 We treat incident-response as a small **world-modelling** problem: the agent has to maintain a hidden-state estimate of which service is actually broken, update it from noisy observations, and commit to irreversible actions under uncertainty. The 5-component rubric grades the *mechanical signature* of that loop — evidence first, hypothesis with calibrated confidence, remediation, verification, only then resolution — instead of rewarding output that merely looks right.
@@ -37,7 +37,7 @@ ollama pull llama3.2
 python -m sre_gym.local triage worker_deploy_cascade
 ```
-The CLI drives `UnifiedIncidentEnvironment` directly and prints per-tick reward, the 5-component score breakdown, and a final summary. Same code path as the HF Space at `https://huggingface.co/spaces/Madhav189/sre-env` — just without the Gradio UI in front of it.
 ## Three tiers, three bottlenecks
@@ -197,12 +197,12 @@ Reading post-mortems, not blog posts. Fly.io's gossip-protocol deadlock from Oct
 The Triage env is live. Pick a scenario, pick a model provider, watch each tick stream the action, env response, reward delta, and rubric breakdown.
-<iframe src="https://Madhav189-sre-env.hf.space" frameborder="0" width="100%" height="800"></iframe>
 For the per-tier deep dives: [`docs/TRIAGE_TIER.md`](docs/TRIAGE_TIER.md) · [`docs/STRATEGY_TIER.md`](docs/STRATEGY_TIER.md) · [`docs/OPERATIONS_TIER.md`](docs/OPERATIONS_TIER.md). For the rubric defense: [`docs/REWARD_DESIGN.md`](docs/REWARD_DESIGN.md). For the architectural narrative: [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md). For the operator guide: [`execution.md`](execution.md). The training notebook lives at [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb).
 ## The claim
-sre-gym is the first SRE training environment that grades calibrated confidence as a first-class signal. The rubric tells you exactly where your model is bluffing — to two decimal places, on every commit, with a CI invariant that fails the build if the heuristic ceiling drifts out of band. Train against it and the hidden-state estimate inside your model gets sharper episode by episode. Skip the rubric and your agent stays a chat-window demo.
 Built for the OpenEnv-class hackathon, India 2026 — by the Madhav-GPT / dakshdoesdev team. Apache 2.0.

 ---
+title: "SystemTruth — three tiers of SRE incident-response, one rubric that won't let you fake it"
 thumbnail: docs/blog/hero_three_tiers.png
 authors:
   - user: Madhav189
   - user: dakshdoesdev
 ---
+# SystemTruth — three tiers of SRE incident-response, one rubric that won't let you fake it
 **TL;DR**
 ## Why this matters (read this first)
+Calibrated incident-response is the capability gap. Every general-purpose LLM is bad at it: they hallucinate confident root causes, over-trust the loudest signal, skip verification, and declare incidents resolved before checking anything. Those failure modes are invisible in chat demos and catastrophic in production. **SystemTruth makes them legible enough to measure, then small enough to fix** — and exposes the env via the OpenEnv contract so any RL stack can train against it.
 We treat incident-response as a small **world-modelling** problem: the agent has to maintain a hidden-state estimate of which service is actually broken, update it from noisy observations, and commit to irreversible actions under uncertainty. The 5-component rubric grades the *mechanical signature* of that loop — evidence first, hypothesis with calibrated confidence, remediation, verification, only then resolution — instead of rewarding output that merely looks right.
 python -m sre_gym.local triage worker_deploy_cascade
 ```
+The CLI drives `UnifiedIncidentEnvironment` directly and prints per-tick reward, the 5-component score breakdown, and a final summary. Same code path as the HF Space at `https://huggingface.co/spaces/Madhav189/SystemTruth` — just without the Gradio UI in front of it.
 ## Three tiers, three bottlenecks
 The Triage env is live. Pick a scenario, pick a model provider, watch each tick stream the action, env response, reward delta, and rubric breakdown.
+<iframe src="https://Madhav189-SystemTruth.hf.space" frameborder="0" width="100%" height="800"></iframe>
 For the per-tier deep dives: [`docs/TRIAGE_TIER.md`](docs/TRIAGE_TIER.md) · [`docs/STRATEGY_TIER.md`](docs/STRATEGY_TIER.md) · [`docs/OPERATIONS_TIER.md`](docs/OPERATIONS_TIER.md). For the rubric defense: [`docs/REWARD_DESIGN.md`](docs/REWARD_DESIGN.md). For the architectural narrative: [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md). For the operator guide: [`execution.md`](execution.md). The training notebook lives at [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb).
 ## The claim
+SystemTruth is the first SRE training environment that grades calibrated confidence as a first-class signal. The rubric tells you exactly where your model is bluffing — to two decimal places, on every commit, with a CI invariant that fails the build if the heuristic ceiling drifts out of band. Train against it and the hidden-state estimate inside your model gets sharper episode by episode. Skip the rubric and your agent stays a chat-window demo.
 Built for the OpenEnv-class hackathon, India 2026 — by the Madhav-GPT / dakshdoesdev team. Apache 2.0.

README.md CHANGED Viewed

@@ -1,5 +1,5 @@
 ---
-title: SRE Gym
 emoji: 🚨
 colorFrom: red
 colorTo: yellow
@@ -9,72 +9,44 @@ pinned: false
 license: apache-2.0
 ---
-# sre-gym — a tier-escalating SRE training environment
 > **Hackathon submission — OpenEnv-class, India 2026**
 >
 > - 📖 **Blog:** [BLOG.md](BLOG.md)
-> - 🚀 **Live HF Space:** https://huggingface.co/spaces/Madhav189/sre-env
-> - 💻 **GitHub:** https://github.com/Madhav-GPT/sre-env
-> - 🧪 **Training notebook:** [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb)
 > - 📊 **Eval results:** [`eval/results/`](eval/results/)
 > - 📜 **License:** Apache 2.0
 **Each tier escalates a different dimension. Triage escalates compute, Strategy escalates horizon, Operations escalates realism.** That single sentence is the load-bearing claim of the project.
-The repo's centre of gravity, in priority order:
-1. **The environment** — 12 incident templates × 6 procgen variants = 72 deterministic scenarios, exposed via the OpenEnv contract (`/reset` / `/step`) on a FastAPI server. Same code path serves the Gradio UI mounted at `/`.
-2. **The reward rubric** — a 5-component composite that sums to exactly 1.0, with a heuristic ceiling pinned to `[0.65, 0.80]` and a scripted-expert floor at `≥0.90`, both enforced by CI invariants on every commit. Includes a calibration term inside `submit_hypothesis` that grades confident-wrong twice as harshly as hedged-wrong — a small **world-modelling** primitive: the agent has to maintain a belief over root causes and emit a calibrated confidence estimate.
-3. **Coliseum** — a parallel-rollout pool server that turns the env into a lease-based HTTP service so any GRPO trainer can drive K-rollouts-per-scenario without holding a Python env per worker.
-4. **Training & datasets (the honest weak point)** — an end-to-end SFT → GRPO pipeline on Qwen2.5-7B-Instruct, trained against a 120-episode trajectory corpus harvested from the env. The pipeline runs cleanly; the corpus and step budget are smaller than they need to be to break the heuristic ceiling on held-out scenarios. **The env is ready to train against; we ran out of compute before the model was.**
----
-## What's in the box
-| Tier | Runnable kind | Scenarios | What "running" means |
-|---|---|---|---|
-| **Triage** | live HTTP env | 12 templates × 6 entries each (1 base + 5 procgen) = **72 scenarios** | `/reset` + `/step` against the FastAPI server in this Docker image. The Gradio UI drives episodes end-to-end via the same routes. |
-| **Strategy** | Python orchestrator | 3 reference YAML scenarios | `sre_gym.strategy.runner.run_strategy` chains Triage episodes together, threading horizon state (unresolved alerts, pending deploys, tech-debt counter, horizon-decay reward). The 28-action universe in the YAML is design spec; the runner uses the Triage 11 actions. |
-| **Operations** | Python state-machine simulator | 1 family with 11 chaos patterns | `sre_gym.operations.runner.run_operations` mutates an in-memory 22-node service graph. Same Triage 11 actions. The compose stack alongside the simulator describes the topology an enterprise team would lift into a real cluster — the simulator runs without that lift. |
-The escalation axis is the point: each tier hardens a different bottleneck of building SRE agents in production.
 ---
-## Quickstart
-### 5-minute local demo (no API keys, no server, no GPU)
-```bash
-pip install -e .
-ollama pull llama3.2
-python -m sre_gym.local triage worker_deploy_cascade
-```
-The CLI drives `UnifiedIncidentEnvironment` directly and prints per-tick reward, the 5-component score breakdown, and a final summary. See [`sre_gym/local.py`](sre_gym/local.py) for the full flag set.
-### Live HF Space (Triage tier, hosted)
-Open https://huggingface.co/spaces/Madhav189/sre-env. Pick a scenario and a model provider, click **▶ run eval**. Each tick streams the action, env response, reward delta, and the 5-component breakdown.
-### Local server + Gradio UI
-```bash
-make install
-make dev                                              # FastAPI + Gradio on :7860
-python -m sre_gym.strategy run cascading_release_train
-python -m sre_gym.operations run ecommerce_vibecoded_saas --chaos rls_silent_leak
-```
-The FastAPI server speaks the OpenEnv contract (`/reset /step /state /tasks /baseline /grader /status /health /metadata /schema`) plus an MCP JSON-RPC route at `/mcp`.
----
-## The Triage tier — the runnable contract
-12 base templates of one-incident-at-a-time scenarios; each generates 5 procgen variants for 72 scenarios total. The agent has 11 bounded actions:
 ```
 query_logs(service)            query_metrics(service, metric)
@@ -85,17 +57,33 @@ submit_hypothesis(hypothesis)  escalate
 declare_resolved
 ```
-A successful episode looks like: `gather evidence → submit_hypothesis → rollback_deploy → restart_service → both run_check pass → declare_resolved`. Wrong rollback target, premature restart, or premature resolution all return negative reward and a typed `failure_type`.
-Services live in a 4-node topology (`api-gateway / cache / database / worker`) plus an 11-service noise-decoy pool that surfaces in alerts as decoys but never in queries. Each scenario specifies a root cause, the correct rollback target, the resolution check that must pass, and the decoy traps. See [`docs/TRIAGE_TIER.md`](docs/TRIAGE_TIER.md) for the per-template skill table.
-The Triage server is **the** runnable contract — Strategy and Operations chain Triage episodes; Coliseum (below) wraps Triage in a lease-based HTTP shape; the local CLI imports the Triage env in-process. Everything else is a runner shape on top.
----
-## The reward rubric — the engineering crown jewel
-Triage uses a **5-component rubric** that sums to exactly 1.0 — see [`docs/REWARD_DESIGN.md`](docs/REWARD_DESIGN.md):
 ```
 final_reward = 0.45·outcome
@@ -113,14 +101,7 @@ final_reward = 0.45·outcome
   efficiency       exp(-current_tick / optimal_ticks_for_template)
 ```
-Plus per-tick *shaped* reward (the change in incident-health potential) for dense GRPO signal. Strategy and Operations reuse the Triage rubric and apply a horizon-decay factor over per-phase composites.
-Two reference scores anchor the rubric and are CI-pinned:
-- **Heuristic ceiling `[0.65, 0.80]`** — a naive policy that gathers evidence and submits the correct hypothesis but never remediates lands here. Enforced by `test_heuristic_ceiling_is_in_band` across all 12 templates. The 0.20 gap from 0.80 → 1.00 is the GRPO training target.
-- **Scripted-expert floor `≥0.90`** — the optimal canonical solve scores ~0.94 on every template. Enforced by `test_round2_baseline_resolves`.
-Adversarial cheats are first-class:
 | Cheat strategy | Blocked by |
 |---|---|
@@ -145,13 +126,19 @@ calibration awards
    -1.0   confident-wrong
 ```
-The `confidence ∈ [0,1]` field is part of the structured `HypothesisPayload` Pydantic model the agent emits; the calibration sub-term reads it directly. **A model that bluffs high confidence on a wrong root cause is worse than one that hedges.** That's the world-modelling primitive — the env is grading the agent's belief, not just its prediction.
 ---
 ## Coliseum — parallel-rollout pool server
-[`coliseum/`](coliseum/) wraps the Triage env in a lease-based HTTP contract so a GRPO trainer's parallel-rollout side can drive the env without holding an in-process `UnifiedIncidentEnvironment` per worker:
 ```
 allocate(task_key)                    -> {ok: true, lease_id}
@@ -177,11 +164,11 @@ Standard lease-pool pattern — see [`coliseum/README.md`](coliseum/README.md) f
 ## Training & datasets — the honest weak point
-The environment is the project. Training scripts orbit around it. We ran a real end-to-end Triage run on the day of the deadline — pipeline works, results are below the heuristic, the gap is real, and we're saying so.
 ### What we ran
-Pipeline lives in [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb) (target: A100 80GB, ~2-3h end-to-end):
 1. **SFT cold-start** — Unsloth + Qwen2.5-7B-Instruct (4-bit) + LoRA r=32, 50 steps × batch 16 on a 999-example diverse trajectory corpus. Eval perplexity 1.755 (healthy `[1.5, 3.0]` band). Saved to `outputs/qwen25_7b_sft_final/`.
 2. **GRPO online** — TRL's `GRPOTrainer`, 40 steps × K=2 rollouts. Reward = composite + scenario-aware first-action bonus. Saved to `outputs/qwen25_7b_grpo_final/`.
@@ -189,6 +176,8 @@ Pipeline lives in [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/0
 ### What it produced
 | policy | mean | median | p25 | p75 | resolved_rate |
 |---|---|---|---|---|---|
 | random | 0.342 | 0.378 | 0.340 | 0.380 | 0/36 |
@@ -197,6 +186,8 @@ Pipeline lives in [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/0
 | heuristic (queries + correct hypothesis, no remediation) | 0.704 | 0.705 | 0.703 | 0.705 | 0/36 |
 | scripted-optimal | 0.938 | 0.939 | 0.937 | 0.940 | 36/36 |
 Honest reading:
 - SFT lifted the model 11% above random (0.342 → 0.379). Format-learning worked: the trained model produces 100% schema-valid `action_type` JSON.
 - GRPO did not move the mean further on K=2 / 40 steps / 7B. The advantage signal exists but the budget is too small to overcome the heuristic plateau at 0.704.
@@ -210,6 +201,35 @@ The training scripts in [`train/`](train/) are working as written for the datase
 ---
 ## Two-paths agent design
 The repo ships **two independent paths** to a working SRE agent. They share the env contract but trade compute for capability differently.
@@ -223,22 +243,12 @@ ln -s "$PWD/skill" "$HOME/.claude/skills/sre-gym"
 bash demo/run_demo.sh                                # end-to-end demo
 ```
-12 verified-runbook drafts ship in [`skill/verified-runbooks/`](skill/verified-runbooks/) — one per Triage template. The skill validates them by re-running the env after each solve.
 ### Path B — GRPO-trained adapter (one A100, ~2–3h on 7B)
 The training pipeline above. Path A is what an agent ships *today*; Path B is what raises the floor on the templates it sees over and over.
 ---
-## The HF Space UI
-The Gradio app at https://huggingface.co/spaces/Madhav189/sre-env is mounted at `/` of the same uvicorn process that serves `/reset` + `/step`. Three tiers selectable as cards (Triage live HTTP, Strategy chained-episode runner, Operations graph simulator). Every run streams per-tick action, env response, reward delta, and the 5-component breakdown — `out=… valid=… fmt=… anti=… eff=…`.
-Provider auth is whatever the user pastes (HF token plus optional Anthropic / OpenAI / Together / Fireworks / Groq / DeepSeek key). Tokens live only on the request instance — never logged, never persisted, never echoed in error messages. CSS theme is GitHub-dark phosphor on a JetBrains Mono base; see [`app.py`](app.py) for the full styling block.
----
 ## Tier-aware Python API
 ```python
@@ -268,74 +278,29 @@ Old tier names (`Tier.BASIC`, `Tier.ADVANCED`, `Tier.MAX`) are preserved as Enum
 ```bash
 make test            # green at HEAD
-ruff check .         # configured; pre-existing F401 cleanups tracked separately
 openenv validate .   # green
 ```
 The two CI invariants that keep the rubric calibrated:
 - `test_heuristic_ceiling_is_in_band` — naive heuristic in `[0.65, 0.80]` on every template.
-- `test_round2_baseline_resolves` — scripted-optimal `≥ 0.90` on the 6 round-2 templates.
-Tighten either side and the gradient signal collapses; loosen either side and a memorising model can game the rubric without learning causality. The band is the load-bearing engineering claim.
----
-## Architecture
-```
-┌──────────────────────────────────────────────────────────────┐
-│  app.py  (uvicorn app:app on port 7860)                      │
-│   ├─ Gradio terminal UI mounted at /                         │
-│   └─ FastAPI server (unified_incident_env.server.app)        │
-│       ├─ /reset /step /state         OpenEnv contract        │
-│       ├─ /tasks /baseline /grader    catalogue + scoring     │
-│       ├─ /status /health             ops probes              │
-│       ├─ /metadata /schema           OpenEnv metadata        │
-│       ├─ /mcp                        JSON-RPC 2.0 dual-route │
-│       ├─ /docs /redoc /openapi.json  Swagger / ReDoc         │
-│       └─ /info /simple               legacy markdown landing │
-│                                                              │
-│  sre_gym/                                                    │
-│   ├─ tier.py                Tier enum + TierConfig           │
-│   ├─ env.py                 SREGym factory (delegates per t.)│
-│   ├─ basic_runner.py        wrap UnifiedIncidentEnvironment  │
-│   ├─ strategy/runner.py     chain Triage episodes + horizon  │
-│   ├─ operations/runner.py   Python state-machine over 22 nd. │
-│   ├─ ui/                    providers, router, policies      │
-│   ├─ local.py               in-process CLI for Ollama models │
-│   └─ exceptions.py          typed errors                     │
-│                                                              │
-│  coliseum/                  parallel-rollout pool server     │
-│   ├─ server.py              FastAPI lease pool               │
-│   └─ client.py              ArenaClient + create_arena_client│
-│                                                              │
-│  notebooks/                                                  │
-│   └─ 01_triage_train_grpo_qwen25_7b.ipynb   SFT → GRPO pipe. │
-│                                                              │
-│  skill/                     Claude Code skill (Path A)       │
-│   ├─ SKILL.md               agent instructions               │
-│   ├─ tools/                 sre-gym HTTP client              │
-│   └─ verified-runbooks/     12 per-template runbooks         │
-└──────────────────────────────────────────────────────────────┘
-```
-Per-tier deep dives in [`docs/TRIAGE_TIER.md`](docs/TRIAGE_TIER.md) / [`docs/STRATEGY_TIER.md`](docs/STRATEGY_TIER.md) / [`docs/OPERATIONS_TIER.md`](docs/OPERATIONS_TIER.md). Reward design: [`docs/REWARD_DESIGN.md`](docs/REWARD_DESIGN.md). Operator guide: [`execution.md`](execution.md). Architectural narrative: [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md). Blog: [`BLOG.md`](BLOG.md).
 ---
 ## Materials
-- [`openenv.yaml`](openenv.yaml) — declares the three tiers, runnable kinds, scenario counts.
-- [`pyproject.toml`](pyproject.toml) — Python package, deps, entry points.
-- [`docs/`](docs/) — architecture, per-tier deep dives, reward design, scenario authoring guide, references.
-- [`docs/blog/`](docs/blog/) — the 6 blog assets (hero, topology, rubric donut, chaos timeline, two-paths, baselines bar).
-- [`skill/`](skill/) — Claude Code skill packaging (Path A).
-- [`coliseum/`](coliseum/) — parallel-rollout pool server.
-- [`demo/`](demo/) — `run_demo.sh` end-to-end demo, `pitch.md` narrative.
-- [`eval/`](eval/) — held-out split definition, results directory.
-- [`train/data/`](train/data/) — teacher trajectories (Claude Opus + Llama-3.3-70B + scripted baselines + 120-episode v2 corpus).
-- [`notebooks/`](notebooks/) — Triage SFT→GRPO training (`01_triage_train_grpo_qwen25_7b.ipynb`), eval comparison (`02_triage_eval_compare_all.ipynb`), Strategy + Operations walkthroughs.
 ---

 ---
+title: SystemTruth
 emoji: 🚨
 colorFrom: red
 colorTo: yellow
 license: apache-2.0
 ---
+# SystemTruth — a tier-escalating SRE training environment
 > **Hackathon submission — OpenEnv-class, India 2026**
 >
 > - 📖 **Blog:** [BLOG.md](BLOG.md)
+> - 🚀 **Live HF Space:** https://huggingface.co/spaces/Madhav189/SystemTruth
+> - 💻 **GitHub:** https://github.com/Madhav-GPT/SystemTruth
+> - 🧪 **Training notebook (Colab):** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Re9pwkabEP4Cearc2hCMMqGdEjSSUjGu?usp=sharing) — same as [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb)
 > - 📊 **Eval results:** [`eval/results/`](eval/results/)
 > - 📜 **License:** Apache 2.0
 **Each tier escalates a different dimension. Triage escalates compute, Strategy escalates horizon, Operations escalates realism.** That single sentence is the load-bearing claim of the project.
 ---
+## What's in the box (the USP — read this first)
+SystemTruth is **one runnable RL environment with three personas baked into it**. The same 11-action contract, the same 5-component reward rubric, the same termination shape — escalated along three orthogonal axes that map to the three real bottlenecks SRE-agent training loops actually hit.
+![SystemTruth architecture — three tiers under one shared 11-action interface + 5-component rubric](docs/blog/system_architecture.png)
+### One environment, three tiers, three different bottlenecks
+| Tier | Bottleneck | Persona | What it teaches |
+|---|---|---|---|
+| **Triage** | **Compute** | ML student / Kaggle, $30 of HF credits | causal mapping under tight context — pre-digested observations, dense reward shaping, 8K context, 11-action space, 8–13 ticks per episode |
+| **Strategy** | **Horizon** | Seed-stage startup, $300–500 budget | long-horizon planning across chained incidents — multi-incident chains with persistent state, unresolved alerts and pending deploys carry forward, 60–90 ticks |
+| **Operations** | **Realism** | Enterprise SRE platform, 8×A100/H100 cluster | authentic tool use against irreversible actions — 22-node service graph, 11 chaos patterns pinned to real production post-mortems, 110–180+ actions per episode |
+The escalation axis is the entire pitch. Most RL environments stratify by *difficulty* (more scenarios, longer episodes, harder rewards). SystemTruth stratifies by **the dimension that actually limits the training loop for that persona**:
+- A junior on-call learning to triage faces a different problem (cognitive efficiency under tight context) than a senior SRE running a multi-incident postmortem (state tracking across long horizons), which faces a different problem from an enterprise platform team operating against an actively chaos-engineered cluster (irreversible actions, partial observability, real wall-clock).
+- Their training signals, episode shapes, observation richness, and reward structures should not look the same.
+- SystemTruth takes that observation seriously and stratifies its tiers along *the dimension that actually limits the persona's training loop*.
+### The shared 11-action contract
+Every tier — Triage, Strategy, Operations — speaks the same eleven Pydantic-validated actions. **One contract, three escalation envelopes:**
 ```
 query_logs(service)            query_metrics(service, metric)
 declare_resolved
 ```
+A successful episode is `gather evidence → submit_hypothesis → rollback_deploy → restart_service → both run_check pass → declare_resolved`. Wrong rollback target, premature restart, or premature resolution all return negative reward and a typed `failure_type`. The contract refuses to be gamed.
+### The episode lifecycle, illustrated
+The lifecycle below is the Triage tier in detail; Strategy chains N of them with horizon-decay, Operations runs one of them inside a graph-mutation simulator. **The shape is shared across all three tiers** — the simulator under it is what changes.
+![SystemTruth episode lifecycle — Triage tier, same shape inherited by Strategy and Operations](docs/blog/episode_lifecycle.png)
+Eleven numbered stages, each producing a measurable signal:
+1. **`reset(scenario_id)`** — env emits the initial observation: tick counter, workflow stage, incident summary, active alerts, noise alerts (decoys), service health (cpu/mem/err/latency), user impact, SLO burn rate, checks, allowed actions.
+2. **Evidence gathering loop** — agent calls `query_logs / query_metrics / query_dependencies / query_deploys`. After every step the env computes a per-tick **shaped reward** as a potential difference (`Δ critical_service_health × 0.55 + Δ (1 − user_impact) × 0.20 + Δ (1 − slo_burn_rate) × 0.15 + containment_applied × 0.10`) minus `step_cost`, plus `bonus`, minus `penalty`.
+3. **`submit_hypothesis(root_cause, affected_services, confidence, recommended_next_action)`** — the world-modelling primitive. Confidence is a `float ∈ [0,1]` the agent must commit to.
+4. **Hypothesis correctness check** — if the root cause matches truth, the agent gets an in-episode bonus up to ~0.12 (idempotent — second identical hypothesis scores 0). If wrong, the agent loops back to investigation with a new observation.
+5. **`rollback_deploy(service)`** — the irreversible action. Wrong target = `unsafe_action_penalty` (0.08 medium / 0.12 hard). Correct target sets `cause_removed = True` and unblocks restart.
+6. **`restart_service(service)`** — only valid if scenario requires it. Guard: if cause not removed, premature-restart penalty fires and state re-inherits the bad config.
+7. **`run_check("end_to_end" | "database_recovery")`** — verification gate. If checks fail, agent loops back to investigation.
+8. **`declare_resolved`** — terminal action. Guard: if checks not passed, `premature_resolution_penalty` (0.20 / 0.30) fires.
+9. **Episode terminates** — terminal state emitted.
+10. **Compute composite from terminal state** — the 5-component rubric below evaluates outcome / action_validity / format / anticheat / efficiency, sums to 1.0 with weighted clamping to `[0.01, 0.99]`.
+11. **Reference scores anchor the rubric** — random `0.417` (0/36 resolved), naive heuristic `0.749` (0/12 resolved), scripted-optimal `0.938` (12/12 resolved). The 0.20 gap from `0.80 → 1.00` is what GRPO trains into.
+**Cross-tier extension:**
+- **Strategy** chains N Triage episodes, applies a `horizon_decay_factor × mean(per-phase composite)` to the final reward.
+- **Operations** runs the same lifecycle inside a graph-mutation simulator over a 22-node service topology, same rubric, same horizon-decay weighting.
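
The per-tick shaped reward from stage 2 and the Strategy horizon-decay weighting can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the names `IncidentState`, `potential`, `shaped_reward`, and `strategy_reward` are hypothetical, the containment term is assumed to sit inside the potential so that it is differenced like the other three terms, and the numeric inputs are invented for the example. Only the weights (0.55 / 0.20 / 0.15 / 0.10) and the `horizon_decay_factor × mean(per-phase composite)` shape come from the README text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentState:
    """Hypothetical snapshot of the fields the shaped reward reads."""
    critical_service_health: float  # 0.0 (down) .. 1.0 (healthy)
    user_impact: float              # 0.0 (none) .. 1.0 (total)
    slo_burn_rate: float            # 0.0 (none) .. 1.0 (max)
    containment_applied: bool


def potential(state: IncidentState) -> float:
    """Incident-health potential using the weights quoted in the README."""
    return (
        0.55 * state.critical_service_health
        + 0.20 * (1.0 - state.user_impact)
        + 0.15 * (1.0 - state.slo_burn_rate)
        + 0.10 * float(state.containment_applied)  # assumption: part of the potential
    )


def shaped_reward(prev: IncidentState, curr: IncidentState,
                  step_cost: float = 0.0, bonus: float = 0.0,
                  penalty: float = 0.0) -> float:
    """Potential difference minus step cost, plus bonus, minus penalty."""
    return potential(curr) - potential(prev) - step_cost + bonus - penalty


def strategy_reward(per_phase_composites: list[float],
                    horizon_decay_factor: float) -> float:
    """horizon_decay_factor x mean(per-phase composite)."""
    mean = sum(per_phase_composites) / len(per_phase_composites)
    return horizon_decay_factor * mean


if __name__ == "__main__":
    before = IncidentState(0.2, 0.8, 0.9, False)
    after = IncidentState(0.6, 0.5, 0.6, False)
    print(round(shaped_reward(before, after, step_cost=0.01), 3))
    print(round(strategy_reward([0.7, 0.9], horizon_decay_factor=0.5), 3))
```

Because the reward is a difference of potentials, an action that leaves the incident state unchanged earns only the negative step cost, which is what makes pure query spam unprofitable in this sketch.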
+### Reward rubric — the engineering crown jewel
 ```
 final_reward = 0.45·outcome
   efficiency       exp(-current_tick / optimal_ticks_for_template)
 ```
+Each component defends against a specific cheat:
 | Cheat strategy | Blocked by |
 |---|---|
    -1.0   confident-wrong
 ```
+The `confidence ∈ [0,1]` field is part of the structured `HypothesisPayload` Pydantic model the agent emits; the calibration sub-term reads it directly. **A model that bluffs high confidence on a wrong root cause is worse than one that hedges.** That's the world-modelling primitive — the env grades the agent's belief, not just its prediction.
+Two CI invariants pin the rubric in place on every commit:
+- **Heuristic ceiling `[0.65, 0.80]`** — `test_heuristic_ceiling_is_in_band` enforces this band on every template. The 0.20 gap from 0.80 → 1.00 is the GRPO training target.
+- **Scripted-expert floor `≥0.90`** — `test_round2_baseline_resolves` enforces ≥0.90 on the round-2 templates.
+Tighten either side and the gradient signal collapses; loosen either side and a memorising model can game the rubric. The band is the load-bearing engineering claim.
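The shape of these two checks can be sketched as plain predicates. This is not the repo's test code: the real tests presumably run each policy against the environment, whereas the sketch below only applies the two bounds to the reference scores quoted in the README. The helper names are hypothetical.

```python
HEURISTIC_BAND = (0.65, 0.80)  # from the README
SCRIPTED_FLOOR = 0.90          # from the README


def heuristic_in_band(score: float) -> bool:
    """True when a naive-heuristic score sits inside the calibrated band."""
    low, high = HEURISTIC_BAND
    return low <= score <= high


def scripted_above_floor(score: float) -> bool:
    """True when a scripted-optimal score clears the floor."""
    return score >= SCRIPTED_FLOOR


# Reference scores quoted in the README lifecycle section.
assert heuristic_in_band(0.749)
assert scripted_above_floor(0.938)
```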
 ---
 ## Coliseum — parallel-rollout pool server
+[`coliseum/`](coliseum/) wraps the Triage env in a lease-based HTTP contract so a GRPO trainer's parallel-rollout side can drive the env without holding an in-process Python instance per worker:
 ```
 allocate(task_key)                    -> {ok: true, lease_id}
 ## Training & datasets — the honest weak point
+The environment is the project. Training scripts orbit around it. We ran a real end-to-end Triage run on the day of the deadline — pipeline works, results are below the heuristic plateau, the gap is real, and we're saying so.
 ### What we ran
+Pipeline lives in [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb) — also openable in Colab via the badge: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Re9pwkabEP4Cearc2hCMMqGdEjSSUjGu?usp=sharing). Target A100 80GB, ~2-3h end-to-end:
 1. **SFT cold-start** — Unsloth + Qwen2.5-7B-Instruct (4-bit) + LoRA r=32, 50 steps × batch 16 on a 999-example diverse trajectory corpus. Eval perplexity 1.755 (healthy `[1.5, 3.0]` band). Saved to `outputs/qwen25_7b_sft_final/`.
 2. **GRPO online** — TRL's `GRPOTrainer`, 40 steps × K=2 rollouts. Reward = composite + scenario-aware first-action bonus. Saved to `outputs/qwen25_7b_grpo_final/`.
 ### What it produced
+![SystemTruth Triage holdout eval — Qwen2.5-7B, 12 scenarios × 3 seeds](eval/results/qwen25_7b_comparison_hero.png)
 | policy | mean | median | p25 | p75 | resolved_rate |
 |---|---|---|---|---|---|
 | random | 0.342 | 0.378 | 0.340 | 0.380 | 0/36 |
 | heuristic (queries + correct hypothesis, no remediation) | 0.704 | 0.705 | 0.703 | 0.705 | 0/36 |
 | scripted-optimal | 0.938 | 0.939 | 0.937 | 0.940 | 36/36 |
+![Per-template mean score by policy](eval/results/qwen25_7b_comparison_per_template.png)
 Honest reading:
 - SFT lifted the model about 11% (relative) above random (0.342 → 0.379). Format-learning worked: the trained model produces 100% schema-valid `action_type` JSON.
 - GRPO did not move the mean further on K=2 / 40 steps / 7B. The advantage signal exists but the budget is too small to overcome the heuristic plateau at 0.704.
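For reference, the arithmetic behind these two bullets, using only the means reported above (variable names are ad hoc):

```python
random_mean = 0.342     # from the results table
sft_mean = 0.379        # from the bullet above
heuristic_mean = 0.704  # from the results table

# Relative improvement of the SFT model over the random policy.
relative_lift = (sft_mean - random_mean) / random_mean

# Absolute distance still separating the SFT model from the heuristic plateau.
gap_to_heuristic = heuristic_mean - sft_mean
```

The lift is a relative figure (roughly 0.108), while the remaining gap to the heuristic is an absolute 0.325 on the composite score.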
 ---
+## Quickstart
+### 5-minute local demo (no API keys, no server, no GPU)
+```bash
+pip install -e .
+ollama pull llama3.2
+python -m sre_gym.local triage worker_deploy_cascade
+```
+The CLI drives `UnifiedIncidentEnvironment` directly and prints per-tick reward, the 5-component score breakdown, and a final summary.
+### Live HF Space (Triage tier, hosted)
+Open https://huggingface.co/spaces/Madhav189/SystemTruth. Pick a tier, paste an HF token, click **▶ run eval**. Each tick streams the action, env response, reward delta, and the 5-component breakdown.
+### Local server + Gradio UI
+```bash
+make install
+make dev                                              # FastAPI + Gradio on :7860
+python -m sre_gym.strategy run cascading_release_train
+python -m sre_gym.operations run ecommerce_vibecoded_saas --chaos rls_silent_leak
+```
+The FastAPI server speaks the OpenEnv contract (`/reset /step /state /tasks /baseline /grader /status /health /metadata /schema`) plus an MCP JSON-RPC route at `/mcp`.
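A minimal client sketch for these routes, using only the standard library. The port comes from `make dev`; the host, the choice of GET versus POST per route, and the `scenario_id` payload key are assumptions, so check the server's `/schema` or `/docs` output for the real request shapes.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:7860"  # port from `make dev`; host is assumed


def build_request(route: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a request: POST with a JSON body if payload is given, else GET."""
    url = f"{BASE_URL}{route}"
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": "application/json"}
    )


health = build_request("/health")
# The payload shape is a guess, not the documented schema.
reset = build_request("/reset", {"scenario_id": "worker_deploy_cascade"})
```

Nothing is sent until a request is passed to `urllib.request.urlopen`, so the sketch can be inspected without a running server.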
+---
 ## Two-paths agent design
 The repo ships **two independent paths** to a working SRE agent. They share the env contract but trade compute for capability differently.
 bash demo/run_demo.sh                                # end-to-end demo
 ```
 ### Path B — GRPO-trained adapter (one A100, ~2–3h on 7B)
 The training pipeline above. Path A is what an agent ships *today*; Path B is what raises the floor on the templates it sees over and over.
 ---
 ## Tier-aware Python API
 ```python
 ```bash
 make test            # green at HEAD
+ruff check .
 openenv validate .   # green
 ```
 The two CI invariants that keep the rubric calibrated:
 - `test_heuristic_ceiling_is_in_band` — naive heuristic in `[0.65, 0.80]` on every template.
+- `test_round2_baseline_resolves` — scripted-optimal `≥ 0.90` on the round-2 templates.
 ---
 ## Materials
+- [`BLOG.md`](BLOG.md) — the hackathon blog (with all 6 assets in `docs/blog/`)
+- [`openenv.yaml`](openenv.yaml) — declares the three tiers, runnable kinds, scenario counts
+- [`docs/`](docs/) — architecture, per-tier deep dives, reward design, scenario authoring
+- [`docs/blog/`](docs/blog/) — visuals: lifecycle, architecture, hero, topology, rubric donut, chaos timeline, two-paths, baselines bar
+- [`skill/`](skill/) — Claude Code skill packaging (Path A)
+- [`coliseum/`](coliseum/) — parallel-rollout pool server
+- [`demo/`](demo/) — `run_demo.sh` end-to-end demo, `pitch.md` narrative
+- [`eval/`](eval/) — held-out split definition, results directory with the latest eval CSV + plots
+- [`train/data/`](train/data/) — teacher trajectories (Claude Opus + Llama-3.3-70B + scripted baselines + 120-episode v2 corpus)
+- [`notebooks/`](notebooks/) — Triage SFT→GRPO training (`01_triage_train_grpo_qwen25_7b.ipynb`), eval comparison (`02_triage_eval_compare_all.ipynb`), Strategy + Operations walkthroughs
 ---

app.py CHANGED

@@ -84,9 +84,9 @@ TIER_DEFAULT_MODEL: dict[str, str] = {
 TIER_DESCRIPTION: dict[str, str] = {
-    "basic":    "escalates compute · 12 templates × 5 procgen variants · single bounded incident",
-    "advanced": "escalates horizon · chained incidents · persistent state across episodes",
-    "max":      "escalates realism · 22-service ecommerce sim · 11 chaos patterns",
 }
@@ -334,36 +334,41 @@ gradio-app::before {
 }
 .sg-panel-label::before { content: '▸'; color: var(--brand); }
-/* ─── INPUTS — token / model / provider key ──────────────────────────── */
 .sg-panel-col .form, .sg-panel-col .block { background: transparent !important; }
 .sg-panel-col input,
 .sg-panel-col textarea,
 .sg-panel-col select {
-  background: var(--bg-input) !important;
-  border: 1px solid var(--border) !important;
   color: var(--text-primary) !important;
-  font-family: var(--mono) !important; font-size: 12px !important;
-  padding: 8px 10px !important; border-radius: 0 !important;
   box-shadow: none !important;
 }
 .sg-panel-col input:focus,
 .sg-panel-col textarea:focus,
 .sg-panel-col select:focus {
-  border-color: var(--action) !important; outline: none !important;
 }
 .sg-panel-col input::placeholder, .sg-panel-col textarea::placeholder {
-  color: var(--text-faint) !important;
 }
 /* Field labels — Gradio renders <label><span>LABEL</span> ...</label> */
 .sg-panel-col label > span:first-child,
 .sg-panel-col .label-wrap > span,
 .sg-panel-col .label-wrap span {
   color: var(--text-secondary) !important;
-  font-size: 10px !important;
-  letter-spacing: 0.12em !important;
   text-transform: uppercase !important;
-  font-weight: 500 !important;
-  margin-bottom: 5px !important;
 }
 .sg-panel-col label { background: transparent !important; }
@@ -371,16 +376,17 @@ gradio-app::before {
 .sg-panel-col .dropdown,
 .sg-panel-col .wrap-inner,
 .sg-panel-col .options {
-  background: var(--bg-input) !important;
-  border: 1px solid var(--border) !important;
   color: var(--text-primary) !important;
 }
 .sg-panel-col .dropdown ul li:hover,
 .sg-panel-col .options li:hover {
   background: var(--bg-input-hover) !important;
 }
-/* ─── TIER CARDS — 3 styled buttons ──────────────────────────────────── */
 .sg-tier-list, .sg-tier-list .form, .sg-tier-list .gap {
   display: flex !important; flex-direction: column !important; gap: 8px !important;
   background: transparent !important;
@@ -389,14 +395,15 @@ gradio-app::before {
 .sg-tier-card button {
   display: block !important;
   padding: 14px 16px !important;
-  background: var(--bg-input) !important;
-  border: 1px solid var(--border) !important;
   color: var(--text-secondary) !important;
   font-family: var(--mono) !important; font-size: 11.5px !important;
   font-weight: 400 !important;
   text-align: left !important; cursor: pointer !important;
   width: 100% !important; min-height: auto !important;
-  border-radius: 0 !important; box-shadow: none !important;
   transition: all 0.15s ease !important;
   white-space: pre-line !important;
   line-height: 1.55 !important;
@@ -412,16 +419,16 @@ gradio-app::before {
   line-height: 2 !important;
 }
 .sg-tier-card button:hover {
-  background: var(--bg-input-hover) !important;
-  border-color: var(--border-strong) !important;
 }
 .sg-tier-card-selected button {
-  background: rgba(88, 166, 255, 0.06) !important;
-  border-color: var(--action) !important;
-  box-shadow: inset 2px 0 0 var(--action) !important;
 }
 .sg-tier-card-selected button::first-line {
-  color: var(--action) !important;
 }
 /* ─── TERMINAL ────────────────────────────────────────────────────────── */
@@ -455,13 +462,15 @@ gradio-app::before {
 .sg-chrome-status .em { color: var(--text-primary); font-weight: 500; }
 .sg-chrome-meta { color: var(--text-dim); font-size: 11px; }
 .sg-terminal-body {
-  padding: 18px 20px 22px;
-  font-size: 12.5px; line-height: 1.7;
   white-space: pre; overflow-x: auto;
   background: var(--bg-panel);
   background-image: linear-gradient(transparent 50%, rgba(255, 255, 255, 0.012) 50%);
   background-size: 100% 3px;
-  min-height: 480px; max-height: 64vh; overflow-y: auto;
   color: var(--text-primary);
 }
 .sg-terminal-body .ts  { color: var(--timestamp); }
@@ -480,50 +489,60 @@ gradio-app::before {
 }
 @keyframes sg-blink { 50% { opacity: 0; } }
-/* ─── CONTROLS ROW ────────────────────────────────────────────────────── */
 .sg-controls-row {
-  padding: 14px 16px !important;
   background: var(--bg-panel) !important;
   border: 1px solid var(--border) !important;
   margin-bottom: 16px !important;
-  align-items: center !important; gap: 24px !important;
 }
-.sg-btn-group { gap: 8px !important; flex-wrap: nowrap !important; }
 .sg-btn-primary, .sg-btn-secondary {
   flex: 0 0 auto !important; min-width: auto !important;
 }
 .sg-btn-primary button, .sg-btn-secondary button {
   font-family: var(--mono) !important; font-size: 12px !important;
-  font-weight: 600 !important; letter-spacing: 0.06em !important;
   text-transform: uppercase !important;
-  padding: 9px 16px !important; border-radius: 0 !important;
   box-shadow: none !important; min-height: auto !important;
   cursor: pointer !important; transition: all 0.15s ease !important;
 }
 .sg-btn-primary button {
-  background: rgba(63, 185, 80, 0.12) !important;
-  border: 1px solid var(--success) !important;
-  color: var(--success) !important;
 }
-.sg-btn-primary button:hover { background: rgba(63, 185, 80, 0.20) !important; }
 .sg-btn-secondary button {
-  background: var(--bg-input) !important;
   border: 1px solid var(--border-strong) !important;
   color: var(--text-primary) !important;
 }
 .sg-btn-secondary button:hover {
-  background: var(--bg-input-hover) !important;
   border-color: var(--border-focus) !important;
 }
-/* ─── METRICS BAR ─────────────────────────────────────────────────────── */
-/* The HTML sits inside Gradio's html-container — force flex on both. */
 .sg-metrics-host > div, .sg-metrics-host .prose { background: transparent !important; }
 .sg-metrics {
   display: flex !important; align-items: center !important;
   gap: 24px !important; flex-wrap: wrap !important;
   color: var(--text-secondary) !important; font-size: 11px !important;
-  padding: 4px 0 !important;
 }
 .sg-metric {
   display: flex !important; gap: 6px !important; align-items: center !important;
@@ -536,7 +555,7 @@ gradio-app::before {
   color: var(--text-primary) !important; font-weight: 600 !important;
 }
 .sg-metric .value.r { color: var(--reward) !important; }
-.sg-metric .value.s { color: var(--success) !important; }
 .sg-rubric {
   display: flex !important; align-items: center !important; gap: 14px !important;
   padding-left: 18px !important; margin-left: 4px !important;
@@ -558,7 +577,7 @@ gradio-app::before {
   height: 3px !important; background: var(--bg-input) !important;
   overflow: hidden !important; margin-top: 2px !important;
 }
-.sg-rubric-bar > div { height: 100% !important; background: var(--success) !important; }
 /* ─── TIER DESCRIPTION (under the cards) ──────────────────────────────── */
 .sg-tier-desc, .sg-tier-desc * {
@@ -603,17 +622,18 @@ def _header_html() -> str:
     return f"""
 <header class="sg-header">
   <div class="sg-brand-block">
-    <div class="sg-brand-mark">SRE-GYM<span>//</span></div>
     <div class="sg-brand-tagline">
       <em>tier-escalating SRE RL env</em> &nbsp;·&nbsp;
-      RLVE &nbsp;·&nbsp; {THEME_TAGLINE}
     </div>
   </div>
   <nav class="sg-nav">
     <span class="sg-status-dot">env online</span>
     <a href="/docs" target="_blank" rel="noopener">api docs</a>
     <a href="/mcp/tools" target="_blank" rel="noopener">mcp tools</a>
-    <a href="/info" target="_blank" rel="noopener">legacy</a>
   </nav>
 </header>
 """
@@ -651,11 +671,11 @@ FOOTER_HTML = """
   <div>
     built for the openenv hackathon · india apr '26
     &nbsp;·&nbsp;
-    <a href="https://github.com/Madhav-GPT/sre-env" target="_blank">github</a>
     &nbsp;·&nbsp;
-    <a href="https://huggingface.co/spaces/Madhav189/sre-env" target="_blank">hf space</a>
     &nbsp;·&nbsp;
-    <a href="https://github.com/Madhav-GPT/sre-env/blob/main/BLOG.md" target="_blank">blog</a>
   </div>
   <div>multi-rubric reward · RLVE procgen · MCP dual-route</div>
 </footer>
@@ -1231,28 +1251,29 @@ def build_app() -> gr.Blocks:
         # ── terminal pane ──────────────────────────────────────────
         terminal = gr.HTML(_initial_terminal_html(), elem_id="sg-terminal-host")
-        # ── controls + metrics row ────────────────────────────────
-        with gr.Row(elem_classes=["sg-controls-row"]):
-            with gr.Column(scale=0, min_width=280):
-                with gr.Row(elem_classes=["sg-btn-group"]):
-                    run_btn = gr.Button(
-                        "▶  RUN EVAL",
-                        variant="primary",
-                        elem_classes=["sg-btn-primary"],
-                    )
-                    stop_btn = gr.Button(
-                        "■  STOP",
-                        elem_classes=["sg-btn-secondary"],
-                    )
-                    reset_btn = gr.Button(
-                        "↻  RESET",
-                        elem_classes=["sg-btn-secondary"],
-                    )
-            with gr.Column(scale=1):
-                metrics = gr.HTML(
-                    _metric_bar_html(),
-                    elem_classes=["sg-metrics-host"],
                 )
         gr.HTML(FOOTER_HTML)

 TIER_DESCRIPTION: dict[str, str] = {
+    "basic":    "Triage tier · escalates compute · 12 templates × 5 procgen variants · single bounded incident",
+    "advanced": "Strategy tier · escalates horizon · chained incidents · persistent state across episodes",
+    "max":      "Operations tier · escalates realism · 22-service ecommerce sim · 11 chaos patterns",
 }
 }
 .sg-panel-label::before { content: '▸'; color: var(--brand); }
+/* ─── INPUTS — token / model / provider key (LIGHTER + BIGGER) ───────── */
 .sg-panel-col .form, .sg-panel-col .block { background: transparent !important; }
 .sg-panel-col input,
 .sg-panel-col textarea,
 .sg-panel-col select {
+  background: #1f2630 !important;                  /* lighter than the panel */
+  border: 1px solid var(--border-strong) !important;
   color: var(--text-primary) !important;
+  font-family: var(--mono) !important;
+  font-size: 13px !important;                      /* was 12 */
+  padding: 12px 14px !important;                   /* was 8/10 */
+  border-radius: 4px !important;                   /* was 0 — softer, more usable */
   box-shadow: none !important;
+  min-height: 42px !important;                     /* taller for usability */
 }
 .sg-panel-col input:focus,
 .sg-panel-col textarea:focus,
 .sg-panel-col select:focus {
+  border-color: var(--brand) !important;           /* phosphor accent on focus */
+  outline: none !important;
+  box-shadow: 0 0 0 1px rgba(126, 231, 135, 0.25) !important;
 }
 .sg-panel-col input::placeholder, .sg-panel-col textarea::placeholder {
+  color: var(--text-dim) !important;               /* was --text-faint */
 }
 /* Field labels — Gradio renders <label><span>LABEL</span> ...</label> */
 .sg-panel-col label > span:first-child,
 .sg-panel-col .label-wrap > span,
 .sg-panel-col .label-wrap span {
   color: var(--text-secondary) !important;
+  font-size: 11px !important;
+  letter-spacing: 0.14em !important;
   text-transform: uppercase !important;
+  font-weight: 600 !important;
+  margin-bottom: 6px !important;
 }
 .sg-panel-col label { background: transparent !important; }
 .sg-panel-col .dropdown,
 .sg-panel-col .wrap-inner,
 .sg-panel-col .options {
+  background: #1f2630 !important;
+  border: 1px solid var(--border-strong) !important;
   color: var(--text-primary) !important;
+  border-radius: 4px !important;
 }
 .sg-panel-col .dropdown ul li:hover,
 .sg-panel-col .options li:hover {
   background: var(--bg-input-hover) !important;
 }
+/* ─── TIER CARDS — 3 styled buttons (theme-cohesive phosphor accent) ─── */
 .sg-tier-list, .sg-tier-list .form, .sg-tier-list .gap {
   display: flex !important; flex-direction: column !important; gap: 8px !important;
   background: transparent !important;
 .sg-tier-card button {
   display: block !important;
   padding: 14px 16px !important;
+  background: #1f2630 !important;                  /* match input bg */
+  border: 1px solid var(--border-strong) !important;
   color: var(--text-secondary) !important;
   font-family: var(--mono) !important; font-size: 11.5px !important;
   font-weight: 400 !important;
   text-align: left !important; cursor: pointer !important;
   width: 100% !important; min-height: auto !important;
+  border-radius: 4px !important;
+  box-shadow: none !important;
   transition: all 0.15s ease !important;
   white-space: pre-line !important;
   line-height: 1.55 !important;
   line-height: 2 !important;
 }
 .sg-tier-card button:hover {
+  background: #252d38 !important;
+  border-color: var(--border-focus) !important;
 }
 .sg-tier-card-selected button {
+  background: rgba(126, 231, 135, 0.06) !important;        /* phosphor wash */
+  border-color: var(--brand) !important;
+  box-shadow: inset 3px 0 0 var(--brand) !important;
 }
 .sg-tier-card-selected button::first-line {
+  color: var(--brand) !important;                          /* matches header brand */
 }
 /* ─── TERMINAL ────────────────────────────────────────────────────────── */
 .sg-chrome-status .em { color: var(--text-primary); font-weight: 500; }
 .sg-chrome-meta { color: var(--text-dim); font-size: 11px; }
 .sg-terminal-body {
+  padding: 16px 20px 18px;
+  font-size: 12.5px; line-height: 1.65;
   white-space: pre; overflow-x: auto;
   background: var(--bg-panel);
   background-image: linear-gradient(transparent 50%, rgba(255, 255, 255, 0.012) 50%);
   background-size: 100% 3px;
+  min-height: 280px;                    /* was 480 — visible above the fold */
+  max-height: 56vh;                     /* still scrolls if a long run */
+  overflow-y: auto;
   color: var(--text-primary);
 }
 .sg-terminal-body .ts  { color: var(--timestamp); }
 }
 @keyframes sg-blink { 50% { opacity: 0; } }
+/* ─── CONTROLS ROW — stacks vertically: buttons on top, metrics below ── */
+/* Now a gr.Column wrapped with this class — Gradio gives us flex-direction:
+   column for free, but we still pin it for browsers that style differently. */
 .sg-controls-row {
+  padding: 16px 18px !important;
   background: var(--bg-panel) !important;
   border: 1px solid var(--border) !important;
   margin-bottom: 16px !important;
+  display: flex !important;
+  flex-direction: column !important;
+  gap: 14px !important;
+  align-items: stretch !important;
+}
+.sg-btn-group {
+  gap: 10px !important;
+  flex-wrap: wrap !important;                    /* on narrow screens buttons wrap rather than overflow */
+  justify-content: flex-start !important;
 }
 .sg-btn-primary, .sg-btn-secondary {
   flex: 0 0 auto !important; min-width: auto !important;
 }
 .sg-btn-primary button, .sg-btn-secondary button {
   font-family: var(--mono) !important; font-size: 12px !important;
+  font-weight: 700 !important; letter-spacing: 0.08em !important;
   text-transform: uppercase !important;
+  padding: 11px 22px !important;                 /* a touch bigger so it stands alone on its row */
+  border-radius: 4px !important;
   box-shadow: none !important; min-height: auto !important;
   cursor: pointer !important; transition: all 0.15s ease !important;
 }
 .sg-btn-primary button {
+  background: rgba(126, 231, 135, 0.10) !important;
+  border: 1px solid var(--brand) !important;
+  color: var(--brand) !important;
 }
+.sg-btn-primary button:hover { background: rgba(126, 231, 135, 0.18) !important; }
 .sg-btn-secondary button {
+  background: #1f2630 !important;
   border: 1px solid var(--border-strong) !important;
   color: var(--text-primary) !important;
 }
 .sg-btn-secondary button:hover {
+  background: #252d38 !important;
   border-color: var(--border-focus) !important;
 }
+/* ─── METRICS BAR (now sits under the run buttons) ───────────────────── */
+.sg-metrics-host { padding-top: 8px !important; border-top: 1px solid var(--border) !important; }
 .sg-metrics-host > div, .sg-metrics-host .prose { background: transparent !important; }
 .sg-metrics {
   display: flex !important; align-items: center !important;
   gap: 24px !important; flex-wrap: wrap !important;
   color: var(--text-secondary) !important; font-size: 11px !important;
+  padding: 6px 0 0 !important;
 }
 .sg-metric {
   display: flex !important; gap: 6px !important; align-items: center !important;
   color: var(--text-primary) !important; font-weight: 600 !important;
 }
 .sg-metric .value.r { color: var(--reward) !important; }
+.sg-metric .value.s { color: var(--brand) !important; }     /* phosphor — theme cohesion */
 .sg-rubric {
   display: flex !important; align-items: center !important; gap: 14px !important;
   padding-left: 18px !important; margin-left: 4px !important;
   height: 3px !important; background: var(--bg-input) !important;
   overflow: hidden !important; margin-top: 2px !important;
 }
+.sg-rubric-bar > div { height: 100% !important; background: var(--brand) !important; }
 /* ─── TIER DESCRIPTION (under the cards) ──────────────────────────────── */
 .sg-tier-desc, .sg-tier-desc * {
     return f"""
 <header class="sg-header">
   <div class="sg-brand-block">
+    <div class="sg-brand-mark">SystemTruth<span>//</span></div>
     <div class="sg-brand-tagline">
       <em>tier-escalating SRE RL env</em> &nbsp;·&nbsp;
+      Triage / Strategy / Operations &nbsp;·&nbsp; {THEME_TAGLINE}
     </div>
   </div>
   <nav class="sg-nav">
     <span class="sg-status-dot">env online</span>
     <a href="/docs" target="_blank" rel="noopener">api docs</a>
     <a href="/mcp/tools" target="_blank" rel="noopener">mcp tools</a>
+    <a href="https://github.com/Madhav-GPT/SystemTruth" target="_blank" rel="noopener">github</a>
+    <a href="https://github.com/Madhav-GPT/SystemTruth/blob/main/BLOG.md" target="_blank" rel="noopener">blog</a>
   </nav>
 </header>
 """
   <div>
     built for the openenv hackathon · india apr '26
     &nbsp;·&nbsp;
+    <a href="https://github.com/Madhav-GPT/SystemTruth" target="_blank">github</a>
     &nbsp;·&nbsp;
+    <a href="https://huggingface.co/spaces/Madhav189/SystemTruth" target="_blank">hf space</a>
     &nbsp;·&nbsp;
+    <a href="https://github.com/Madhav-GPT/SystemTruth/blob/main/BLOG.md" target="_blank">blog</a>
   </div>
   <div>multi-rubric reward · RLVE procgen · MCP dual-route</div>
 </footer>
         # ── terminal pane ──────────────────────────────────────────
         terminal = gr.HTML(_initial_terminal_html(), elem_id="sg-terminal-host")
+        # ── controls + metrics — stacked vertically (buttons on top, ──
+        #    metrics below). Using a single Column with two children means
+        #    the metrics bar gets the full width on its own row instead of
+        #    fighting the buttons for horizontal space.
+        with gr.Column(elem_classes=["sg-controls-row"]):
+            with gr.Row(elem_classes=["sg-btn-group"]):
+                run_btn = gr.Button(
+                    "▶  RUN EVAL",
+                    variant="primary",
+                    elem_classes=["sg-btn-primary"],
+                )
+                stop_btn = gr.Button(
+                    "■  STOP",
+                    elem_classes=["sg-btn-secondary"],
                 )
+                reset_btn = gr.Button(
+                    "↻  RESET",
+                    elem_classes=["sg-btn-secondary"],
+                )
+            metrics = gr.HTML(
+                _metric_bar_html(),
+                elem_classes=["sg-metrics-host"],
+            )
         gr.HTML(FOOTER_HTML)

docs/TRIAGE_TIER.md CHANGED

@@ -179,7 +179,7 @@ End-to-end: ~2-3h on a single A100 80GB, ~$5–8 of HF compute credits.
 - ✅ 12 templates × 5 procgen variants = 72 scenarios live
 - ✅ Pytest suite green
 - ✅ `openenv validate .` green
-- ✅ HF Space deployed at `Madhav189-sre-env.hf.space`
 - ✅ Training notebook runs end-to-end on A100 80GB
 - ✅ Eval comparison cell produces hero bar + per-template chart
 - ✅ Trained Qwen2.5-7B adapter saved to `outputs/qwen25_7b_grpo_final/`

 - ✅ 12 templates × 5 procgen variants = 72 scenarios live
 - ✅ Pytest suite green
 - ✅ `openenv validate .` green
+- ✅ HF Space deployed at `Madhav189-SystemTruth.hf.space`
 - ✅ Training notebook runs end-to-end on A100 80GB
 - ✅ Eval comparison cell produces hero bar + per-template chart
 - ✅ Trained Qwen2.5-7B adapter saved to `outputs/qwen25_7b_grpo_final/`

docs/blog/episode_lifecycle.png ADDED

Git LFS Details

SHA256: c2c6cb6f943bec6f05e710a392a522c16fe0ef6d07dd0b28e6e47b5ebe8d8342
Pointer size: 132 Bytes
Size of remote file: 1.65 MB

docs/blog/system_architecture.png ADDED

Git LFS Details

SHA256: 3053dde32d4b31c2ce8f1bded635c48ce27e9ed38c78d0733d74d1617193f8bb
Pointer size: 132 Bytes
Size of remote file: 1.8 MB

eval/results/qwen25_7b_comparison_hero.png ADDED

Git LFS Details

SHA256: 024960f21e6eaa4ad726bfc18fe7289d587dfdf12d2b02fd31fe2c4f8faaab03
Pointer size: 130 Bytes
Size of remote file: 68.4 kB

eval/results/qwen25_7b_comparison_per_template.png ADDED

Git LFS Details

SHA256: 11c9aae6e8ee1bbc8653c6a7f50971adb628e688ab4883647828c2f4f80c1706
Pointer size: 131 Bytes
Size of remote file: 104 kB

execution.md CHANGED

@@ -67,7 +67,7 @@ The honest framing: **the env is the project, the rubric is the engineering crow
 ## 2. Local setup
 ```bash
-git clone https://github.com/Madhav-GPT/sre-env.git
 cd sre-env
 python3 -m venv .venv
@@ -280,7 +280,7 @@ app_port: 7860
 ```bash
 # One-time: add the HF Space as a git remote
-git remote add hf https://huggingface.co/spaces/Madhav189/sre-env
 # Push (HF prompts for token if not cached)
 git push hf main
@@ -340,8 +340,8 @@ bash demo/run_demo.sh
 ## 12. Submission checklist
-- [x] Repo public on GitHub: https://github.com/Madhav-GPT/sre-env
-- [x] HF Space live: https://huggingface.co/spaces/Madhav189/sre-env
 - [x] BLOG.md at repo root
 - [x] 6 blog assets in `docs/blog/`
 - [x] Training notebook executed end-to-end, results in `eval/results/`

 ## 2. Local setup
 ```bash
+git clone https://github.com/Madhav-GPT/SystemTruth.git
 cd SystemTruth
 python3 -m venv .venv
 ```bash
 # One-time: add the HF Space as a git remote
+git remote add hf https://huggingface.co/spaces/Madhav189/SystemTruth
 # Push (HF prompts for token if not cached)
 git push hf main
 ## 12. Submission checklist
+- [x] Repo public on GitHub: https://github.com/Madhav-GPT/SystemTruth
+- [x] HF Space live: https://huggingface.co/spaces/Madhav189/SystemTruth
 - [x] BLOG.md at repo root
 - [x] 6 blog assets in `docs/blog/`
 - [x] Training notebook executed end-to-end, results in `eval/results/`

openenv.yaml CHANGED

@@ -116,7 +116,7 @@ training:
     templates_covered: 12                 # all 12 Triage templates have ≥5 episodes
 huggingface:
-  space_id: Madhav189/sre-env             # canonical HF Space
-  github_repo: Madhav-GPT/sre-env         # canonical GitHub
   sdk: docker
   hardware: cpu-basic

     templates_covered: 12                 # all 12 Triage templates have ≥5 episodes
 huggingface:
+  space_id: Madhav189/SystemTruth             # canonical HF Space
+  github_repo: Madhav-GPT/SystemTruth         # canonical GitHub
   sdk: docker
   hardware: cpu-basic