Madhav189 committed on
Commit
6583a07
·
1 Parent(s): e8774c9

SystemTruth rebrand: bigger UI, new diagrams, theme-cohesive HF Space


- Rebrand sre-env -> SystemTruth across user-facing surfaces:
    * GitHub: github.com/Madhav-GPT/SystemTruth
    * HF Space: huggingface.co/spaces/Madhav189/SystemTruth
    * Live URL: Madhav189-SystemTruth.hf.space
    * HF Space header brand mark + tagline
  The internal package name (sre_gym/) is preserved for backwards
  compatibility; the user-facing surface is fully renamed.

- README rewrite: much bigger "what's in the box" section explaining
  the three-tier USP (Triage compute / Strategy horizon / Operations
  realism). Embeds the 4 visual assets:
    * docs/blog/system_architecture.png (architecture diagram)
    * docs/blog/episode_lifecycle.png (lifecycle diagram)
    * eval/results/qwen25_7b_comparison_hero.png (hero bar)
    * eval/results/qwen25_7b_comparison_per_template.png (per-template)
  Adds the Colab badge link to
  colab.research.google.com/drive/1Re9pwkabEP4Cearc2hCMMqGdEjSSUjGu
  for the training notebook.

- HF Space UI fixes per user critique:
    * Inputs lighter (#1f2630, was bg-input #161b22) and bigger
      (padding 12px 14px, font-size 13px, min-height 42px, border-radius
      4px) for usability.
    * Tier cards now use phosphor-green accent (matches header brand mark)
      instead of out-of-place blue. New text reads
      "Triage / Strategy / Operations" instead of "Basic / Advanced / Max".
    * Controls row now stacks vertically: run/stop/reset buttons on top,
      metrics + rubric bar BELOW (was side-by-side, cramped).
    * Terminal min-height reduced 480 -> 280 so it's visible above the
      fold without scrolling on a typical laptop viewport.
    * RUN EVAL button uses brand phosphor green (was generic Gradio
      success-green); rubric bars + resolved-rate value also use the
      brand color for theme cohesion.

- BLOG.md title updated: "sre-gym" -> "SystemTruth", body refs follow.

- execution.md: clone URL points at Madhav-GPT/SystemTruth.

- openenv.yaml: huggingface.space_id and github_repo updated.

- Eval result PNGs (qwen25_7b_comparison_hero + per_template)
  added to git via .gitignore exception so the README + BLOG
  references resolve without a separate fetch.

.gitignore CHANGED
@@ -29,3 +29,6 @@ eval/results/*.png
29
  eval/results/*.jsonl
30
  !eval/results/.gitkeep
31
  !eval/results/README.md
32
+ # Exception: keep the canonical charts referenced from README/BLOG
33
+ !eval/results/qwen25_7b_comparison_hero.png
34
+ !eval/results/qwen25_7b_comparison_per_template.png
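The two negation rules added in this hunk rely on Git's last-match-wins ordering: a later `!pattern` re-includes a file that an earlier pattern ignored. A minimal Python sketch of that ordering follows. It assumes `fnmatch` globbing as a stand-in for Git's matcher; the `is_ignored` helper is hypothetical, and real `.gitignore` semantics differ in details (for example, `*` does not cross `/` in Git).

```python
from fnmatch import fnmatch

# Patterns visible in the hunk, in file order.
PATTERNS = [
    "eval/results/*.png",
    "eval/results/*.jsonl",
    "!eval/results/.gitkeep",
    "!eval/results/README.md",
    "!eval/results/qwen25_7b_comparison_hero.png",
    "!eval/results/qwen25_7b_comparison_per_template.png",
]


def is_ignored(path: str, patterns: list[str]) -> bool:
    """Hypothetical helper: the last pattern that matches decides the outcome."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatch(path, glob):
            # A plain pattern ignores the path; a negated one re-includes it.
            ignored = not negated
    return ignored
```

Under this simplified model the two canonical charts are tracked while any other PNG in `eval/results/` stays ignored, which is the behaviour the commit message describes.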
BLOG.md CHANGED
@@ -1,12 +1,12 @@
1
  ---
2
- title: "sre-gym — three tiers of SRE incident-response, one rubric that won't let you fake it"
3
  thumbnail: docs/blog/hero_three_tiers.png
4
  authors:
5
  - user: Madhav189
6
  - user: dakshdoesdev
7
  ---
8
 
9
- # sre-gym — three tiers of SRE incident-response, one rubric that won't let you fake it
10
 
11
  **TL;DR**
12
 
@@ -18,7 +18,7 @@ authors:
18
 
19
  ## Why this matters (read this first)
20
 
21
- Calibrated incident-response is the capability gap. Every general-purpose LLM is bad at it: they hallucinate confident root causes, over-trust the loudest signal, skip verification, and declare incidents resolved before checking anything. Those failure modes are invisible in chat demos and catastrophic in production. **sre-gym makes them legible enough to measure, then small enough to fix** — and exposes the env via the OpenEnv contract so any RL stack can train against it.
22
 
23
  We treat incident-response as a small **world-modelling** problem: the agent has to maintain a hidden-state estimate of which service is actually broken, update it from noisy observations, and commit to irreversible actions under uncertainty. The 5-component rubric grades the *mechanical signature* of that loop — evidence first, hypothesis with calibrated confidence, remediation, verification, only then resolution — instead of rewarding output that merely looks right.
24
 
@@ -37,7 +37,7 @@ ollama pull llama3.2
37
  python -m sre_gym.local triage worker_deploy_cascade
38
  ```
39
 
40
- The CLI drives `UnifiedIncidentEnvironment` directly and prints per-tick reward, the 5-component score breakdown, and a final summary. Same code path as the HF Space at `https://huggingface.co/spaces/Madhav189/sre-env` — just without the Gradio UI in front of it.
41
 
42
  ## Three tiers, three bottlenecks
43
 
@@ -197,12 +197,12 @@ Reading post-mortems, not blog posts. Fly.io's gossip-protocol deadlock from Oct
197
 
198
  The Triage env is live. Pick a scenario, pick a model provider, watch each tick stream the action, env response, reward delta, and rubric breakdown.
199
 
200
- <iframe src="https://Madhav189-sre-env.hf.space" frameborder="0" width="100%" height="800"></iframe>
201
 
202
  For the per-tier deep dives: [`docs/TRIAGE_TIER.md`](docs/TRIAGE_TIER.md) · [`docs/STRATEGY_TIER.md`](docs/STRATEGY_TIER.md) · [`docs/OPERATIONS_TIER.md`](docs/OPERATIONS_TIER.md). For the rubric defense: [`docs/REWARD_DESIGN.md`](docs/REWARD_DESIGN.md). For the architectural narrative: [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md). For the operator guide: [`execution.md`](execution.md). The training notebook lives at [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb).
203
 
204
  ## The claim
205
 
206
- sre-gym is the first SRE training environment that grades calibrated confidence as a first-class signal. The rubric tells you exactly where your model is bluffing — to two decimal places, on every commit, with a CI invariant that fails the build if the heuristic ceiling drifts out of band. Train against it and the hidden-state estimate inside your model gets sharper episode by episode. Skip the rubric and your agent stays a chat-window demo.
207
 
208
  Built for the OpenEnv-class hackathon, India 2026 — by the Madhav-GPT / dakshdoesdev team. Apache 2.0.
 
1
  ---
2
+ title: "SystemTruth — three tiers of SRE incident-response, one rubric that won't let you fake it"
3
  thumbnail: docs/blog/hero_three_tiers.png
4
  authors:
5
  - user: Madhav189
6
  - user: dakshdoesdev
7
  ---
8
 
9
+ # SystemTruth — three tiers of SRE incident-response, one rubric that won't let you fake it
10
 
11
  **TL;DR**
12
 
 
18
 
19
  ## Why this matters (read this first)
20
 
21
+ Calibrated incident-response is the capability gap. Every general-purpose LLM is bad at it: they hallucinate confident root causes, over-trust the loudest signal, skip verification, and declare incidents resolved before checking anything. Those failure modes are invisible in chat demos and catastrophic in production. **SystemTruth makes them legible enough to measure, then small enough to fix** — and exposes the env via the OpenEnv contract so any RL stack can train against it.
22
 
23
  We treat incident-response as a small **world-modelling** problem: the agent has to maintain a hidden-state estimate of which service is actually broken, update it from noisy observations, and commit to irreversible actions under uncertainty. The 5-component rubric grades the *mechanical signature* of that loop — evidence first, hypothesis with calibrated confidence, remediation, verification, only then resolution — instead of rewarding output that merely looks right.
24
 
 
37
  python -m sre_gym.local triage worker_deploy_cascade
38
  ```
39
 
40
+ The CLI drives `UnifiedIncidentEnvironment` directly and prints per-tick reward, the 5-component score breakdown, and a final summary. Same code path as the HF Space at `https://huggingface.co/spaces/Madhav189/SystemTruth` — just without the Gradio UI in front of it.
41
 
42
  ## Three tiers, three bottlenecks
43
 
 
197
 
198
  The Triage env is live. Pick a scenario, pick a model provider, watch each tick stream the action, env response, reward delta, and rubric breakdown.
199
 
200
+ <iframe src="https://Madhav189-SystemTruth.hf.space" frameborder="0" width="100%" height="800"></iframe>
201
 
202
  For the per-tier deep dives: [`docs/TRIAGE_TIER.md`](docs/TRIAGE_TIER.md) · [`docs/STRATEGY_TIER.md`](docs/STRATEGY_TIER.md) · [`docs/OPERATIONS_TIER.md`](docs/OPERATIONS_TIER.md). For the rubric defense: [`docs/REWARD_DESIGN.md`](docs/REWARD_DESIGN.md). For the architectural narrative: [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md). For the operator guide: [`execution.md`](execution.md). The training notebook lives at [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb).
203
 
204
  ## The claim
205
 
206
+ SystemTruth is the first SRE training environment that grades calibrated confidence as a first-class signal. The rubric tells you exactly where your model is bluffing — to two decimal places, on every commit, with a CI invariant that fails the build if the heuristic ceiling drifts out of band. Train against it and the hidden-state estimate inside your model gets sharper episode by episode. Skip the rubric and your agent stays a chat-window demo.
207
 
208
  Built for the OpenEnv-class hackathon, India 2026 — by the Madhav-GPT / dakshdoesdev team. Apache 2.0.
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- title: SRE Gym
3
  emoji: 🚨
4
  colorFrom: red
5
  colorTo: yellow
@@ -9,72 +9,44 @@ pinned: false
9
  license: apache-2.0
10
  ---
11
 
12
- # sre-gym — a tier-escalating SRE training environment
13
 
14
  > **Hackathon submission — OpenEnv-class, India 2026**
15
  >
16
  > - 📖 **Blog:** [BLOG.md](BLOG.md)
17
- > - 🚀 **Live HF Space:** https://huggingface.co/spaces/Madhav189/sre-env
18
- > - 💻 **GitHub:** https://github.com/Madhav-GPT/sre-env
19
- > - 🧪 **Training notebook:** [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb)
20
  > - 📊 **Eval results:** [`eval/results/`](eval/results/)
21
  > - 📜 **License:** Apache 2.0
22
 
23
  **Each tier escalates a different dimension. Triage escalates compute, Strategy escalates horizon, Operations escalates realism.** That single sentence is the load-bearing claim of the project.
24
 
25
- The repo's centre of gravity, in priority order:
26
-
27
- 1. **The environment** — 12 incident templates × 6 procgen variants = 72 deterministic scenarios, exposed via the OpenEnv contract (`/reset` / `/step`) on a FastAPI server. Same code path serves the Gradio UI mounted at `/`.
28
- 2. **The reward rubric** — a 5-component composite that sums to exactly 1.0, with a heuristic ceiling pinned to `[0.65, 0.80]` and a scripted-expert floor at `≥0.90`, both enforced by CI invariants on every commit. Includes a calibration term inside `submit_hypothesis` that grades confident-wrong twice as harshly as hedged-wrong — a small **world-modelling** primitive: the agent has to maintain a belief over root causes and emit a calibrated confidence estimate.
29
- 3. **Coliseum** — a parallel-rollout pool server that turns the env into a lease-based HTTP service so any GRPO trainer can drive K-rollouts-per-scenario without holding a Python env per worker.
30
- 4. **Training & datasets (the honest weak point)** — an end-to-end SFT → GRPO pipeline on Qwen2.5-7B-Instruct, trained against a 120-episode trajectory corpus harvested from the env. The pipeline runs cleanly; the corpus and step budget are smaller than they need to be to break the heuristic ceiling on held-out scenarios. **The env is ready to train against; we ran out of compute before the model was.**
31
-
32
- ---
33
-
34
- ## What's in the box
35
-
36
- | Tier | Runnable kind | Scenarios | What "running" means |
37
- |---|---|---|---|
38
- | **Triage** | live HTTP env | 12 templates × 6 entries each (1 base + 5 procgen) = **72 scenarios** | `/reset` + `/step` against the FastAPI server in this Docker image. The Gradio UI drives episodes end-to-end via the same routes. |
39
- | **Strategy** | Python orchestrator | 3 reference YAML scenarios | `sre_gym.strategy.runner.run_strategy` chains Triage episodes together, threading horizon state (unresolved alerts, pending deploys, tech-debt counter, horizon-decay reward). The 28-action universe in the YAML is design spec; the runner uses the Triage 11 actions. |
40
- | **Operations** | Python state-machine simulator | 1 family with 11 chaos patterns | `sre_gym.operations.runner.run_operations` mutates an in-memory 22-node service graph. Same Triage 11 actions. The compose stack alongside the simulator describes the topology an enterprise team would lift into a real cluster — the simulator runs without that lift. |
41
-
42
- The escalation axis is the point: each tier hardens a different bottleneck of building SRE agents in production.
43
-
44
  ---
45
 
46
- ## Quickstart
47
-
48
- ### 5-minute local demo (no API keys, no server, no GPU)
49
-
50
- ```bash
51
- pip install -e .
52
- ollama pull llama3.2
53
- python -m sre_gym.local triage worker_deploy_cascade
54
- ```
55
 
56
- The CLI drives `UnifiedIncidentEnvironment` directly and prints per-tick reward, the 5-component score breakdown, and a final summary. See [`sre_gym/local.py`](sre_gym/local.py) for the full flag set.
57
 
58
- ### Live HF Space (Triage tier, hosted)
59
 
60
- Open https://huggingface.co/spaces/Madhav189/sre-env. Pick a scenario and a model provider, click **▶ run eval**. Each tick streams the action, env response, reward delta, and the 5-component breakdown.
61
-
62
- ### Local server + Gradio UI
63
 
64
- ```bash
65
- make install
66
- make dev # FastAPI + Gradio on :7860
67
- python -m sre_gym.strategy run cascading_release_train
68
- python -m sre_gym.operations run ecommerce_vibecoded_saas --chaos rls_silent_leak
69
- ```
70
 
71
- The FastAPI server speaks the OpenEnv contract (`/reset /step /state /tasks /baseline /grader /status /health /metadata /schema`) plus an MCP JSON-RPC route at `/mcp`.
72
 
73
- ---
 
 
74
 
75
- ## The Triage tier — the runnable contract
76
 
77
- 12 base templates of one-incident-at-a-time scenarios; each generates 5 procgen variants for 72 scenarios total. The agent has 11 bounded actions:
78
 
79
  ```
80
  query_logs(service) query_metrics(service, metric)
@@ -85,17 +57,33 @@ submit_hypothesis(hypothesis) escalate
85
  declare_resolved
86
  ```
87
 
88
- A successful episode looks like: `gather evidence → submit_hypothesis → rollback_deploy → restart_service → both run_check pass → declare_resolved`. Wrong rollback target, premature restart, or premature resolution all return negative reward and a typed `failure_type`.
89
 
90
- Services live in a 4-node topology (`api-gateway / cache / database / worker`) plus an 11-service noise-decoy pool that surfaces in alerts as decoys but never in queries. Each scenario specifies a root cause, the correct rollback target, the resolution check that must pass, and the decoy traps. See [`docs/TRIAGE_TIER.md`](docs/TRIAGE_TIER.md) for the per-template skill table.
91
 
92
- The Triage server is **the** runnable contract: Strategy and Operations chain Triage episodes; Coliseum (below) wraps Triage in a lease-based HTTP shape; the local CLI imports the Triage env in-process. Everything else is a runner shape on top.
93
 
94
- ---
95
 
96
- ## The reward rubric: the engineering crown jewel
97
 
98
- Triage uses a **5-component rubric** that sums to exactly 1.0; see [`docs/REWARD_DESIGN.md`](docs/REWARD_DESIGN.md):
99
 
100
  ```
101
  final_reward = 0.45·outcome
@@ -113,14 +101,7 @@ final_reward = 0.45·outcome
113
  efficiency exp(-current_tick / optimal_ticks_for_template)
114
  ```
115
 
116
- Plus per-tick *shaped* reward (the change in incident-health potential) for dense GRPO signal. Strategy and Operations reuse the Triage rubric and apply a horizon-decay factor over per-phase composites.
117
-
118
- Two reference scores anchor the rubric and are CI-pinned:
119
-
120
- - **Heuristic ceiling `[0.65, 0.80]`** — a naive policy that gathers evidence and submits the correct hypothesis but never remediates lands here. Enforced by `test_heuristic_ceiling_is_in_band` across all 12 templates. The 0.20 gap from 0.80 → 1.00 is the GRPO training target.
121
- - **Scripted-expert floor `≥0.90`** — the optimal canonical solve scores ~0.94 on every template. Enforced by `test_round2_baseline_resolves`.
122
-
123
- Adversarial cheats are first-class:
124
 
125
  | Cheat strategy | Blocked by |
126
  |---|---|
@@ -145,13 +126,19 @@ calibration awards
145
  -1.0 confident-wrong
146
  ```
147
 
148
- The `confidence ∈ [0,1]` field is part of the structured `HypothesisPayload` Pydantic model the agent emits; the calibration sub-term reads it directly. **A model that bluffs high confidence on a wrong root cause is worse than one that hedges.** That's the world-modelling primitive — the env is grading the agent's belief, not just its prediction.
149
 
150
  ---
151
 
152
  ## Coliseum — parallel-rollout pool server
153
 
154
- [`coliseum/`](coliseum/) wraps the Triage env in a lease-based HTTP contract so a GRPO trainer's parallel-rollout side can drive the env without holding an in-process `UnifiedIncidentEnvironment` per worker:
155
 
156
  ```
157
  allocate(task_key) -> {ok: true, lease_id}
@@ -177,11 +164,11 @@ Standard lease-pool pattern — see [`coliseum/README.md`](coliseum/README.md) f
177
 
178
  ## Training & datasets — the honest weak point
179
 
180
- The environment is the project. Training scripts orbit around it. We ran a real end-to-end Triage run on the day of the deadline — pipeline works, results are below the heuristic, the gap is real, and we're saying so.
181
 
182
  ### What we ran
183
 
184
- Pipeline lives in [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb) (target: A100 80GB, ~2-3h end-to-end):
185
 
186
  1. **SFT cold-start** — Unsloth + Qwen2.5-7B-Instruct (4-bit) + LoRA r=32, 50 steps × batch 16 on a 999-example diverse trajectory corpus. Eval perplexity 1.755 (healthy `[1.5, 3.0]` band). Saved to `outputs/qwen25_7b_sft_final/`.
187
  2. **GRPO online** — TRL's `GRPOTrainer`, 40 steps × K=2 rollouts. Reward = composite + scenario-aware first-action bonus. Saved to `outputs/qwen25_7b_grpo_final/`.
@@ -189,6 +176,8 @@ Pipeline lives in [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/0
189
 
190
  ### What it produced
191
 
 
 
192
  | policy | mean | median | p25 | p75 | resolved_rate |
193
  |---|---|---|---|---|---|
194
  | random | 0.342 | 0.378 | 0.340 | 0.380 | 0/36 |
@@ -197,6 +186,8 @@ Pipeline lives in [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/0
197
  | heuristic (queries + correct hypothesis, no remediation) | 0.704 | 0.705 | 0.703 | 0.705 | 0/36 |
198
  | scripted-optimal | 0.938 | 0.939 | 0.937 | 0.940 | 36/36 |
199
 
 
 
200
  Honest reading:
201
  - SFT lifted the model 11% above random (0.342 → 0.379). Format-learning worked: the trained model produces 100% schema-valid `action_type` JSON.
202
  - GRPO did not move the mean further on K=2 / 40 steps / 7B. The advantage signal exists but the budget is too small to overcome the heuristic plateau at 0.704.
@@ -210,6 +201,35 @@ The training scripts in [`train/`](train/) are working as written for the datase
210
 
211
  ---
212
 
213
  ## Two-paths agent design
214
 
215
  The repo ships **two independent paths** to a working SRE agent. They share the env contract but trade compute for capability differently.
@@ -223,22 +243,12 @@ ln -s "$PWD/skill" "$HOME/.claude/skills/sre-gym"
223
  bash demo/run_demo.sh # end-to-end demo
224
  ```
225
 
226
- 12 verified-runbook drafts ship in [`skill/verified-runbooks/`](skill/verified-runbooks/) — one per Triage template. The skill validates them by re-running the env after each solve.
227
-
228
  ### Path B — GRPO-trained adapter (one A100, ~2–3h on 7B)
229
 
230
  The training pipeline above. Path A is what an agent ships *today*; Path B is what raises the floor on the templates it sees over and over.
231
 
232
  ---
233
 
234
- ## The HF Space UI
235
-
236
- The Gradio app at https://huggingface.co/spaces/Madhav189/sre-env is mounted at `/` of the same uvicorn process that serves `/reset` + `/step`. Three tiers selectable as cards (Triage live HTTP, Strategy chained-episode runner, Operations graph simulator). Every run streams per-tick action, env response, reward delta, and the 5-component breakdown — `out=… valid=… fmt=… anti=… eff=…`.
237
-
238
- Provider auth is whatever the user pastes (HF token plus optional Anthropic / OpenAI / Together / Fireworks / Groq / DeepSeek key). Tokens live only on the request instance — never logged, never persisted, never echoed in error messages. CSS theme is GitHub-dark phosphor on a JetBrains Mono base; see [`app.py`](app.py) for the full styling block.
239
-
240
- ---
241
-
242
  ## Tier-aware Python API
243
 
244
  ```python
@@ -268,74 +278,29 @@ Old tier names (`Tier.BASIC`, `Tier.ADVANCED`, `Tier.MAX`) are preserved as Enum
268
 
269
  ```bash
270
  make test # green at HEAD
271
- ruff check . # configured; pre-existing F401 cleanups tracked separately
272
  openenv validate . # green
273
  ```
274
 
275
  The two CI invariants that keep the rubric calibrated:
276
 
277
  - `test_heuristic_ceiling_is_in_band` — naive heuristic in `[0.65, 0.80]` on every template.
278
- - `test_round2_baseline_resolves` — scripted-optimal `≥ 0.90` on the 6 round-2 templates.
279
-
280
- Tighten either side and the gradient signal collapses; loosen either side and a memorising model can game the rubric without learning causality. The band is the load-bearing engineering claim.
281
-
282
- ---
283
-
284
- ## Architecture
285
-
286
- ```
287
- ┌──────────────────────────────────────────────────────────────┐
288
- │ app.py (uvicorn app:app on port 7860) │
289
- │ ├─ Gradio terminal UI mounted at / │
290
- │ └─ FastAPI server (unified_incident_env.server.app) │
291
- │ ├─ /reset /step /state OpenEnv contract │
292
- │ ├─ /tasks /baseline /grader catalogue + scoring │
293
- │ ├─ /status /health ops probes │
294
- │ ├─ /metadata /schema OpenEnv metadata │
295
- │ ├─ /mcp JSON-RPC 2.0 dual-route │
296
- │ ├─ /docs /redoc /openapi.json Swagger / ReDoc │
297
- │ └─ /info /simple legacy markdown landing │
298
- │ │
299
- │ sre_gym/ │
300
- │ ├─ tier.py Tier enum + TierConfig │
301
- │ ├─ env.py SREGym factory (delegates per t.)│
302
- │ ├─ basic_runner.py wrap UnifiedIncidentEnvironment │
303
- │ ├─ strategy/runner.py chain Triage episodes + horizon │
304
- │ ├─ operations/runner.py Python state-machine over 22 nd. │
305
- │ ├─ ui/ providers, router, policies │
306
- │ ├─ local.py in-process CLI for Ollama models │
307
- │ └─ exceptions.py typed errors │
308
- │ │
309
- │ coliseum/ parallel-rollout pool server │
310
- │ ├─ server.py FastAPI lease pool │
311
- │ └─ client.py ArenaClient + create_arena_client│
312
- │ │
313
- │ notebooks/ │
314
- │ └─ 01_triage_train_grpo_qwen25_7b.ipynb SFT → GRPO pipe. │
315
- │ │
316
- │ skill/ Claude Code skill (Path A) │
317
- │ ├─ SKILL.md agent instructions │
318
- │ ├─ tools/ sre-gym HTTP client │
319
- │ └─ verified-runbooks/ 12 per-template runbooks │
320
- └──────────────────────────────────────────────────────────────┘
321
- ```
322
-
323
- Per-tier deep dives in [`docs/TRIAGE_TIER.md`](docs/TRIAGE_TIER.md) / [`docs/STRATEGY_TIER.md`](docs/STRATEGY_TIER.md) / [`docs/OPERATIONS_TIER.md`](docs/OPERATIONS_TIER.md). Reward design: [`docs/REWARD_DESIGN.md`](docs/REWARD_DESIGN.md). Operator guide: [`execution.md`](execution.md). Architectural narrative: [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md). Blog: [`BLOG.md`](BLOG.md).
324
 
325
  ---
326
 
327
  ## Materials
328
 
329
- - [`openenv.yaml`](openenv.yaml) — declares the three tiers, runnable kinds, scenario counts.
330
- - [`pyproject.toml`](pyproject.toml) — Python package, deps, entry points.
331
- - [`docs/`](docs/) — architecture, per-tier deep dives, reward design, scenario authoring guide, references.
332
- - [`docs/blog/`](docs/blog/) — the 6 blog assets (hero, topology, rubric donut, chaos timeline, two-paths, baselines bar).
333
- - [`skill/`](skill/) — Claude Code skill packaging (Path A).
334
- - [`coliseum/`](coliseum/) — parallel-rollout pool server.
335
- - [`demo/`](demo/) — `run_demo.sh` end-to-end demo, `pitch.md` narrative.
336
- - [`eval/`](eval/) — held-out split definition, results directory.
337
- - [`train/data/`](train/data/) — teacher trajectories (Claude Opus + Llama-3.3-70B + scripted baselines + 120-episode v2 corpus).
338
- - [`notebooks/`](notebooks/) — Triage SFT→GRPO training (`01_triage_train_grpo_qwen25_7b.ipynb`), eval comparison (`02_triage_eval_compare_all.ipynb`), Strategy + Operations walkthroughs.
339
 
340
  ---
341
 
 
1
  ---
2
+ title: SystemTruth
3
  emoji: 🚨
4
  colorFrom: red
5
  colorTo: yellow
 
9
  license: apache-2.0
10
  ---
11
 
12
+ # SystemTruth — a tier-escalating SRE training environment
13
 
14
  > **Hackathon submission — OpenEnv-class, India 2026**
15
  >
16
  > - 📖 **Blog:** [BLOG.md](BLOG.md)
17
+ > - 🚀 **Live HF Space:** https://huggingface.co/spaces/Madhav189/SystemTruth
18
+ > - 💻 **GitHub:** https://github.com/Madhav-GPT/SystemTruth
19
+ > - 🧪 **Training notebook (Colab):** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Re9pwkabEP4Cearc2hCMMqGdEjSSUjGu?usp=sharing) — same as [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb)
20
  > - 📊 **Eval results:** [`eval/results/`](eval/results/)
21
  > - 📜 **License:** Apache 2.0
22
 
23
  **Each tier escalates a different dimension. Triage escalates compute, Strategy escalates horizon, Operations escalates realism.** That single sentence is the load-bearing claim of the project.
24
 
25
  ---
26
 
27
+ ## What's in the box (the USP — read this first)
28
 
29
+ SystemTruth is **one runnable RL environment with three personas baked into it**. The same 11-action contract, the same 5-component reward rubric, the same termination shape, escalated along three orthogonal axes that map to the three real bottlenecks SRE-agent training loops actually hit.
30
 
31
+ ![SystemTruth architecture: three tiers under one shared 11-action interface + 5-component rubric](docs/blog/system_architecture.png)
32
 
33
+ ### One environment, three tiers, three different bottlenecks
 
 
34
 
35
+ | Tier | Bottleneck | Persona | What it teaches |
36
+ |---|---|---|---|
37
+ | **Triage** | **Compute** | ML student / Kaggle, $30 of HF credits | causal mapping under tight context — pre-digested observations, dense reward shaping, 8K context, 11-action space, 8–13 ticks per episode |
38
+ | **Strategy** | **Horizon** | Seed-stage startup, $300–500 budget | long-horizon planning across chained incidents — multi-incident chains with persistent state, unresolved alerts and pending deploys carry forward, 60–90 ticks |
39
+ | **Operations** | **Realism** | Enterprise SRE platform, 8×A100/H100 cluster | authentic tool use against irreversible actions — 22-node service graph, 11 chaos patterns pinned to real production post-mortems, 110–180+ actions per episode |
 
40
 
41
+ The escalation axis is the entire pitch. Most RL environments stratify by *difficulty* (more scenarios, longer episodes, harder rewards). SystemTruth stratifies by **the dimension that actually limits the training loop for that persona**:
42
 
43
+ - A junior on-call learning to triage faces a different problem (cognitive efficiency under tight context) than a senior SRE running a multi-incident postmortem (state tracking across long horizons), which faces a different problem from an enterprise platform team operating against an actively chaos-engineered cluster (irreversible actions, partial observability, real wall-clock).
44
+ - Their training signals, episode shapes, observation richness, and reward structures should not look the same.
45
+ - SystemTruth takes that observation seriously and stratifies its tiers along *the dimension that actually limits the persona's training loop*.
46
 
47
+ ### The shared 11-action contract
48
 
49
+ Every tier (Triage, Strategy, Operations) speaks the same eleven Pydantic-validated actions. **One contract, three escalation envelopes:**
50
 
51
  ```
52
  query_logs(service) query_metrics(service, metric)
 
57
  declare_resolved
58
  ```
59
 
60
+ A successful episode is `gather evidence → submit_hypothesis → rollback_deploy → restart_service → both run_check pass → declare_resolved`. Wrong rollback target, premature restart, or premature resolution all return negative reward and a typed `failure_type`. The contract refuses to be gamed.
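The ordering guards in the paragraph above can be sketched as a small state check. This is an illustration only, not the repository's implementation: the `EpisodeState` class, the `check_action` helper, and the `failure_type` strings are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EpisodeState:
    rollback_target: str          # the scenario's correct rollback target
    cause_removed: bool = False   # set by a correct rollback_deploy
    checks_passed: bool = False   # set once the run_check calls pass


def check_action(state: EpisodeState, action: str, service: str = "") -> Optional[str]:
    """Return a hypothetical failure_type label, or None if the action is allowed."""
    if action == "rollback_deploy":
        if service != state.rollback_target:
            return "wrong_rollback_target"
        state.cause_removed = True
    elif action == "restart_service" and not state.cause_removed:
        return "premature_restart"
    elif action == "declare_resolved" and not state.checks_passed:
        return "premature_resolution"
    return None
```

Keeping the guards as checks on episode state, rather than on the action sequence itself, means any ordering that skips evidence or verification fails the same way regardless of how the agent got there.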
61
 
62
+ ### The episode lifecycle, illustrated
63
 
64
+ The lifecycle below is the Triage tier in detail; Strategy chains N of them with horizon-decay, and Operations runs one of them inside a graph-mutation simulator. **The shape is shared across all three tiers**; the simulator under it is what changes.
65
 
66
+ ![SystemTruth episode lifecycle — Triage tier, same shape inherited by Strategy and Operations](docs/blog/episode_lifecycle.png)
67
 
68
+ Eleven numbered stages, each producing a measurable signal:
69
 
70
+ 1. **`reset(scenario_id)`**: the env emits the initial observation, consisting of tick counter, workflow stage, incident summary, active alerts, noise alerts (decoys), service health (cpu/mem/err/latency), user impact, SLO burn rate, checks, allowed actions.
71
+ 2. **Evidence gathering loop** — agent calls `query_logs / query_metrics / query_dependencies / query_deploys`. After every step the env computes a per-tick **shaped reward** as a potential difference (`Δ critical_service_health × 0.55 + Δ (1 − user_impact) × 0.20 + Δ (1 − slo_burn_rate) × 0.15 + containment_applied × 0.10`) minus `step_cost`, plus `bonus`, minus `penalty`.
72
+ 3. **`submit_hypothesis(root_cause, affected_services, confidence, recommended_next_action)`** — the world-modelling primitive. Confidence is a `float ∈ [0,1]` the agent must commit to.
73
+ 4. **Hypothesis correctness check** — if the root cause matches truth, the agent gets an in-episode bonus up to ~0.12 (idempotent — second identical hypothesis scores 0). If wrong, the agent loops back to investigation with a new observation.
74
+ 5. **`rollback_deploy(service)`** — the irreversible action. Wrong target = `unsafe_action_penalty` (0.08 medium / 0.12 hard). Correct target sets `cause_removed = True` and unblocks restart.
75
+ 6. **`restart_service(service)`** — only valid if scenario requires it. Guard: if cause not removed, premature-restart penalty fires and state re-inherits the bad config.
76
+ 7. **`run_check("end_to_end" | "database_recovery")`** — verification gate. If checks fail, agent loops back to investigation.
77
+ 8. **`declare_resolved`** — terminal action. Guard: if checks not passed, `premature_resolution_penalty` (0.20 / 0.30) fires.
78
+ 9. **Episode terminates** — terminal state emitted.
79
+ 10. **Compute composite from terminal state**: the 5-component rubric below evaluates outcome / action_validity / format / anticheat / efficiency; the component weights sum to 1.0 and the composite is clamped to `[0.01, 0.99]`.
80
+ 11. **Reference scores anchor the rubric** — random `0.417` (0/36 resolved), naive heuristic `0.749` (0/12 resolved), scripted-optimal `0.938` (12/12 resolved). The 0.20 gap from `0.80 → 1.00` is what GRPO trains into.
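The per-tick shaped reward from stage 2 can be sketched as below. This is a minimal illustration under stated assumptions, not the environment's code: the `Snapshot` class and the `potential` / `shaped_reward` functions are hypothetical names, all three health signals are assumed to be normalised to `[0, 1]`, the containment term is folded into the potential so the whole expression is a difference of potentials, and the `step_cost`, `bonus`, and `penalty` arguments are placeholders rather than the environment's actual constants.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    critical_service_health: float  # assumed in [0, 1], higher is healthier
    user_impact: float              # assumed in [0, 1], higher is worse
    slo_burn_rate: float            # assumed in [0, 1], higher is worse
    containment_applied: bool


def potential(s: Snapshot) -> float:
    """Incident-health potential using the weights quoted in stage 2."""
    return (
        0.55 * s.critical_service_health
        + 0.20 * (1.0 - s.user_impact)
        + 0.15 * (1.0 - s.slo_burn_rate)
        + 0.10 * float(s.containment_applied)
    )


def shaped_reward(before: Snapshot, after: Snapshot,
                  step_cost: float = 0.0, bonus: float = 0.0,
                  penalty: float = 0.0) -> float:
    """Potential difference for one tick, minus step cost, plus bonus, minus penalty."""
    return potential(after) - potential(before) - step_cost + bonus - penalty
```

Because the reward is a difference of potentials, an action that leaves the incident state unchanged earns only the negative step cost, which is what makes idle querying unattractive.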
81
+
82
+ **Cross-tier extension:**
83
+ - **Strategy** chains N Triage episodes, applies a `horizon_decay_factor × mean(per-phase composite)` to the final reward.
84
+ - **Operations** runs the same lifecycle inside a graph-mutation simulator over a 22-node service topology, same rubric, same horizon-decay weighting.
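The Strategy aggregation named above, `horizon_decay_factor × mean(per-phase composite)`, can be written out directly. The sketch below assumes the decay factor is supplied by the caller, since the diff does not show how it is derived; `strategy_reward` is a hypothetical function name.

```python
from statistics import fmean
from typing import Sequence


def strategy_reward(phase_composites: Sequence[float],
                    horizon_decay_factor: float) -> float:
    """Scale the mean per-phase composite by a caller-supplied decay factor."""
    if not phase_composites:
        raise ValueError("at least one phase composite is required")
    return horizon_decay_factor * fmean(phase_composites)
```

For example, three chained Triage phases scoring 0.9, 0.7, and 0.8 have a mean composite of 0.8 before the decay factor is applied.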
85
+
86
+ ### Reward rubric — the engineering crown jewel
87
 
88
  ```
89
  final_reward = 0.45·outcome
 
101
  efficiency exp(-current_tick / optimal_ticks_for_template)
102
  ```
103
 
104
+ Each component defends against a specific cheat:
105
 
106
  | Cheat strategy | Blocked by |
107
  |---|---|
 
126
  -1.0 confident-wrong
127
  ```
128
 
129
+ The `confidence ∈ [0,1]` field is part of the structured `HypothesisPayload` Pydantic model the agent emits; the calibration sub-term reads it directly. **A model that bluffs high confidence on a wrong root cause is worse than one that hedges.** That's the world-modelling primitive — the env grades the agent's belief, not just its prediction.
130
+
131
+ Two CI invariants pin the rubric in place on every commit:
132
+ - **Heuristic ceiling `[0.65, 0.80]`** — `test_heuristic_ceiling_is_in_band` enforces this band on every template. The 0.20 gap from 0.80 → 1.00 is the GRPO training target.
133
+ - **Scripted-expert floor `≥0.90`** — `test_round2_baseline_resolves` enforces ≥0.90 on the round-2 templates.
134
+
135
+ Tighten either side and the gradient signal collapses; loosen either side and a memorising model can game the rubric. The band is the load-bearing engineering claim.
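The two invariants amount to range checks on baseline scores. A rough sketch of that logic, assuming a hypothetical helper `calibration_violations` and plain dicts of per-template scores; the real tests named above may be structured differently, and the scores used here are illustrative.

```python
HEURISTIC_BAND = (0.65, 0.80)   # naive heuristic must land inside this band
SCRIPTED_EXPERT_FLOOR = 0.90    # scripted-optimal must score at least this


def calibration_violations(heuristic: dict[str, float],
                           expert: dict[str, float]) -> list[str]:
    """Return readable violations; an empty list means the rubric is in band."""
    lo, hi = HEURISTIC_BAND
    problems = []
    for template, score in heuristic.items():
        if not lo <= score <= hi:
            problems.append(f"{template}: heuristic {score:.3f} outside [{lo}, {hi}]")
    for template, score in expert.items():
        if score < SCRIPTED_EXPERT_FLOOR:
            problems.append(f"{template}: expert {score:.3f} below {SCRIPTED_EXPERT_FLOOR}")
    return problems


# Illustrative scores for one template, not measured results.
ok = calibration_violations({"worker_deploy_cascade": 0.70},
                            {"worker_deploy_cascade": 0.94})
bad = calibration_violations({"worker_deploy_cascade": 0.85},
                             {"worker_deploy_cascade": 0.88})
```

Returning a list of violations rather than failing on the first one makes it easier to see whether a rubric change shifted one template or all of them.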
136
 
137
  ---
138
 
139
  ## Coliseum — parallel-rollout pool server
140
 
141
+ [`coliseum/`](coliseum/) wraps the Triage env in a lease-based HTTP contract so a GRPO trainer's parallel-rollout side can drive the env without holding an in-process Python instance per worker:
142
 
143
  ```
144
  allocate(task_key) -> {ok: true, lease_id}
 
164
 
165
  ## Training & datasets — the honest weak point
166
 
167
+ The environment is the project. Training scripts orbit around it. We ran a real end-to-end Triage run on the day of the deadline — pipeline works, results are below the heuristic plateau, the gap is real, and we're saying so.
168
 
169
  ### What we ran
170
 
171
+ Pipeline lives in [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb) — also openable in Colab via the badge: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Re9pwkabEP4Cearc2hCMMqGdEjSSUjGu?usp=sharing). Target A100 80GB, ~2-3h end-to-end:
172
 
173
  1. **SFT cold-start** — Unsloth + Qwen2.5-7B-Instruct (4-bit) + LoRA r=32, 50 steps × batch 16 on a 999-example diverse trajectory corpus. Eval perplexity 1.755 (healthy `[1.5, 3.0]` band). Saved to `outputs/qwen25_7b_sft_final/`.
174
  2. **GRPO online** — TRL's `GRPOTrainer`, 40 steps × K=2 rollouts. Reward = composite + scenario-aware first-action bonus. Saved to `outputs/qwen25_7b_grpo_final/`.
 
176
 
177
  ### What it produced
178
 
179
+ ![SystemTruth Triage holdout eval — Qwen2.5-7B, 12 scenarios × 3 seeds](eval/results/qwen25_7b_comparison_hero.png)
180
+
181
  | policy | mean | median | p25 | p75 | resolved_rate |
182
  |---|---|---|---|---|---|
183
  | random | 0.342 | 0.378 | 0.340 | 0.380 | 0/36 |
 
186
  | heuristic (queries + correct hypothesis, no remediation) | 0.704 | 0.705 | 0.703 | 0.705 | 0/36 |
187
  | scripted-optimal | 0.938 | 0.939 | 0.937 | 0.940 | 36/36 |
188
 
189
+ ![Per-template mean score by policy](eval/results/qwen25_7b_comparison_per_template.png)
190
+
191
  Honest reading:
192
  - SFT lifted the mean score about 11% (relative) above random (0.342 → 0.379). Format-learning worked: the trained model produces 100% schema-valid `action_type` JSON.
193
  - GRPO did not move the mean further on K=2 / 40 steps / 7B. The advantage signal exists but the budget is too small to overcome the heuristic plateau at 0.704.
 
201
 
202
  ---
203
 
204
+ ## Quickstart
205
+
206
+ ### 5-minute local demo (no API keys, no server, no GPU)
207
+
208
+ ```bash
209
+ pip install -e .
210
+ ollama pull llama3.2
211
+ python -m sre_gym.local triage worker_deploy_cascade
212
+ ```
213
+
214
+ The CLI drives `UnifiedIncidentEnvironment` directly and prints per-tick reward, the 5-component score breakdown, and a final summary.
215
+
216
+ ### Live HF Space (Triage tier, hosted)
217
+
218
+ Open https://huggingface.co/spaces/Madhav189/SystemTruth. Pick a tier, paste an HF token, click **▶ run eval**. Each tick streams the action, env response, reward delta, and the 5-component breakdown.
219
+
220
+ ### Local server + Gradio UI
221
+
222
+ ```bash
223
+ make install
224
+ make dev # FastAPI + Gradio on :7860
225
+ python -m sre_gym.strategy run cascading_release_train
226
+ python -m sre_gym.operations run ecommerce_vibecoded_saas --chaos rls_silent_leak
227
+ ```
228
+
229
+ The FastAPI server speaks the OpenEnv contract (`/reset /step /state /tasks /baseline /grader /status /health /metadata /schema`) plus an MCP JSON-RPC route at `/mcp`.
230
+
231
+ ---
232
+
233
  ## Two-paths agent design
234
 
235
  The repo ships **two independent paths** to a working SRE agent. They share the env contract but trade compute for capability differently.
 
243
  bash demo/run_demo.sh # end-to-end demo
244
  ```
245
 
 
 
246
  ### Path B — GRPO-trained adapter (one A100, ~2–3h on 7B)
247
 
248
  The training pipeline above. Path A is what an agent ships *today*; Path B is what raises the floor on the templates it sees over and over.
249
 
250
  ---
251
 
252
  ## Tier-aware Python API
253
 
254
  ```python
 
278
 
279
  ```bash
280
  make test # green at HEAD
281
+ ruff check .
282
  openenv validate . # green
283
  ```
284
 
285
  The two CI invariants that keep the rubric calibrated:
286
 
287
  - `test_heuristic_ceiling_is_in_band` — naive heuristic in `[0.65, 0.80]` on every template.
288
+ - `test_round2_baseline_resolves` — scripted-optimal `≥ 0.90` on the round-2 templates.
289
 
290
  ---
291
 
292
  ## Materials
293
 
294
+ - [`BLOG.md`](BLOG.md) — the hackathon blog (with all 6 assets in `docs/blog/`)
295
+ - [`openenv.yaml`](openenv.yaml) — declares the three tiers, runnable kinds, scenario counts
296
+ - [`docs/`](docs/) — architecture, per-tier deep dives, reward design, scenario authoring
297
+ - [`docs/blog/`](docs/blog/) — visuals: lifecycle, architecture, hero, topology, rubric donut, chaos timeline, two-paths, baselines bar
298
+ - [`skill/`](skill/) — Claude Code skill packaging (Path A)
299
+ - [`coliseum/`](coliseum/) — parallel-rollout pool server
300
+ - [`demo/`](demo/) — `run_demo.sh` end-to-end demo, `pitch.md` narrative
301
+ - [`eval/`](eval/) — held-out split definition, results directory with the latest eval CSV + plots
302
+ - [`train/data/`](train/data/) — teacher trajectories (Claude Opus + Llama-3.3-70B + scripted baselines + 120-episode v2 corpus)
303
+ - [`notebooks/`](notebooks/) — Triage SFT→GRPO training (`01_triage_train_grpo_qwen25_7b.ipynb`), eval comparison (`02_triage_eval_compare_all.ipynb`), Strategy + Operations walkthroughs
304
 
305
  ---
306
 
app.py CHANGED
@@ -84,9 +84,9 @@ TIER_DEFAULT_MODEL: dict[str, str] = {
84
 
85
 
86
  TIER_DESCRIPTION: dict[str, str] = {
87
- "basic": "escalates compute · 12 templates × 5 procgen variants · single bounded incident",
88
- "advanced": "escalates horizon · chained incidents · persistent state across episodes",
89
- "max": "escalates realism · 22-service ecommerce sim · 11 chaos patterns",
90
  }
91
 
92
 
@@ -334,36 +334,41 @@ gradio-app::before {
334
  }
335
  .sg-panel-label::before { content: '▸'; color: var(--brand); }
336
 
337
- /* ─── INPUTS — token / model / provider key ──────────────────────────── */
338
  .sg-panel-col .form, .sg-panel-col .block { background: transparent !important; }
339
  .sg-panel-col input,
340
  .sg-panel-col textarea,
341
  .sg-panel-col select {
342
- background: var(--bg-input) !important;
343
- border: 1px solid var(--border) !important;
344
  color: var(--text-primary) !important;
345
- font-family: var(--mono) !important; font-size: 12px !important;
346
- padding: 8px 10px !important; border-radius: 0 !important;
 
 
347
  box-shadow: none !important;
 
348
  }
349
  .sg-panel-col input:focus,
350
  .sg-panel-col textarea:focus,
351
  .sg-panel-col select:focus {
352
- border-color: var(--action) !important; outline: none !important;
 
 
353
  }
354
  .sg-panel-col input::placeholder, .sg-panel-col textarea::placeholder {
355
- color: var(--text-faint) !important;
356
  }
357
  /* Field labels — Gradio renders <label><span>LABEL</span> ...</label> */
358
  .sg-panel-col label > span:first-child,
359
  .sg-panel-col .label-wrap > span,
360
  .sg-panel-col .label-wrap span {
361
  color: var(--text-secondary) !important;
362
- font-size: 10px !important;
363
- letter-spacing: 0.12em !important;
364
  text-transform: uppercase !important;
365
- font-weight: 500 !important;
366
- margin-bottom: 5px !important;
367
  }
368
  .sg-panel-col label { background: transparent !important; }
369
 
@@ -371,16 +376,17 @@ gradio-app::before {
371
  .sg-panel-col .dropdown,
372
  .sg-panel-col .wrap-inner,
373
  .sg-panel-col .options {
374
- background: var(--bg-input) !important;
375
- border: 1px solid var(--border) !important;
376
  color: var(--text-primary) !important;
 
377
  }
378
  .sg-panel-col .dropdown ul li:hover,
379
  .sg-panel-col .options li:hover {
380
  background: var(--bg-input-hover) !important;
381
  }
382
 
383
- /* ─── TIER CARDS — 3 styled buttons ──────────────────────────────────── */
384
  .sg-tier-list, .sg-tier-list .form, .sg-tier-list .gap {
385
  display: flex !important; flex-direction: column !important; gap: 8px !important;
386
  background: transparent !important;
@@ -389,14 +395,15 @@ gradio-app::before {
389
  .sg-tier-card button {
390
  display: block !important;
391
  padding: 14px 16px !important;
392
- background: var(--bg-input) !important;
393
- border: 1px solid var(--border) !important;
394
  color: var(--text-secondary) !important;
395
  font-family: var(--mono) !important; font-size: 11.5px !important;
396
  font-weight: 400 !important;
397
  text-align: left !important; cursor: pointer !important;
398
  width: 100% !important; min-height: auto !important;
399
- border-radius: 0 !important; box-shadow: none !important;
 
400
  transition: all 0.15s ease !important;
401
  white-space: pre-line !important;
402
  line-height: 1.55 !important;
@@ -412,16 +419,16 @@ gradio-app::before {
412
  line-height: 2 !important;
413
  }
414
  .sg-tier-card button:hover {
415
- background: var(--bg-input-hover) !important;
416
- border-color: var(--border-strong) !important;
417
  }
418
  .sg-tier-card-selected button {
419
- background: rgba(88, 166, 255, 0.06) !important;
420
- border-color: var(--action) !important;
421
- box-shadow: inset 2px 0 0 var(--action) !important;
422
  }
423
  .sg-tier-card-selected button::first-line {
424
- color: var(--action) !important;
425
  }
426
 
427
  /* ─── TERMINAL ────────────────────────────────────────────────────────── */
@@ -455,13 +462,15 @@ gradio-app::before {
455
  .sg-chrome-status .em { color: var(--text-primary); font-weight: 500; }
456
  .sg-chrome-meta { color: var(--text-dim); font-size: 11px; }
457
  .sg-terminal-body {
458
- padding: 18px 20px 22px;
459
- font-size: 12.5px; line-height: 1.7;
460
  white-space: pre; overflow-x: auto;
461
  background: var(--bg-panel);
462
  background-image: linear-gradient(transparent 50%, rgba(255, 255, 255, 0.012) 50%);
463
  background-size: 100% 3px;
464
- min-height: 480px; max-height: 64vh; overflow-y: auto;
 
 
465
  color: var(--text-primary);
466
  }
467
  .sg-terminal-body .ts { color: var(--timestamp); }
@@ -480,50 +489,60 @@ gradio-app::before {
480
  }
481
  @keyframes sg-blink { 50% { opacity: 0; } }
482
 
483
- /* ─── CONTROLS ROW ────────────────────────────────────────────────────── */
 
 
484
  .sg-controls-row {
485
- padding: 14px 16px !important;
486
  background: var(--bg-panel) !important;
487
  border: 1px solid var(--border) !important;
488
  margin-bottom: 16px !important;
489
- align-items: center !important; gap: 24px !important;
 
 
 
 
 
 
 
 
490
  }
491
- .sg-btn-group { gap: 8px !important; flex-wrap: nowrap !important; }
492
  .sg-btn-primary, .sg-btn-secondary {
493
  flex: 0 0 auto !important; min-width: auto !important;
494
  }
495
  .sg-btn-primary button, .sg-btn-secondary button {
496
  font-family: var(--mono) !important; font-size: 12px !important;
497
- font-weight: 600 !important; letter-spacing: 0.06em !important;
498
  text-transform: uppercase !important;
499
- padding: 9px 16px !important; border-radius: 0 !important;
 
500
  box-shadow: none !important; min-height: auto !important;
501
  cursor: pointer !important; transition: all 0.15s ease !important;
502
  }
503
  .sg-btn-primary button {
504
- background: rgba(63, 185, 80, 0.12) !important;
505
- border: 1px solid var(--success) !important;
506
- color: var(--success) !important;
507
  }
508
- .sg-btn-primary button:hover { background: rgba(63, 185, 80, 0.20) !important; }
509
  .sg-btn-secondary button {
510
- background: var(--bg-input) !important;
511
  border: 1px solid var(--border-strong) !important;
512
  color: var(--text-primary) !important;
513
  }
514
  .sg-btn-secondary button:hover {
515
- background: var(--bg-input-hover) !important;
516
  border-color: var(--border-focus) !important;
517
  }
518
 
519
- /* ─── METRICS BAR ─────────────────────────────────────────────────────── */
520
- /* The HTML sits inside Gradio's html-container force flex on both. */
521
  .sg-metrics-host > div, .sg-metrics-host .prose { background: transparent !important; }
522
  .sg-metrics {
523
  display: flex !important; align-items: center !important;
524
  gap: 24px !important; flex-wrap: wrap !important;
525
  color: var(--text-secondary) !important; font-size: 11px !important;
526
- padding: 4px 0 !important;
527
  }
528
  .sg-metric {
529
  display: flex !important; gap: 6px !important; align-items: center !important;
@@ -536,7 +555,7 @@ gradio-app::before {
536
  color: var(--text-primary) !important; font-weight: 600 !important;
537
  }
538
  .sg-metric .value.r { color: var(--reward) !important; }
539
- .sg-metric .value.s { color: var(--success) !important; }
540
  .sg-rubric {
541
  display: flex !important; align-items: center !important; gap: 14px !important;
542
  padding-left: 18px !important; margin-left: 4px !important;
@@ -558,7 +577,7 @@ gradio-app::before {
558
  height: 3px !important; background: var(--bg-input) !important;
559
  overflow: hidden !important; margin-top: 2px !important;
560
  }
561
- .sg-rubric-bar > div { height: 100% !important; background: var(--success) !important; }
562
 
563
  /* ─── TIER DESCRIPTION (under the cards) ──────────────────────────────── */
564
  .sg-tier-desc, .sg-tier-desc * {
@@ -603,17 +622,18 @@ def _header_html() -> str:
603
  return f"""
604
  <header class="sg-header">
605
  <div class="sg-brand-block">
606
- <div class="sg-brand-mark">SRE-GYM<span>//</span></div>
607
  <div class="sg-brand-tagline">
608
  <em>tier-escalating SRE RL env</em> &nbsp;·&nbsp;
609
- RLVE &nbsp;·&nbsp; {THEME_TAGLINE}
610
  </div>
611
  </div>
612
  <nav class="sg-nav">
613
  <span class="sg-status-dot">env online</span>
614
  <a href="/docs" target="_blank" rel="noopener">api docs</a>
615
  <a href="/mcp/tools" target="_blank" rel="noopener">mcp tools</a>
616
- <a href="/info" target="_blank" rel="noopener">legacy</a>
 
617
  </nav>
618
  </header>
619
  """
@@ -651,11 +671,11 @@ FOOTER_HTML = """
651
  <div>
652
  built for the openenv hackathon · india apr '26
653
  &nbsp;·&nbsp;
654
- <a href="https://github.com/Madhav-GPT/sre-env" target="_blank">github</a>
655
  &nbsp;·&nbsp;
656
- <a href="https://huggingface.co/spaces/Madhav189/sre-env" target="_blank">hf space</a>
657
  &nbsp;·&nbsp;
658
- <a href="https://github.com/Madhav-GPT/sre-env/blob/main/BLOG.md" target="_blank">blog</a>
659
  </div>
660
  <div>multi-rubric reward · RLVE procgen · MCP dual-route</div>
661
  </footer>
@@ -1231,28 +1251,29 @@ def build_app() -> gr.Blocks:
1231
  # ── terminal pane ──────────────────────────────────────────
1232
  terminal = gr.HTML(_initial_terminal_html(), elem_id="sg-terminal-host")
1233
 
1234
- # ── controls + metrics row ────────────────────────────────
1235
- with gr.Row(elem_classes=["sg-controls-row"]):
1236
- with gr.Column(scale=0, min_width=280):
1237
- with gr.Row(elem_classes=["sg-btn-group"]):
1238
- run_btn = gr.Button(
1239
- "▶ RUN EVAL",
1240
- variant="primary",
1241
- elem_classes=["sg-btn-primary"],
1242
- )
1243
- stop_btn = gr.Button(
1244
- "■ STOP",
1245
- elem_classes=["sg-btn-secondary"],
1246
- )
1247
- reset_btn = gr.Button(
1248
- "↻ RESET",
1249
- elem_classes=["sg-btn-secondary"],
1250
- )
1251
- with gr.Column(scale=1):
1252
- metrics = gr.HTML(
1253
- _metric_bar_html(),
1254
- elem_classes=["sg-metrics-host"],
1255
  )
1256
 
1257
  gr.HTML(FOOTER_HTML)
1258
 
 
84
 
85
 
86
  TIER_DESCRIPTION: dict[str, str] = {
87
+ "basic": "Triage tier · escalates compute · 12 templates × 5 procgen variants · single bounded incident",
88
+ "advanced": "Strategy tier · escalates horizon · chained incidents · persistent state across episodes",
89
+ "max": "Operations tier · escalates realism · 22-service ecommerce sim · 11 chaos patterns",
90
  }
91
 
92
 
 
334
  }
335
  .sg-panel-label::before { content: '▸'; color: var(--brand); }
336
 
337
+ /* ─── INPUTS — token / model / provider key (LIGHTER + BIGGER) ───────── */
338
  .sg-panel-col .form, .sg-panel-col .block { background: transparent !important; }
339
  .sg-panel-col input,
340
  .sg-panel-col textarea,
341
  .sg-panel-col select {
342
+ background: #1f2630 !important; /* lighter than the panel */
343
+ border: 1px solid var(--border-strong) !important;
344
  color: var(--text-primary) !important;
345
+ font-family: var(--mono) !important;
346
+ font-size: 13px !important; /* was 12 */
347
+ padding: 12px 14px !important; /* was 8/10 */
348
+ border-radius: 4px !important; /* was 0 — softer, more usable */
349
  box-shadow: none !important;
350
+ min-height: 42px !important; /* taller for usability */
351
  }
352
  .sg-panel-col input:focus,
353
  .sg-panel-col textarea:focus,
354
  .sg-panel-col select:focus {
355
+ border-color: var(--brand) !important; /* phosphor accent on focus */
356
+ outline: none !important;
357
+ box-shadow: 0 0 0 1px rgba(126, 231, 135, 0.25) !important;
358
  }
359
  .sg-panel-col input::placeholder, .sg-panel-col textarea::placeholder {
360
+ color: var(--text-dim) !important; /* was --text-faint */
361
  }
362
  /* Field labels — Gradio renders <label><span>LABEL</span> ...</label> */
363
  .sg-panel-col label > span:first-child,
364
  .sg-panel-col .label-wrap > span,
365
  .sg-panel-col .label-wrap span {
366
  color: var(--text-secondary) !important;
367
+ font-size: 11px !important;
368
+ letter-spacing: 0.14em !important;
369
  text-transform: uppercase !important;
370
+ font-weight: 600 !important;
371
+ margin-bottom: 6px !important;
372
  }
373
  .sg-panel-col label { background: transparent !important; }
374
 
 
376
  .sg-panel-col .dropdown,
377
  .sg-panel-col .wrap-inner,
378
  .sg-panel-col .options {
379
+ background: #1f2630 !important;
380
+ border: 1px solid var(--border-strong) !important;
381
  color: var(--text-primary) !important;
382
+ border-radius: 4px !important;
383
  }
384
  .sg-panel-col .dropdown ul li:hover,
385
  .sg-panel-col .options li:hover {
386
  background: var(--bg-input-hover) !important;
387
  }
388
 
389
+ /* ─── TIER CARDS — 3 styled buttons (theme-cohesive phosphor accent) ─── */
390
  .sg-tier-list, .sg-tier-list .form, .sg-tier-list .gap {
391
  display: flex !important; flex-direction: column !important; gap: 8px !important;
392
  background: transparent !important;
 
395
  .sg-tier-card button {
396
  display: block !important;
397
  padding: 14px 16px !important;
398
+ background: #1f2630 !important; /* match input bg */
399
+ border: 1px solid var(--border-strong) !important;
400
  color: var(--text-secondary) !important;
401
  font-family: var(--mono) !important; font-size: 11.5px !important;
402
  font-weight: 400 !important;
403
  text-align: left !important; cursor: pointer !important;
404
  width: 100% !important; min-height: auto !important;
405
+ border-radius: 4px !important;
406
+ box-shadow: none !important;
407
  transition: all 0.15s ease !important;
408
  white-space: pre-line !important;
409
  line-height: 1.55 !important;
 
419
  line-height: 2 !important;
420
  }
421
  .sg-tier-card button:hover {
422
+ background: #252d38 !important;
423
+ border-color: var(--border-focus) !important;
424
  }
425
  .sg-tier-card-selected button {
426
+ background: rgba(126, 231, 135, 0.06) !important; /* phosphor wash */
427
+ border-color: var(--brand) !important;
428
+ box-shadow: inset 3px 0 0 var(--brand) !important;
429
  }
430
  .sg-tier-card-selected button::first-line {
431
+ color: var(--brand) !important; /* matches header brand */
432
  }
433
 
434
  /* ─── TERMINAL ────────────────────────────────────────────────────────── */
 
462
  .sg-chrome-status .em { color: var(--text-primary); font-weight: 500; }
463
  .sg-chrome-meta { color: var(--text-dim); font-size: 11px; }
464
  .sg-terminal-body {
465
+ padding: 16px 20px 18px;
466
+ font-size: 12.5px; line-height: 1.65;
467
  white-space: pre; overflow-x: auto;
468
  background: var(--bg-panel);
469
  background-image: linear-gradient(transparent 50%, rgba(255, 255, 255, 0.012) 50%);
470
  background-size: 100% 3px;
471
+ min-height: 280px; /* was 480; keeps the terminal visible above the fold */
472
+ max-height: 56vh; /* still scrolls if a long run */
473
+ overflow-y: auto;
474
  color: var(--text-primary);
475
  }
476
  .sg-terminal-body .ts { color: var(--timestamp); }
 
489
  }
490
  @keyframes sg-blink { 50% { opacity: 0; } }
491
 
492
+ /* ─── CONTROLS ROW — stacks vertically: buttons on top, metrics below ── */
493
+ /* Now a gr.Column wrapped with this class — Gradio gives us flex-direction:
494
+ column for free, but we still pin it for browsers that style differently. */
495
  .sg-controls-row {
496
+ padding: 16px 18px !important;
497
  background: var(--bg-panel) !important;
498
  border: 1px solid var(--border) !important;
499
  margin-bottom: 16px !important;
500
+ display: flex !important;
501
+ flex-direction: column !important;
502
+ gap: 14px !important;
503
+ align-items: stretch !important;
504
+ }
505
+ .sg-btn-group {
506
+ gap: 10px !important;
507
+ flex-wrap: wrap !important; /* on narrow screens buttons wrap rather than overflow */
508
+ justify-content: flex-start !important;
509
  }
 
510
  .sg-btn-primary, .sg-btn-secondary {
511
  flex: 0 0 auto !important; min-width: auto !important;
512
  }
513
  .sg-btn-primary button, .sg-btn-secondary button {
514
  font-family: var(--mono) !important; font-size: 12px !important;
515
+ font-weight: 700 !important; letter-spacing: 0.08em !important;
516
  text-transform: uppercase !important;
517
+ padding: 11px 22px !important; /* a touch bigger so it stands alone on its row */
518
+ border-radius: 4px !important;
519
  box-shadow: none !important; min-height: auto !important;
520
  cursor: pointer !important; transition: all 0.15s ease !important;
521
  }
522
  .sg-btn-primary button {
523
+ background: rgba(126, 231, 135, 0.10) !important;
524
+ border: 1px solid var(--brand) !important;
525
+ color: var(--brand) !important;
526
  }
527
+ .sg-btn-primary button:hover { background: rgba(126, 231, 135, 0.18) !important; }
528
  .sg-btn-secondary button {
529
+ background: #1f2630 !important;
530
  border: 1px solid var(--border-strong) !important;
531
  color: var(--text-primary) !important;
532
  }
533
  .sg-btn-secondary button:hover {
534
+ background: #252d38 !important;
535
  border-color: var(--border-focus) !important;
536
  }
537
 
538
+ /* ─── METRICS BAR (now sits under the run buttons) ───────────────────── */
539
+ .sg-metrics-host { padding-top: 8px !important; border-top: 1px solid var(--border) !important; }
540
  .sg-metrics-host > div, .sg-metrics-host .prose { background: transparent !important; }
541
  .sg-metrics {
542
  display: flex !important; align-items: center !important;
543
  gap: 24px !important; flex-wrap: wrap !important;
544
  color: var(--text-secondary) !important; font-size: 11px !important;
545
+ padding: 6px 0 0 !important;
546
  }
547
  .sg-metric {
548
  display: flex !important; gap: 6px !important; align-items: center !important;
 
555
  color: var(--text-primary) !important; font-weight: 600 !important;
556
  }
557
  .sg-metric .value.r { color: var(--reward) !important; }
558
+ .sg-metric .value.s { color: var(--brand) !important; } /* phosphor — theme cohesion */
559
  .sg-rubric {
560
  display: flex !important; align-items: center !important; gap: 14px !important;
561
  padding-left: 18px !important; margin-left: 4px !important;
 
577
  height: 3px !important; background: var(--bg-input) !important;
578
  overflow: hidden !important; margin-top: 2px !important;
579
  }
580
+ .sg-rubric-bar > div { height: 100% !important; background: var(--brand) !important; }
581
 
582
  /* ─── TIER DESCRIPTION (under the cards) ──────────────────────────────── */
583
  .sg-tier-desc, .sg-tier-desc * {
 
622
  return f"""
623
  <header class="sg-header">
624
  <div class="sg-brand-block">
625
+ <div class="sg-brand-mark">SystemTruth<span>//</span></div>
626
  <div class="sg-brand-tagline">
627
  <em>tier-escalating SRE RL env</em> &nbsp;·&nbsp;
628
+ Triage / Strategy / Operations &nbsp;·&nbsp; {THEME_TAGLINE}
629
  </div>
630
  </div>
631
  <nav class="sg-nav">
632
  <span class="sg-status-dot">env online</span>
633
  <a href="/docs" target="_blank" rel="noopener">api docs</a>
634
  <a href="/mcp/tools" target="_blank" rel="noopener">mcp tools</a>
635
+ <a href="https://github.com/Madhav-GPT/SystemTruth" target="_blank" rel="noopener">github</a>
636
+ <a href="https://github.com/Madhav-GPT/SystemTruth/blob/main/BLOG.md" target="_blank" rel="noopener">blog</a>
637
  </nav>
638
  </header>
639
  """
 
671
  <div>
672
  built for the openenv hackathon · india apr '26
673
  &nbsp;·&nbsp;
674
+ <a href="https://github.com/Madhav-GPT/SystemTruth" target="_blank">github</a>
675
  &nbsp;·&nbsp;
676
+ <a href="https://huggingface.co/spaces/Madhav189/SystemTruth" target="_blank">hf space</a>
677
  &nbsp;·&nbsp;
678
+ <a href="https://github.com/Madhav-GPT/SystemTruth/blob/main/BLOG.md" target="_blank">blog</a>
679
  </div>
680
  <div>multi-rubric reward · RLVE procgen · MCP dual-route</div>
681
  </footer>
 
1251
  # ── terminal pane ──────────────────────────────────────────
1252
  terminal = gr.HTML(_initial_terminal_html(), elem_id="sg-terminal-host")
1253
 
1254
+ # ── controls + metrics stacked vertically (buttons on top, ──
1255
+ # metrics below). Using a single Column with two children means
1256
+ # the metrics bar gets the full width on its own row instead of
1257
+ # fighting the buttons for horizontal space.
1258
+ with gr.Column(elem_classes=["sg-controls-row"]):
1259
+ with gr.Row(elem_classes=["sg-btn-group"]):
1260
+ run_btn = gr.Button(
1261
+ "▶ RUN EVAL",
1262
+ variant="primary",
1263
+ elem_classes=["sg-btn-primary"],
1264
+ )
1265
+ stop_btn = gr.Button(
1266
+ "■ STOP",
1267
+ elem_classes=["sg-btn-secondary"],
1268
  )
1269
+ reset_btn = gr.Button(
1270
+ "↻ RESET",
1271
+ elem_classes=["sg-btn-secondary"],
1272
+ )
1273
+ metrics = gr.HTML(
1274
+ _metric_bar_html(),
1275
+ elem_classes=["sg-metrics-host"],
1276
+ )
1277
 
1278
  gr.HTML(FOOTER_HTML)
1279
 
docs/TRIAGE_TIER.md CHANGED
@@ -179,7 +179,7 @@ End-to-end: ~2-3h on a single A100 80GB, ~$5–8 of HF compute credits.
179
  - ✅ 12 templates × 5 procgen variants = 72 scenarios live
180
  - ✅ Pytest suite green
181
  - ✅ `openenv validate .` green
182
- - ✅ HF Space deployed at `Madhav189-sre-env.hf.space`
183
  - ✅ Training notebook runs end-to-end on A100 80GB
184
  - ✅ Eval comparison cell produces hero bar + per-template chart
185
  - ✅ Trained Qwen2.5-7B adapter saved to `outputs/qwen25_7b_grpo_final/`
 
179
  - ✅ 12 templates × 5 procgen variants = 72 scenarios live
180
  - ✅ Pytest suite green
181
  - ✅ `openenv validate .` green
182
+ - ✅ HF Space deployed at `Madhav189-SystemTruth.hf.space`
183
  - ✅ Training notebook runs end-to-end on A100 80GB
184
  - ✅ Eval comparison cell produces hero bar + per-template chart
185
  - ✅ Trained Qwen2.5-7B adapter saved to `outputs/qwen25_7b_grpo_final/`
docs/blog/episode_lifecycle.png ADDED

Git LFS Details

  • SHA256: c2c6cb6f943bec6f05e710a392a522c16fe0ef6d07dd0b28e6e47b5ebe8d8342
  • Pointer size: 132 Bytes
  • Size of remote file: 1.65 MB
docs/blog/system_architecture.png ADDED

Git LFS Details

  • SHA256: 3053dde32d4b31c2ce8f1bded635c48ce27e9ed38c78d0733d74d1617193f8bb
  • Pointer size: 132 Bytes
  • Size of remote file: 1.8 MB
eval/results/qwen25_7b_comparison_hero.png ADDED

Git LFS Details

  • SHA256: 024960f21e6eaa4ad726bfc18fe7289d587dfdf12d2b02fd31fe2c4f8faaab03
  • Pointer size: 130 Bytes
  • Size of remote file: 68.4 kB
eval/results/qwen25_7b_comparison_per_template.png ADDED

Git LFS Details

  • SHA256: 11c9aae6e8ee1bbc8653c6a7f50971adb628e688ab4883647828c2f4f80c1706
  • Pointer size: 131 Bytes
  • Size of remote file: 104 kB
execution.md CHANGED
@@ -67,7 +67,7 @@ The honest framing: **the env is the project, the rubric is the engineering crow
67
  ## 2. Local setup
68
 
69
  ```bash
70
- git clone https://github.com/Madhav-GPT/sre-env.git
71
  cd sre-env
72
 
73
  python3 -m venv .venv
@@ -280,7 +280,7 @@ app_port: 7860
280
 
281
  ```bash
282
  # One-time: add the HF Space as a git remote
283
- git remote add hf https://huggingface.co/spaces/Madhav189/sre-env
284
 
285
  # Push (HF prompts for token if not cached)
286
  git push hf main
@@ -340,8 +340,8 @@ bash demo/run_demo.sh
340
 
341
  ## 12. Submission checklist
342
 
343
- - [x] Repo public on GitHub: https://github.com/Madhav-GPT/sre-env
344
- - [x] HF Space live: https://huggingface.co/spaces/Madhav189/sre-env
345
  - [x] BLOG.md at repo root
346
  - [x] 6 blog assets in `docs/blog/`
347
  - [x] Training notebook executed end-to-end, results in `eval/results/`
 
67
  ## 2. Local setup
68
 
69
  ```bash
70
+ git clone https://github.com/Madhav-GPT/SystemTruth.git
71
  cd sre-env
72
 
73
  python3 -m venv .venv
 
280
 
281
  ```bash
282
  # One-time: add the HF Space as a git remote
283
+ git remote add hf https://huggingface.co/spaces/Madhav189/SystemTruth
284
 
285
  # Push (HF prompts for token if not cached)
286
  git push hf main
 
340
 
341
  ## 12. Submission checklist
342
 
343
+ - [x] Repo public on GitHub: https://github.com/Madhav-GPT/SystemTruth
344
+ - [x] HF Space live: https://huggingface.co/spaces/Madhav189/SystemTruth
345
  - [x] BLOG.md at repo root
346
  - [x] 6 blog assets in `docs/blog/`
347
  - [x] Training notebook executed end-to-end, results in `eval/results/`
openenv.yaml CHANGED
@@ -116,7 +116,7 @@ training:
116
  templates_covered: 12 # all 12 Triage templates have ≥5 episodes
117
 
118
  huggingface:
119
- space_id: Madhav189/sre-env # canonical HF Space
120
- github_repo: Madhav-GPT/sre-env # canonical GitHub
121
  sdk: docker
122
  hardware: cpu-basic
 
116
  templates_covered: 12 # all 12 Triage templates have ≥5 episodes
117
 
118
  huggingface:
119
+ space_id: Madhav189/SystemTruth # canonical HF Space
120
+ github_repo: Madhav-GPT/SystemTruth # canonical GitHub
121
  sdk: docker
122
  hardware: cpu-basic