SystemTruth rebrand: bigger UI, new diagrams, theme-cohesive HF Space
- Rebrand sre-env -> SystemTruth across user-facing surfaces:
* GitHub: github.com/Madhav-GPT/SystemTruth
* HF Space: huggingface.co/spaces/Madhav189/SystemTruth
* Live URL: Madhav189-SystemTruth.hf.space
* HF Space header brand mark + tagline
The internal package name (sre_gym/) is preserved for backwards compatibility;
the user-facing surface is fully renamed.
- README rewrite: much bigger "what's in the box" section explaining
the three-tier USP (Triage compute / Strategy horizon / Operations
realism). Embeds the 4 visual assets:
* docs/blog/system_architecture.png (architecture diagram)
* docs/blog/episode_lifecycle.png (lifecycle diagram)
* eval/results/qwen25_7b_comparison_hero.png (hero bar)
* eval/results/qwen25_7b_comparison_per_template.png (per-template)
Adds the Colab badge link to
colab.research.google.com/drive/1Re9pwkabEP4Cearc2hCMMqGdEjSSUjGu
for the training notebook.
- HF Space UI fixes per user critique:
* Inputs lighter (#1f2630, was bg-input #161b22) and bigger
(padding 12px 14px, font-size 13px, min-height 42px, border-radius
4px) for usability.
* Tier cards now use phosphor-green accent (matches header brand mark)
instead of out-of-place blue. New text reads
"Triage / Strategy / Operations" instead of "Basic / Advanced / Max".
* Controls row now stacks vertically: run/stop/reset buttons on top,
metrics + rubric bar BELOW (was side-by-side, cramped).
* Terminal min-height reduced 480 -> 280 so it's visible above the
fold without scrolling on a typical laptop viewport.
* RUN EVAL button uses brand phosphor green (was generic Gradio
success-green); rubric bars + resolved-rate value also use the
brand color for theme cohesion.
- BLOG.md title updated: "sre-gym" -> "SystemTruth", body refs follow.
- execution.md: clone URL points at Madhav-GPT/SystemTruth.
- openenv.yaml: huggingface.space_id and github_repo updated.
- Eval result PNGs (qwen25_7b_comparison_hero + per_template)
added to git via .gitignore exception so the README + BLOG
references resolve without a separate fetch.
- .gitignore +3 -0
- BLOG.md +6 -6
- README.md +98 -133
- app.py +94 -73
- docs/TRIAGE_TIER.md +1 -1
- docs/blog/episode_lifecycle.png +3 -0
- docs/blog/system_architecture.png +3 -0
- eval/results/qwen25_7b_comparison_hero.png +3 -0
- eval/results/qwen25_7b_comparison_per_template.png +3 -0
- execution.md +4 -4
- openenv.yaml +2 -2
|
@@ -29,3 +29,6 @@ eval/results/*.png
|
|
| 29 |
eval/results/*.jsonl
|
| 30 |
!eval/results/.gitkeep
|
| 31 |
!eval/results/README.md
|
| 29 |
eval/results/*.jsonl
|
| 30 |
!eval/results/.gitkeep
|
| 31 |
!eval/results/README.md
|
| 32 |
+
# Exception: keep the canonical charts referenced from README/BLOG
|
| 33 |
+
!eval/results/qwen25_7b_comparison_hero.png
|
| 34 |
+
!eval/results/qwen25_7b_comparison_per_template.png
|
|
@@ -1,12 +1,12 @@
|
|
| 1 |
---
|
| 2 |
-
title: "
|
| 3 |
thumbnail: docs/blog/hero_three_tiers.png
|
| 4 |
authors:
|
| 5 |
- user: Madhav189
|
| 6 |
- user: dakshdoesdev
|
| 7 |
---
|
| 8 |
|
| 9 |
-
#
|
| 10 |
|
| 11 |
**TL;DR**
|
| 12 |
|
|
@@ -18,7 +18,7 @@ authors:
|
|
| 18 |
|
| 19 |
## Why this matters (read this first)
|
| 20 |
|
| 21 |
-
Calibrated incident-response is the capability gap. Every general-purpose LLM is bad at it: they hallucinate confident root causes, over-trust the loudest signal, skip verification, and declare incidents resolved before checking anything. Those failure modes are invisible in chat demos and catastrophic in production. **
|
| 22 |
|
| 23 |
We treat incident-response as a small **world-modelling** problem: the agent has to maintain a hidden-state estimate of which service is actually broken, update it from noisy observations, and commit to irreversible actions under uncertainty. The 5-component rubric grades the *mechanical signature* of that loop — evidence first, hypothesis with calibrated confidence, remediation, verification, only then resolution — instead of rewarding output that merely looks right.
|
| 24 |
|
|
@@ -37,7 +37,7 @@ ollama pull llama3.2
|
|
| 37 |
python -m sre_gym.local triage worker_deploy_cascade
|
| 38 |
```
|
| 39 |
|
| 40 |
-
The CLI drives `UnifiedIncidentEnvironment` directly and prints per-tick reward, the 5-component score breakdown, and a final summary. Same code path as the HF Space at `https://huggingface.co/spaces/Madhav189/
|
| 41 |
|
| 42 |
## Three tiers, three bottlenecks
|
| 43 |
|
|
@@ -197,12 +197,12 @@ Reading post-mortems, not blog posts. Fly.io's gossip-protocol deadlock from Oct
|
|
| 197 |
|
| 198 |
The Triage env is live. Pick a scenario, pick a model provider, watch each tick stream the action, env response, reward delta, and rubric breakdown.
|
| 199 |
|
| 200 |
-
<iframe src="https://Madhav189-
|
| 201 |
|
| 202 |
For the per-tier deep dives: [`docs/TRIAGE_TIER.md`](docs/TRIAGE_TIER.md) · [`docs/STRATEGY_TIER.md`](docs/STRATEGY_TIER.md) · [`docs/OPERATIONS_TIER.md`](docs/OPERATIONS_TIER.md). For the rubric defense: [`docs/REWARD_DESIGN.md`](docs/REWARD_DESIGN.md). For the architectural narrative: [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md). For the operator guide: [`execution.md`](execution.md). The training notebook lives at [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb).
|
| 203 |
|
| 204 |
## The claim
|
| 205 |
|
| 206 |
-
|
| 207 |
|
| 208 |
Built for the OpenEnv-class hackathon, India 2026 — by the Madhav-GPT / dakshdoesdev team. Apache 2.0.
|
|
|
|
| 1 |
---
|
| 2 |
+
title: "SystemTruth — three tiers of SRE incident-response, one rubric that won't let you fake it"
|
| 3 |
thumbnail: docs/blog/hero_three_tiers.png
|
| 4 |
authors:
|
| 5 |
- user: Madhav189
|
| 6 |
- user: dakshdoesdev
|
| 7 |
---
|
| 8 |
|
| 9 |
+
# SystemTruth — three tiers of SRE incident-response, one rubric that won't let you fake it
|
| 10 |
|
| 11 |
**TL;DR**
|
| 12 |
|
|
|
|
| 18 |
|
| 19 |
## Why this matters (read this first)
|
| 20 |
|
| 21 |
+
Calibrated incident-response is the capability gap. Every general-purpose LLM is bad at it: they hallucinate confident root causes, over-trust the loudest signal, skip verification, and declare incidents resolved before checking anything. Those failure modes are invisible in chat demos and catastrophic in production. **SystemTruth makes them legible enough to measure, then small enough to fix** — and exposes the env via the OpenEnv contract so any RL stack can train against it.
|
| 22 |
|
| 23 |
We treat incident-response as a small **world-modelling** problem: the agent has to maintain a hidden-state estimate of which service is actually broken, update it from noisy observations, and commit to irreversible actions under uncertainty. The 5-component rubric grades the *mechanical signature* of that loop — evidence first, hypothesis with calibrated confidence, remediation, verification, only then resolution — instead of rewarding output that merely looks right.
|
| 24 |
|
|
|
|
| 37 |
python -m sre_gym.local triage worker_deploy_cascade
|
| 38 |
```
|
| 39 |
|
| 40 |
+
The CLI drives `UnifiedIncidentEnvironment` directly and prints per-tick reward, the 5-component score breakdown, and a final summary. Same code path as the HF Space at `https://huggingface.co/spaces/Madhav189/SystemTruth` — just without the Gradio UI in front of it.
|
| 41 |
|
| 42 |
## Three tiers, three bottlenecks
|
| 43 |
|
|
|
|
| 197 |
|
| 198 |
The Triage env is live. Pick a scenario, pick a model provider, watch each tick stream the action, env response, reward delta, and rubric breakdown.
|
| 199 |
|
| 200 |
+
<iframe src="https://Madhav189-SystemTruth.hf.space" frameborder="0" width="100%" height="800"></iframe>
|
| 201 |
|
| 202 |
For the per-tier deep dives: [`docs/TRIAGE_TIER.md`](docs/TRIAGE_TIER.md) · [`docs/STRATEGY_TIER.md`](docs/STRATEGY_TIER.md) · [`docs/OPERATIONS_TIER.md`](docs/OPERATIONS_TIER.md). For the rubric defense: [`docs/REWARD_DESIGN.md`](docs/REWARD_DESIGN.md). For the architectural narrative: [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md). For the operator guide: [`execution.md`](execution.md). The training notebook lives at [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb).
|
| 203 |
|
| 204 |
## The claim
|
| 205 |
|
| 206 |
+
SystemTruth is the first SRE training environment that grades calibrated confidence as a first-class signal. The rubric tells you exactly where your model is bluffing — to two decimal places, on every commit, with a CI invariant that fails the build if the heuristic ceiling drifts out of band. Train against it and the hidden-state estimate inside your model gets sharper episode by episode. Skip the rubric and your agent stays a chat-window demo.
|
| 207 |
|
| 208 |
Built for the OpenEnv-class hackathon, India 2026 — by the Madhav-GPT / dakshdoesdev team. Apache 2.0.
|
|
@@ -1,5 +1,5 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
emoji: 🚨
|
| 4 |
colorFrom: red
|
| 5 |
colorTo: yellow
|
|
@@ -9,72 +9,44 @@ pinned: false
|
|
| 9 |
license: apache-2.0
|
| 10 |
---
|
| 11 |
|
| 12 |
-
#
|
| 13 |
|
| 14 |
> **Hackathon submission — OpenEnv-class, India 2026**
|
| 15 |
>
|
| 16 |
> - 📖 **Blog:** [BLOG.md](BLOG.md)
|
| 17 |
-
> - 🚀 **Live HF Space:** https://huggingface.co/spaces/Madhav189/
|
| 18 |
-
> - 💻 **GitHub:** https://github.com/Madhav-GPT/
|
| 19 |
-
> - 🧪 **Training notebook:** [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb)
|
| 20 |
> - 📊 **Eval results:** [`eval/results/`](eval/results/)
|
| 21 |
> - 📜 **License:** Apache 2.0
|
| 22 |
|
| 23 |
**Each tier escalates a different dimension. Triage escalates compute, Strategy escalates horizon, Operations escalates realism.** That single sentence is the load-bearing claim of the project.
|
| 24 |
|
| 25 |
-
The repo's centre of gravity, in priority order:
|
| 26 |
-
|
| 27 |
-
1. **The environment** — 12 incident templates × 6 procgen variants = 72 deterministic scenarios, exposed via the OpenEnv contract (`/reset` / `/step`) on a FastAPI server. Same code path serves the Gradio UI mounted at `/`.
|
| 28 |
-
2. **The reward rubric** — a 5-component composite that sums to exactly 1.0, with a heuristic ceiling pinned to `[0.65, 0.80]` and a scripted-expert floor at `≥0.90`, both enforced by CI invariants on every commit. Includes a calibration term inside `submit_hypothesis` that grades confident-wrong twice as harshly as hedged-wrong — a small **world-modelling** primitive: the agent has to maintain a belief over root causes and emit a calibrated confidence estimate.
|
| 29 |
-
3. **Coliseum** — a parallel-rollout pool server that turns the env into a lease-based HTTP service so any GRPO trainer can drive K-rollouts-per-scenario without holding a Python env per worker.
|
| 30 |
-
4. **Training & datasets (the honest weak point)** — an end-to-end SFT → GRPO pipeline on Qwen2.5-7B-Instruct, trained against a 120-episode trajectory corpus harvested from the env. The pipeline runs cleanly; the corpus and step budget are smaller than they need to be to break the heuristic ceiling on held-out scenarios. **The env is ready to train against; we ran out of compute before the model was.**
|
| 31 |
-
|
| 32 |
-
---
|
| 33 |
-
|
| 34 |
-
## What's in the box
|
| 35 |
-
|
| 36 |
-
| Tier | Runnable kind | Scenarios | What "running" means |
|
| 37 |
-
|---|---|---|---|
|
| 38 |
-
| **Triage** | live HTTP env | 12 templates × 6 entries each (1 base + 5 procgen) = **72 scenarios** | `/reset` + `/step` against the FastAPI server in this Docker image. The Gradio UI drives episodes end-to-end via the same routes. |
|
| 39 |
-
| **Strategy** | Python orchestrator | 3 reference YAML scenarios | `sre_gym.strategy.runner.run_strategy` chains Triage episodes together, threading horizon state (unresolved alerts, pending deploys, tech-debt counter, horizon-decay reward). The 28-action universe in the YAML is design spec; the runner uses the Triage 11 actions. |
|
| 40 |
-
| **Operations** | Python state-machine simulator | 1 family with 11 chaos patterns | `sre_gym.operations.runner.run_operations` mutates an in-memory 22-node service graph. Same Triage 11 actions. The compose stack alongside the simulator describes the topology an enterprise team would lift into a real cluster — the simulator runs without that lift. |
|
| 41 |
-
|
| 42 |
-
The escalation axis is the point: each tier hardens a different bottleneck of building SRE agents in production.
|
| 43 |
-
|
| 44 |
---
|
| 45 |
|
| 46 |
-
##
|
| 47 |
-
|
| 48 |
-
### 5-minute local demo (no API keys, no server, no GPU)
|
| 49 |
-
|
| 50 |
-
```bash
|
| 51 |
-
pip install -e .
|
| 52 |
-
ollama pull llama3.2
|
| 53 |
-
python -m sre_gym.local triage worker_deploy_cascade
|
| 54 |
-
```
|
| 55 |
|
| 56 |
-
|
| 57 |
|
| 58 |
-
|
| 59 |
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
### Local server + Gradio UI
|
| 63 |
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
```
|
| 70 |
|
| 71 |
-
The
|
| 72 |
|
| 73 |
-
---
|
| 74 |
|
| 75 |
-
## The
|
| 76 |
|
| 77 |
-
|
| 78 |
|
| 79 |
```
|
| 80 |
query_logs(service) query_metrics(service, metric)
|
|
@@ -85,17 +57,33 @@ submit_hypothesis(hypothesis) escalate
|
|
| 85 |
declare_resolved
|
| 86 |
```
|
| 87 |
|
| 88 |
-
A successful episode
|
| 89 |
|
| 90 |
-
|
| 91 |
|
| 92 |
-
The
|
| 93 |
|
| 94 |
-
|
| 95 |
|
| 96 |
-
|
| 97 |
|
| 98 |
-
|
| 99 |
|
| 100 |
```
|
| 101 |
final_reward = 0.45·outcome
|
|
@@ -113,14 +101,7 @@ final_reward = 0.45·outcome
|
|
| 113 |
efficiency exp(-current_tick / optimal_ticks_for_template)
|
| 114 |
```
|
| 115 |
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
Two reference scores anchor the rubric and are CI-pinned:
|
| 119 |
-
|
| 120 |
-
- **Heuristic ceiling `[0.65, 0.80]`** — a naive policy that gathers evidence and submits the correct hypothesis but never remediates lands here. Enforced by `test_heuristic_ceiling_is_in_band` across all 12 templates. The 0.20 gap from 0.80 → 1.00 is the GRPO training target.
|
| 121 |
-
- **Scripted-expert floor `≥0.90`** — the optimal canonical solve scores ~0.94 on every template. Enforced by `test_round2_baseline_resolves`.
|
| 122 |
-
|
| 123 |
-
Adversarial cheats are first-class:
|
| 124 |
|
| 125 |
| Cheat strategy | Blocked by |
|
| 126 |
|---|---|
|
|
@@ -145,13 +126,19 @@ calibration awards
|
|
| 145 |
-1.0 confident-wrong
|
| 146 |
```
|
| 147 |
|
| 148 |
-
The `confidence ∈ [0,1]` field is part of the structured `HypothesisPayload` Pydantic model the agent emits; the calibration sub-term reads it directly. **A model that bluffs high confidence on a wrong root cause is worse than one that hedges.** That's the world-modelling primitive — the env
|
| 149 |
|
| 150 |
---
|
| 151 |
|
| 152 |
## Coliseum — parallel-rollout pool server
|
| 153 |
|
| 154 |
-
[`coliseum/`](coliseum/) wraps the Triage env in a lease-based HTTP contract so a GRPO trainer's parallel-rollout side can drive the env without holding an in-process
|
| 155 |
|
| 156 |
```
|
| 157 |
allocate(task_key) -> {ok: true, lease_id}
|
|
@@ -177,11 +164,11 @@ Standard lease-pool pattern — see [`coliseum/README.md`](coliseum/README.md) f
|
|
| 177 |
|
| 178 |
## Training & datasets — the honest weak point
|
| 179 |
|
| 180 |
-
The environment is the project. Training scripts orbit around it. We ran a real end-to-end Triage run on the day of the deadline — pipeline works, results are below the heuristic, the gap is real, and we're saying so.
|
| 181 |
|
| 182 |
### What we ran
|
| 183 |
|
| 184 |
-
Pipeline lives in [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb) (
|
| 185 |
|
| 186 |
1. **SFT cold-start** — Unsloth + Qwen2.5-7B-Instruct (4-bit) + LoRA r=32, 50 steps × batch 16 on a 999-example diverse trajectory corpus. Eval perplexity 1.755 (healthy `[1.5, 3.0]` band). Saved to `outputs/qwen25_7b_sft_final/`.
|
| 187 |
2. **GRPO online** — TRL's `GRPOTrainer`, 40 steps × K=2 rollouts. Reward = composite + scenario-aware first-action bonus. Saved to `outputs/qwen25_7b_grpo_final/`.
|
|
@@ -189,6 +176,8 @@ Pipeline lives in [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/0
|
|
| 189 |
|
| 190 |
### What it produced
|
| 191 |
|
| 192 |
| policy | mean | median | p25 | p75 | resolved_rate |
|
| 193 |
|---|---|---|---|---|---|
|
| 194 |
| random | 0.342 | 0.378 | 0.340 | 0.380 | 0/36 |
|
|
@@ -197,6 +186,8 @@ Pipeline lives in [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/0
|
|
| 197 |
| heuristic (queries + correct hypothesis, no remediation) | 0.704 | 0.705 | 0.703 | 0.705 | 0/36 |
|
| 198 |
| scripted-optimal | 0.938 | 0.939 | 0.937 | 0.940 | 36/36 |
|
| 199 |
|
| 200 |
Honest reading:
|
| 201 |
- SFT lifted the model 11% above random (0.342 → 0.379). Format-learning worked: the trained model produces 100% schema-valid `action_type` JSON.
|
| 202 |
- GRPO did not move the mean further on K=2 / 40 steps / 7B. The advantage signal exists but the budget is too small to overcome the heuristic plateau at 0.704.
|
|
@@ -210,6 +201,35 @@ The training scripts in [`train/`](train/) are working as written for the datase
|
|
| 210 |
|
| 211 |
---
|
| 212 |
|
| 213 |
## Two-paths agent design
|
| 214 |
|
| 215 |
The repo ships **two independent paths** to a working SRE agent. They share the env contract but trade compute for capability differently.
|
|
@@ -223,22 +243,12 @@ ln -s "$PWD/skill" "$HOME/.claude/skills/sre-gym"
|
|
| 223 |
bash demo/run_demo.sh # end-to-end demo
|
| 224 |
```
|
| 225 |
|
| 226 |
-
12 verified-runbook drafts ship in [`skill/verified-runbooks/`](skill/verified-runbooks/) — one per Triage template. The skill validates them by re-running the env after each solve.
|
| 227 |
-
|
| 228 |
### Path B — GRPO-trained adapter (one A100, ~2–3h on 7B)
|
| 229 |
|
| 230 |
The training pipeline above. Path A is what an agent ships *today*; Path B is what raises the floor on the templates it sees over and over.
|
| 231 |
|
| 232 |
---
|
| 233 |
|
| 234 |
-
## The HF Space UI
|
| 235 |
-
|
| 236 |
-
The Gradio app at https://huggingface.co/spaces/Madhav189/sre-env is mounted at `/` of the same uvicorn process that serves `/reset` + `/step`. Three tiers selectable as cards (Triage live HTTP, Strategy chained-episode runner, Operations graph simulator). Every run streams per-tick action, env response, reward delta, and the 5-component breakdown — `out=… valid=… fmt=… anti=… eff=…`.
|
| 237 |
-
|
| 238 |
-
Provider auth is whatever the user pastes (HF token plus optional Anthropic / OpenAI / Together / Fireworks / Groq / DeepSeek key). Tokens live only on the request instance — never logged, never persisted, never echoed in error messages. CSS theme is GitHub-dark phosphor on a JetBrains Mono base; see [`app.py`](app.py) for the full styling block.
|
| 239 |
-
|
| 240 |
-
---
|
| 241 |
-
|
| 242 |
## Tier-aware Python API
|
| 243 |
|
| 244 |
```python
|
|
@@ -268,74 +278,29 @@ Old tier names (`Tier.BASIC`, `Tier.ADVANCED`, `Tier.MAX`) are preserved as Enum
|
|
| 268 |
|
| 269 |
```bash
|
| 270 |
make test # green at HEAD
|
| 271 |
-
ruff check .
|
| 272 |
openenv validate . # green
|
| 273 |
```
|
| 274 |
|
| 275 |
The two CI invariants that keep the rubric calibrated:
|
| 276 |
|
| 277 |
- `test_heuristic_ceiling_is_in_band` — naive heuristic in `[0.65, 0.80]` on every template.
|
| 278 |
-
- `test_round2_baseline_resolves` — scripted-optimal `≥ 0.90` on the
|
| 279 |
-
|
| 280 |
-
Tighten either side and the gradient signal collapses; loosen either side and a memorising model can game the rubric without learning causality. The band is the load-bearing engineering claim.
|
| 281 |
-
|
| 282 |
-
---
|
| 283 |
-
|
| 284 |
-
## Architecture
|
| 285 |
-
|
| 286 |
-
```
|
| 287 |
-
┌──────────────────────────────────────────────────────────────┐
|
| 288 |
-
│ app.py (uvicorn app:app on port 7860) │
|
| 289 |
-
│ ├─ Gradio terminal UI mounted at / │
|
| 290 |
-
│ └─ FastAPI server (unified_incident_env.server.app) │
|
| 291 |
-
│ ├─ /reset /step /state OpenEnv contract │
|
| 292 |
-
│ ├─ /tasks /baseline /grader catalogue + scoring │
|
| 293 |
-
│ ├─ /status /health ops probes │
|
| 294 |
-
│ ├─ /metadata /schema OpenEnv metadata │
|
| 295 |
-
│ ├─ /mcp JSON-RPC 2.0 dual-route │
|
| 296 |
-
│ ├─ /docs /redoc /openapi.json Swagger / ReDoc │
|
| 297 |
-
│ └─ /info /simple legacy markdown landing │
|
| 298 |
-
│ │
|
| 299 |
-
│ sre_gym/ │
|
| 300 |
-
│ ├─ tier.py Tier enum + TierConfig │
|
| 301 |
-
│ ├─ env.py SREGym factory (delegates per t.)│
|
| 302 |
-
│ ├─ basic_runner.py wrap UnifiedIncidentEnvironment │
|
| 303 |
-
│ ├─ strategy/runner.py chain Triage episodes + horizon │
|
| 304 |
-
│ ├─ operations/runner.py Python state-machine over 22 nd. │
|
| 305 |
-
│ ├─ ui/ providers, router, policies │
|
| 306 |
-
│ ├─ local.py in-process CLI for Ollama models │
|
| 307 |
-
│ └─ exceptions.py typed errors │
|
| 308 |
-
│ │
|
| 309 |
-
│ coliseum/ parallel-rollout pool server │
|
| 310 |
-
│ ├─ server.py FastAPI lease pool │
|
| 311 |
-
│ └─ client.py ArenaClient + create_arena_client│
|
| 312 |
-
│ │
|
| 313 |
-
│ notebooks/ │
|
| 314 |
-
│ └─ 01_triage_train_grpo_qwen25_7b.ipynb SFT → GRPO pipe. │
|
| 315 |
-
│ │
|
| 316 |
-
│ skill/ Claude Code skill (Path A) │
|
| 317 |
-
│ ├─ SKILL.md agent instructions │
|
| 318 |
-
│ ├─ tools/ sre-gym HTTP client │
|
| 319 |
-
│ └─ verified-runbooks/ 12 per-template runbooks │
|
| 320 |
-
└──────────────────────────────────────────────────────────────┘
|
| 321 |
-
```
|
| 322 |
-
|
| 323 |
-
Per-tier deep dives in [`docs/TRIAGE_TIER.md`](docs/TRIAGE_TIER.md) / [`docs/STRATEGY_TIER.md`](docs/STRATEGY_TIER.md) / [`docs/OPERATIONS_TIER.md`](docs/OPERATIONS_TIER.md). Reward design: [`docs/REWARD_DESIGN.md`](docs/REWARD_DESIGN.md). Operator guide: [`execution.md`](execution.md). Architectural narrative: [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md). Blog: [`BLOG.md`](BLOG.md).
|
| 324 |
|
| 325 |
---
|
| 326 |
|
| 327 |
## Materials
|
| 328 |
|
| 329 |
-
- [`
|
| 330 |
-
- [`
|
| 331 |
-
- [`docs/`](docs/) — architecture, per-tier deep dives, reward design, scenario authoring
|
| 332 |
-
- [`docs/blog/`](docs/blog/) —
|
| 333 |
-
- [`skill/`](skill/) — Claude Code skill packaging (Path A)
|
| 334 |
-
- [`coliseum/`](coliseum/) — parallel-rollout pool server
|
| 335 |
-
- [`demo/`](demo/) — `run_demo.sh` end-to-end demo, `pitch.md` narrative
|
| 336 |
-
- [`eval/`](eval/) — held-out split definition, results directory
|
| 337 |
-
- [`train/data/`](train/data/) — teacher trajectories (Claude Opus + Llama-3.3-70B + scripted baselines + 120-episode v2 corpus)
|
| 338 |
-
- [`notebooks/`](notebooks/) — Triage SFT→GRPO training (`01_triage_train_grpo_qwen25_7b.ipynb`), eval comparison (`02_triage_eval_compare_all.ipynb`), Strategy + Operations walkthroughs
|
| 339 |
|
| 340 |
---
|
| 341 |
|
|
|
|
| 1 |
---
|
| 2 |
+
title: SystemTruth
|
| 3 |
emoji: 🚨
|
| 4 |
colorFrom: red
|
| 5 |
colorTo: yellow
|
|
|
|
| 9 |
license: apache-2.0
|
| 10 |
---
|
| 11 |
|
| 12 |
+
# SystemTruth — a tier-escalating SRE training environment
|
| 13 |
|
| 14 |
> **Hackathon submission — OpenEnv-class, India 2026**
|
| 15 |
>
|
| 16 |
> - 📖 **Blog:** [BLOG.md](BLOG.md)
|
| 17 |
+
> - 🚀 **Live HF Space:** https://huggingface.co/spaces/Madhav189/SystemTruth
|
| 18 |
+
> - 💻 **GitHub:** https://github.com/Madhav-GPT/SystemTruth
|
| 19 |
+
> - 🧪 **Training notebook (Colab):** [](https://colab.research.google.com/drive/1Re9pwkabEP4Cearc2hCMMqGdEjSSUjGu?usp=sharing) — same as [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb)
|
| 20 |
> - 📊 **Eval results:** [`eval/results/`](eval/results/)
|
| 21 |
> - 📜 **License:** Apache 2.0
|
| 22 |
|
| 23 |
**Each tier escalates a different dimension. Triage escalates compute, Strategy escalates horizon, Operations escalates realism.** That single sentence is the load-bearing claim of the project.
|
| 24 |
|
| 25 |
---
|
| 26 |
|
| 27 |
+
## What's in the box (the USP — read this first)
|
| 28 |
|
| 29 |
+
SystemTruth is **one runnable RL environment with three personas baked into it**. The same 11-action contract, the same 5-component reward rubric, the same termination shape — escalated along three orthogonal axes that map to the three real bottlenecks SRE-agent training loops actually hit.
|
| 30 |
|
| 31 |
+

|
| 32 |
|
| 33 |
+
### One environment, three tiers, three different bottlenecks
|
| 34 |
|
| 35 |
+
| Tier | Bottleneck | Persona | What it teaches |
|
| 36 |
+
|---|---|---|---|
|
| 37 |
+
| **Triage** | **Compute** | ML student / Kaggle, $30 of HF credits | causal mapping under tight context — pre-digested observations, dense reward shaping, 8K context, 11-action space, 8–13 ticks per episode |
|
| 38 |
+
| **Strategy** | **Horizon** | Seed-stage startup, $300–500 budget | long-horizon planning across chained incidents — multi-incident chains with persistent state, unresolved alerts and pending deploys carry forward, 60–90 ticks |
|
| 39 |
+
| **Operations** | **Realism** | Enterprise SRE platform, 8×A100/H100 cluster | authentic tool use against irreversible actions — 22-node service graph, 11 chaos patterns pinned to real production post-mortems, 110–180+ actions per episode |
|
|
|
|
| 40 |
|
| 41 |
+
The escalation axis is the entire pitch. Most RL environments stratify by *difficulty* (more scenarios, longer episodes, harder rewards). SystemTruth stratifies by **the dimension that actually limits the training loop for that persona**:
|
| 42 |
|
| 43 |
+
- A junior on-call learning to triage faces a different problem (cognitive efficiency under tight context) than a senior SRE running a multi-incident postmortem (state tracking across long horizons), which faces a different problem from an enterprise platform team operating against an actively chaos-engineered cluster (irreversible actions, partial observability, real wall-clock).
|
| 44 |
+
- Their training signals, episode shapes, observation richness, and reward structures should not look the same.
|
| 45 |
+
- SystemTruth takes that observation seriously and stratifies its tiers along *the dimension that actually limits the persona's training loop*.
|
| 46 |
|
| 47 |
+
### The shared 11-action contract
|
| 48 |
|
| 49 |
+
Every tier — Triage, Strategy, Operations — speaks the same eleven Pydantic-validated actions. **One contract, three escalation envelopes:**
|
| 50 |
|
| 51 |
```
|
| 52 |
query_logs(service) query_metrics(service, metric)
|
|
|
|
| 57 |
declare_resolved
|
| 58 |
```
|
| 59 |
|
| 60 |
+
A successful episode is `gather evidence → submit_hypothesis → rollback_deploy → restart_service → both run_check pass → declare_resolved`. Wrong rollback target, premature restart, or premature resolution all return negative reward and a typed `failure_type`. The contract refuses to be gamed.
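The ordering guards described above can be sketched as a tiny state machine. This is only an illustration, not the env's actual code: `EpisodeState`, `apply_action`, and the `failure_type` strings are hypothetical names, and the premature-restart penalty value is a placeholder. The 0.08 and 0.20 figures are the "medium" penalties quoted in the lifecycle list.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# 0.08 and 0.20 are the "medium" penalties quoted in the lifecycle list;
# the premature-restart value is a made-up placeholder.
UNSAFE_ACTION_PENALTY = 0.08
PREMATURE_RESOLUTION_PENALTY = 0.20
PREMATURE_RESTART_PENALTY = 0.05  # hypothetical


@dataclass
class EpisodeState:
    """Hypothetical minimal state; the real env tracks far more."""
    faulty_service: str
    cause_removed: bool = False
    restarted: bool = False
    checks_passed: bool = False
    resolved: bool = False


def apply_action(
    state: EpisodeState, action: str, target: Optional[str] = None
) -> Tuple[float, Optional[str]]:
    """Return (reward_delta, failure_type) for one remediation-phase action."""
    if action == "rollback_deploy":
        if target != state.faulty_service:
            return -UNSAFE_ACTION_PENALTY, "wrong_rollback_target"
        state.cause_removed = True
        return 0.0, None
    if action == "restart_service":
        if not state.cause_removed:
            # Restarting before the cause is removed re-inherits the bad config.
            return -PREMATURE_RESTART_PENALTY, "premature_restart"
        state.restarted = True
        return 0.0, None
    if action == "run_check":
        state.checks_passed = state.cause_removed and state.restarted
        return 0.0, None
    if action == "declare_resolved":
        if not state.checks_passed:
            return -PREMATURE_RESOLUTION_PENALTY, "premature_resolution"
        state.resolved = True
        return 0.0, None
    raise ValueError(f"unknown action: {action}")
```

Returning the failure type alongside the reward, rather than raising, keeps the episode running after a bad action, which is what lets a penalty act as a training signal instead of a crash.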
|
| 61 |
|
| 62 |
+
### The episode lifecycle, illustrated
|
| 63 |
|
| 64 |
+
The lifecycle below is the Triage tier in detail; Strategy chains N of them with horizon-decay, Operations runs one of them inside a graph-mutation simulator. **The shape is shared across all three tiers** — the simulator under it is what changes.
|
| 65 |
|
| 66 |
+

|
| 67 |
|
| 68 |
+
Eleven numbered stages, each producing a measurable signal:
|
| 69 |
|
| 70 |
+
1. **`reset(scenario_id)`** — env emits the initial observation: tick counter, workflow stage, incident summary, active alerts, noise alerts (decoys), service health (cpu/mem/err/latency), user impact, SLO burn rate, checks, allowed actions.
|
| 71 |
+
2. **Evidence gathering loop** — agent calls `query_logs / query_metrics / query_dependencies / query_deploys`. After every step the env computes a per-tick **shaped reward** as a potential difference (`Δ critical_service_health × 0.55 + Δ (1 − user_impact) × 0.20 + Δ (1 − slo_burn_rate) × 0.15 + containment_applied × 0.10`) minus `step_cost`, plus `bonus`, minus `penalty`.
|
| 72 |
+
3. **`submit_hypothesis(root_cause, affected_services, confidence, recommended_next_action)`** — the world-modelling primitive. Confidence is a `float ∈ [0,1]` the agent must commit to.
|
| 73 |
+
4. **Hypothesis correctness check** — if the root cause matches truth, the agent gets an in-episode bonus up to ~0.12 (idempotent — second identical hypothesis scores 0). If wrong, the agent loops back to investigation with a new observation.
|
| 74 |
+
5. **`rollback_deploy(service)`** — the irreversible action. Wrong target = `unsafe_action_penalty` (0.08 medium / 0.12 hard). Correct target sets `cause_removed = True` and unblocks restart.
|
| 75 |
+
6. **`restart_service(service)`** — only valid if scenario requires it. Guard: if cause not removed, premature-restart penalty fires and state re-inherits the bad config.
|
| 76 |
+
7. **`run_check("end_to_end" | "database_recovery")`** — verification gate. If checks fail, agent loops back to investigation.
|
| 77 |
+
8. **`declare_resolved`** — terminal action. Guard: if checks not passed, `premature_resolution_penalty` (0.20 / 0.30) fires.
|
| 78 |
+
9. **Episode terminates** — terminal state emitted.
|
| 79 |
+
10. **Compute composite from terminal state** — the 5-component rubric below evaluates outcome / action_validity / format / anticheat / efficiency, sums to 1.0 with weighted clamping to `[0.01, 0.99]`.
|
| 80 |
+
11. **Reference scores anchor the rubric** — random `0.417` (0/36 resolved), naive heuristic `0.749` (0/12 resolved), scripted-optimal `0.938` (12/12 resolved). The 0.20 gap from `0.80 → 1.00` is what GRPO trains into.
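The per-tick shaped reward from stage 2 can be read as a difference of a potential function between consecutive states. The sketch below takes that reading as an assumption: the four weights come from the formula in the list, while `Snapshot`, `potential`, `shaped_reward`, and the default `step_cost` are illustrative names and values, not the env's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical slice of the observation used for reward shaping."""
    critical_service_health: float  # 0.0 (down) .. 1.0 (healthy)
    user_impact: float              # 0.0 (none) .. 1.0 (total)
    slo_burn_rate: float            # 0.0 (none) .. 1.0 (burning)
    containment_applied: bool


def potential(s: Snapshot) -> float:
    """Weighted health potential; weights are those quoted in stage 2."""
    return (
        0.55 * s.critical_service_health
        + 0.20 * (1.0 - s.user_impact)
        + 0.15 * (1.0 - s.slo_burn_rate)
        + 0.10 * float(s.containment_applied)
    )


def shaped_reward(
    prev: Snapshot,
    curr: Snapshot,
    step_cost: float = 0.01,  # assumed value, not from the source
    bonus: float = 0.0,
    penalty: float = 0.0,
) -> float:
    """Potential difference minus step cost, plus bonus, minus penalty."""
    return potential(curr) - potential(prev) - step_cost + bonus - penalty
```

Under this reading, a tick that changes nothing costs exactly `step_cost`, so idle querying is mildly discouraged while any real improvement in service health dominates the reward.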
|
| 81 |
+
|
| 82 |
+
**Cross-tier extension:**
|
| 83 |
+
- **Strategy** chains N Triage episodes, applies a `horizon_decay_factor × mean(per-phase composite)` to the final reward.
|
| 84 |
+
- **Operations** runs the same lifecycle inside a graph-mutation simulator over a 22-node service topology, same rubric, same horizon-decay weighting.
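As a rough illustration of the Strategy aggregation, the final reward is the decay factor times the mean of the per-phase composites. How `horizon_decay_factor` itself is derived is not stated here, so the sketch below takes it as an input; `strategy_reward` is a hypothetical helper name.

```python
from statistics import fmean
from typing import Sequence


def strategy_reward(
    phase_composites: Sequence[float], horizon_decay_factor: float
) -> float:
    """horizon_decay_factor x mean(per-phase composite), as described above."""
    if not phase_composites:
        raise ValueError("at least one phase composite is required")
    return horizon_decay_factor * fmean(phase_composites)
```

For example, three chained Triage phases scoring 0.9, 0.7, and 0.8 with a decay factor of 0.9 would average 0.8 and yield a final reward of about 0.72.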
|
| 85 |
+
|
| 86 |
+
### Reward rubric — the engineering crown jewel
|
| 87 |
|
| 88 |
```
|
| 89 |
final_reward = 0.45·outcome
|
|
|
|
| 101 |
efficiency exp(-current_tick / optimal_ticks_for_template)
|
| 102 |
```
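A small sketch of how the efficiency term and the weighted sum might be computed. Only the 0.45 outcome weight, the `exp(-current_tick / optimal_ticks_for_template)` form, and the `[0.01, 0.99]` clamp come from this document; the other four weights in `EXAMPLE_WEIGHTS` are placeholders chosen so the total is 1.0, and the function names are hypothetical.

```python
import math
from typing import Mapping

# Only the 0.45 outcome weight is taken from the rubric; the remaining
# weights are placeholders that make the total exactly 1.0.
EXAMPLE_WEIGHTS = {
    "outcome": 0.45,
    "action_validity": 0.20,  # hypothetical
    "format": 0.10,           # hypothetical
    "anticheat": 0.15,        # hypothetical
    "efficiency": 0.10,       # hypothetical
}


def efficiency(current_tick: int, optimal_ticks: int) -> float:
    """Exponential decay in elapsed ticks relative to the template optimum."""
    if optimal_ticks <= 0:
        raise ValueError("optimal_ticks must be positive")
    return math.exp(-current_tick / optimal_ticks)


def composite(
    components: Mapping[str, float],
    weights: Mapping[str, float] = EXAMPLE_WEIGHTS,
    lo: float = 0.01,
    hi: float = 0.99,
) -> float:
    """Weighted sum of rubric components, clamped to [lo, hi]."""
    if set(components) != set(weights):
        raise ValueError("components and weights must share the same keys")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    raw = sum(weights[name] * components[name] for name in weights)
    return min(hi, max(lo, raw))
```

The clamp means even a perfect episode reports 0.99 and a fully failed one reports 0.01, which keeps the reward strictly inside the open unit interval.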
|
| 103 |
|
| 104 |
+
Each component defends against a specific cheat:
|
| 105 |
|
| 106 |
| Cheat strategy | Blocked by |
|
| 107 |
|---|---|
|
|
|
|
| 126 |
-1.0 confident-wrong
|
| 127 |
```
|
| 128 |
|
| 129 |
+
The `confidence ∈ [0,1]` field is part of the structured `HypothesisPayload` Pydantic model the agent emits; the calibration sub-term reads it directly. **A model that bluffs high confidence on a wrong root cause is worse than one that hedges.** That's the world-modelling primitive — the env grades the agent's belief, not just its prediction.
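One way to picture the calibration sub-term is a four-way table keyed on correctness and stated confidence. In the sketch below only the `-1.0` confident-wrong award is taken from the rubric block above; the confidence threshold and the other three values are assumptions, picked so that confident-wrong costs twice as much as hedged-wrong. `calibration_award` is a hypothetical name.

```python
def calibration_award(
    correct: bool, confidence: float, confident_at: float = 0.7
) -> float:
    """Score a hypothesis by correctness and stated confidence.

    Only the -1.0 confident-wrong value is from the rubric; the threshold
    and the remaining awards are illustrative assumptions.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    confident = confidence >= confident_at
    if correct:
        return 1.0 if confident else 0.5   # assumed values
    return -1.0 if confident else -0.5     # -1.0 from the rubric; -0.5 assumed
```

With these numbers, a wrong hypothesis at confidence 0.95 scores -1.0 while the same wrong hypothesis at 0.3 scores -0.5, so hedging on an uncertain root cause is always the better move.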
|
| 130 |
+
|
| 131 |
+
Two CI invariants pin the rubric in place on every commit:
|
| 132 |
+
- **Heuristic ceiling `[0.65, 0.80]`** — `test_heuristic_ceiling_is_in_band` enforces this band on every template. The 0.20 gap from 0.80 → 1.00 is the GRPO training target.
|
| 133 |
+
- **Scripted-expert floor `≥0.90`** — `test_round2_baseline_resolves` enforces ≥0.90 on the round-2 templates.
|
| 134 |
+
|
| 135 |
+
Tighten either side and the gradient signal collapses; loosen either side and a memorising model can game the rubric. The band is the load-bearing engineering claim.
|
| 136 |
|
| 137 |
---
|
| 138 |
|
| 139 |
## Coliseum — parallel-rollout pool server
|
| 140 |
|
| 141 |
+
[`coliseum/`](coliseum/) wraps the Triage env in a lease-based HTTP contract so a GRPO trainer's parallel-rollout side can drive the env without holding an in-process Python instance per worker:
|
| 142 |
|
| 143 |
```
|
| 144 |
allocate(task_key) -> {ok: true, lease_id}
|
|
|
|
| 164 |
|
| 165 |
## Training & datasets — the honest weak point
|
| 166 |
|
| 167 |
+
The environment is the project. Training scripts orbit around it. We ran a real end-to-end Triage run on the day of the deadline — pipeline works, results are below the heuristic plateau, the gap is real, and we're saying so.
|
| 168 |
|
| 169 |
### What we ran
|
| 170 |
|
| 171 |
+
The pipeline lives in [`notebooks/01_triage_train_grpo_qwen25_7b.ipynb`](notebooks/01_triage_train_grpo_qwen25_7b.ipynb) and can also be opened in Colab: [Open in Colab](https://colab.research.google.com/drive/1Re9pwkabEP4Cearc2hCMMqGdEjSSUjGu?usp=sharing). Target hardware is one A100 80GB, ~2-3h end-to-end:
|
| 172 |
|
| 173 |
1. **SFT cold-start** — Unsloth + Qwen2.5-7B-Instruct (4-bit) + LoRA r=32, 50 steps × batch 16 on a 999-example diverse trajectory corpus. Eval perplexity 1.755 (healthy `[1.5, 3.0]` band). Saved to `outputs/qwen25_7b_sft_final/`.
|
| 174 |
2. **GRPO online** — TRL's `GRPOTrainer`, 40 steps × K=2 rollouts. Reward = composite + scenario-aware first-action bonus. Saved to `outputs/qwen25_7b_grpo_final/`.
|
|
|
|
| 176 |
|
| 177 |
### What it produced
|
| 178 |
|
| 179 |
+

|
| 180 |
+
|
| 181 |
| policy | mean | median | p25 | p75 | resolved_rate |
|
| 182 |
|---|---|---|---|---|---|
|
| 183 |
| random | 0.342 | 0.378 | 0.340 | 0.380 | 0/36 |
|
|
|
|
| 186 |
| heuristic (queries + correct hypothesis, no remediation) | 0.704 | 0.705 | 0.703 | 0.705 | 0/36 |
|
| 187 |
| scripted-optimal | 0.938 | 0.939 | 0.937 | 0.940 | 36/36 |
|
| 188 |
|
| 189 |
+

|
| 190 |
+
|
| 191 |
Honest reading:
|
| 192 |
- SFT lifted the mean reward about 11% (relative) above random (0.342 → 0.379). Format-learning worked: the trained model produces 100% schema-valid `action_type` JSON.
|
| 193 |
- GRPO did not move the mean further on K=2 / 40 steps / 7B. The advantage signal exists but the budget is too small to overcome the heuristic plateau at 0.704.
|
|
|
|
| 201 |
|
| 202 |
---
|
| 203 |
|
| 204 |
+
## Quickstart
|
| 205 |
+
|
| 206 |
+
### 5-minute local demo (no API keys, no server, no GPU)
|
| 207 |
+
|
| 208 |
+
```bash
|
| 209 |
+
pip install -e .
|
| 210 |
+
ollama pull llama3.2
|
| 211 |
+
python -m sre_gym.local triage worker_deploy_cascade
|
| 212 |
+
```
|
| 213 |
+
|
| 214 |
+
The CLI drives `UnifiedIncidentEnvironment` directly and prints per-tick reward, the 5-component score breakdown, and a final summary.
|
| 215 |
+
|
| 216 |
+
### Live HF Space (Triage tier, hosted)
|
| 217 |
+
|
| 218 |
+
Open https://huggingface.co/spaces/Madhav189/SystemTruth. Pick a tier, paste an HF token, click **▶ run eval**. Each tick streams the action, env response, reward delta, and the 5-component breakdown.
|
| 219 |
+
|
| 220 |
+
### Local server + Gradio UI
|
| 221 |
+
|
| 222 |
+
```bash
|
| 223 |
+
make install
|
| 224 |
+
make dev # FastAPI + Gradio on :7860
|
| 225 |
+
python -m sre_gym.strategy run cascading_release_train
|
| 226 |
+
python -m sre_gym.operations run ecommerce_vibecoded_saas --chaos rls_silent_leak
|
| 227 |
+
```
|
| 228 |
+
|
| 229 |
+
The FastAPI server speaks the OpenEnv contract (`/reset /step /state /tasks /baseline /grader /status /health /metadata /schema`) plus an MCP JSON-RPC route at `/mcp`.
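For readers who want to poke the routes listed above from Python, the following is a minimal sketch that builds (but does not send) JSON requests with the standard library. The payload shapes, the use of POST for `/reset` and `/step`, and the `query_logs` action name are assumptions rather than the documented schema; the `/schema` route should be treated as the source of truth.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # `make dev` serves FastAPI + Gradio on :7860


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request to one of the env routes."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed payload shapes, for illustration only.
reset_req = build_post("/reset", {"task_key": "worker_deploy_cascade"})
step_req = build_post("/step", {"action": {"action_type": "query_logs"}})
print(reset_req.get_method(), reset_req.full_url)

# To send a request once the server is running:
#   with urllib.request.urlopen(reset_req) as resp:
#       print(json.load(resp))
```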
|
| 230 |
+
|
| 231 |
+
---
|
| 232 |
+
|
| 233 |
## Two-paths agent design
|
| 234 |
|
| 235 |
The repo ships **two independent paths** to a working SRE agent. They share the env contract but trade compute for capability differently.
|
|
|
|
| 243 |
bash demo/run_demo.sh # end-to-end demo
|
| 244 |
```
|
| 245 |
|
|
|
|
|
|
|
| 246 |
### Path B — GRPO-trained adapter (one A100, ~2–3h on 7B)
|
| 247 |
|
| 248 |
The training pipeline above. Path A is what an agent ships *today*; Path B is what raises the floor on the templates it sees over and over.
|
| 249 |
|
| 250 |
---
|
| 251 |
|
| 252 |
## Tier-aware Python API
|
| 253 |
|
| 254 |
```python
|
|
|
|
| 278 |
|
| 279 |
```bash
|
| 280 |
make test # green at HEAD
|
| 281 |
+
ruff check .
|
| 282 |
openenv validate . # green
|
| 283 |
```
|
| 284 |
|
| 285 |
The two CI invariants that keep the rubric calibrated:
|
| 286 |
|
| 287 |
- `test_heuristic_ceiling_is_in_band` — naive heuristic in `[0.65, 0.80]` on every template.
|
| 288 |
+
- `test_round2_baseline_resolves` — scripted-optimal `≥ 0.90` on the round-2 templates.
|
|
|
|
| 289 |
|
| 290 |
---
|
| 291 |
|
| 292 |
## Materials
|
| 293 |
|
| 294 |
+
- [`BLOG.md`](BLOG.md) — the hackathon blog (with all 6 assets in `docs/blog/`)
|
| 295 |
+
- [`openenv.yaml`](openenv.yaml) — declares the three tiers, runnable kinds, scenario counts
|
| 296 |
+
- [`docs/`](docs/) — architecture, per-tier deep dives, reward design, scenario authoring
|
| 297 |
+
- [`docs/blog/`](docs/blog/) — visuals: lifecycle, architecture, hero, topology, rubric donut, chaos timeline, two-paths, baselines bar
|
| 298 |
+
- [`skill/`](skill/) — Claude Code skill packaging (Path A)
|
| 299 |
+
- [`coliseum/`](coliseum/) — parallel-rollout pool server
|
| 300 |
+
- [`demo/`](demo/) — `run_demo.sh` end-to-end demo, `pitch.md` narrative
|
| 301 |
+
- [`eval/`](eval/) — held-out split definition, results directory with the latest eval CSV + plots
|
| 302 |
+
- [`train/data/`](train/data/) — teacher trajectories (Claude Opus + Llama-3.3-70B + scripted baselines + 120-episode v2 corpus)
|
| 303 |
+
- [`notebooks/`](notebooks/) — Triage SFT→GRPO training (`01_triage_train_grpo_qwen25_7b.ipynb`), eval comparison (`02_triage_eval_compare_all.ipynb`), Strategy + Operations walkthroughs
|
| 304 |
|
| 305 |
---
|
| 306 |
|
|
@@ -84,9 +84,9 @@ TIER_DEFAULT_MODEL: dict[str, str] = {
|
|
| 84 |
|
| 85 |
|
| 86 |
TIER_DESCRIPTION: dict[str, str] = {
|
| 87 |
-
"basic": "escalates compute · 12 templates × 5 procgen variants · single bounded incident",
|
| 88 |
-
"advanced": "escalates horizon · chained incidents · persistent state across episodes",
|
| 89 |
-
"max": "escalates realism · 22-service ecommerce sim · 11 chaos patterns",
|
| 90 |
}
|
| 91 |
|
| 92 |
|
|
@@ -334,36 +334,41 @@ gradio-app::before {
|
|
| 334 |
}
|
| 335 |
.sg-panel-label::before { content: '▸'; color: var(--brand); }
|
| 336 |
|
| 337 |
-
/* ─── INPUTS — token / model / provider key ─────────
|
| 338 |
.sg-panel-col .form, .sg-panel-col .block { background: transparent !important; }
|
| 339 |
.sg-panel-col input,
|
| 340 |
.sg-panel-col textarea,
|
| 341 |
.sg-panel-col select {
|
| 342 |
-
background:
|
| 343 |
-
border: 1px solid var(--border) !important;
|
| 344 |
color: var(--text-primary) !important;
|
| 345 |
-
font-family: var(--mono) !important;
|
| 346 |
-
|
|
|
|
|
|
|
| 347 |
box-shadow: none !important;
|
|
|
|
| 348 |
}
|
| 349 |
.sg-panel-col input:focus,
|
| 350 |
.sg-panel-col textarea:focus,
|
| 351 |
.sg-panel-col select:focus {
|
| 352 |
-
border-color: var(--
|
|
|
|
|
|
|
| 353 |
}
|
| 354 |
.sg-panel-col input::placeholder, .sg-panel-col textarea::placeholder {
|
| 355 |
-
color: var(--text-
|
| 356 |
}
|
| 357 |
/* Field labels — Gradio renders <label><span>LABEL</span> ...</label> */
|
| 358 |
.sg-panel-col label > span:first-child,
|
| 359 |
.sg-panel-col .label-wrap > span,
|
| 360 |
.sg-panel-col .label-wrap span {
|
| 361 |
color: var(--text-secondary) !important;
|
| 362 |
-
font-size:
|
| 363 |
-
letter-spacing: 0.
|
| 364 |
text-transform: uppercase !important;
|
| 365 |
-
font-weight:
|
| 366 |
-
margin-bottom:
|
| 367 |
}
|
| 368 |
.sg-panel-col label { background: transparent !important; }
|
| 369 |
|
|
@@ -371,16 +376,17 @@ gradio-app::before {
|
|
| 371 |
.sg-panel-col .dropdown,
|
| 372 |
.sg-panel-col .wrap-inner,
|
| 373 |
.sg-panel-col .options {
|
| 374 |
-
background:
|
| 375 |
-
border: 1px solid var(--border) !important;
|
| 376 |
color: var(--text-primary) !important;
|
|
|
|
| 377 |
}
|
| 378 |
.sg-panel-col .dropdown ul li:hover,
|
| 379 |
.sg-panel-col .options li:hover {
|
| 380 |
background: var(--bg-input-hover) !important;
|
| 381 |
}
|
| 382 |
|
| 383 |
-
/* ─── TIER CARDS — 3 styled buttons ───
|
| 384 |
.sg-tier-list, .sg-tier-list .form, .sg-tier-list .gap {
|
| 385 |
display: flex !important; flex-direction: column !important; gap: 8px !important;
|
| 386 |
background: transparent !important;
|
|
@@ -389,14 +395,15 @@ gradio-app::before {
|
|
| 389 |
.sg-tier-card button {
|
| 390 |
display: block !important;
|
| 391 |
padding: 14px 16px !important;
|
| 392 |
-
background:
|
| 393 |
-
border: 1px solid var(--border) !important;
|
| 394 |
color: var(--text-secondary) !important;
|
| 395 |
font-family: var(--mono) !important; font-size: 11.5px !important;
|
| 396 |
font-weight: 400 !important;
|
| 397 |
text-align: left !important; cursor: pointer !important;
|
| 398 |
width: 100% !important; min-height: auto !important;
|
| 399 |
-
border-radius:
|
|
|
|
| 400 |
transition: all 0.15s ease !important;
|
| 401 |
white-space: pre-line !important;
|
| 402 |
line-height: 1.55 !important;
|
|
@@ -412,16 +419,16 @@ gradio-app::before {
|
|
| 412 |
line-height: 2 !important;
|
| 413 |
}
|
| 414 |
.sg-tier-card button:hover {
|
| 415 |
-
background:
|
| 416 |
-
border-color: var(--border-
|
| 417 |
}
|
| 418 |
.sg-tier-card-selected button {
|
| 419 |
-
background: rgba(
|
| 420 |
-
border-color: var(--
|
| 421 |
-
box-shadow: inset
|
| 422 |
}
|
| 423 |
.sg-tier-card-selected button::first-line {
|
| 424 |
-
color: var(--
|
| 425 |
}
|
| 426 |
|
| 427 |
/* ─── TERMINAL ────────────────────────────────────────────────────────── */
|
|
@@ -455,13 +462,15 @@ gradio-app::before {
|
|
| 455 |
.sg-chrome-status .em { color: var(--text-primary); font-weight: 500; }
|
| 456 |
.sg-chrome-meta { color: var(--text-dim); font-size: 11px; }
|
| 457 |
.sg-terminal-body {
|
| 458 |
-
padding:
|
| 459 |
-
font-size: 12.5px; line-height: 1.
|
| 460 |
white-space: pre; overflow-x: auto;
|
| 461 |
background: var(--bg-panel);
|
| 462 |
background-image: linear-gradient(transparent 50%, rgba(255, 255, 255, 0.012) 50%);
|
| 463 |
background-size: 100% 3px;
|
| 464 |
-
min-height:
|
|
|
|
|
|
|
| 465 |
color: var(--text-primary);
|
| 466 |
}
|
| 467 |
.sg-terminal-body .ts { color: var(--timestamp); }
|
|
@@ -480,50 +489,60 @@ gradio-app::before {
|
|
| 480 |
}
|
| 481 |
@keyframes sg-blink { 50% { opacity: 0; } }
|
| 482 |
|
| 483 |
-
/* ─── CONTROLS ROW ──
|
|
|
|
|
|
|
| 484 |
.sg-controls-row {
|
| 485 |
-
padding:
|
| 486 |
background: var(--bg-panel) !important;
|
| 487 |
border: 1px solid var(--border) !important;
|
| 488 |
margin-bottom: 16px !important;
|
| 489 |
-
|
| 490 |
}
|
| 491 |
-
.sg-btn-group { gap: 8px !important; flex-wrap: nowrap !important; }
|
| 492 |
.sg-btn-primary, .sg-btn-secondary {
|
| 493 |
flex: 0 0 auto !important; min-width: auto !important;
|
| 494 |
}
|
| 495 |
.sg-btn-primary button, .sg-btn-secondary button {
|
| 496 |
font-family: var(--mono) !important; font-size: 12px !important;
|
| 497 |
-
font-weight:
|
| 498 |
text-transform: uppercase !important;
|
| 499 |
-
padding:
|
|
|
|
| 500 |
box-shadow: none !important; min-height: auto !important;
|
| 501 |
cursor: pointer !important; transition: all 0.15s ease !important;
|
| 502 |
}
|
| 503 |
.sg-btn-primary button {
|
| 504 |
-
background: rgba(
|
| 505 |
-
border: 1px solid var(--
|
| 506 |
-
color: var(--
|
| 507 |
}
|
| 508 |
-
.sg-btn-primary button:hover { background: rgba(
|
| 509 |
.sg-btn-secondary button {
|
| 510 |
-
background:
|
| 511 |
border: 1px solid var(--border-strong) !important;
|
| 512 |
color: var(--text-primary) !important;
|
| 513 |
}
|
| 514 |
.sg-btn-secondary button:hover {
|
| 515 |
-
background:
|
| 516 |
border-color: var(--border-focus) !important;
|
| 517 |
}
|
| 518 |
|
| 519 |
-
/* ─── METRICS BAR ─────────────────────
|
| 520 |
-
|
| 521 |
.sg-metrics-host > div, .sg-metrics-host .prose { background: transparent !important; }
|
| 522 |
.sg-metrics {
|
| 523 |
display: flex !important; align-items: center !important;
|
| 524 |
gap: 24px !important; flex-wrap: wrap !important;
|
| 525 |
color: var(--text-secondary) !important; font-size: 11px !important;
|
| 526 |
-
padding:
|
| 527 |
}
|
| 528 |
.sg-metric {
|
| 529 |
display: flex !important; gap: 6px !important; align-items: center !important;
|
|
@@ -536,7 +555,7 @@ gradio-app::before {
|
|
| 536 |
color: var(--text-primary) !important; font-weight: 600 !important;
|
| 537 |
}
|
| 538 |
.sg-metric .value.r { color: var(--reward) !important; }
|
| 539 |
-
.sg-metric .value.s { color: var(--
|
| 540 |
.sg-rubric {
|
| 541 |
display: flex !important; align-items: center !important; gap: 14px !important;
|
| 542 |
padding-left: 18px !important; margin-left: 4px !important;
|
|
@@ -558,7 +577,7 @@ gradio-app::before {
|
|
| 558 |
height: 3px !important; background: var(--bg-input) !important;
|
| 559 |
overflow: hidden !important; margin-top: 2px !important;
|
| 560 |
}
|
| 561 |
-
.sg-rubric-bar > div { height: 100% !important; background: var(--
|
| 562 |
|
| 563 |
/* ─── TIER DESCRIPTION (under the cards) ──────────────────────────────── */
|
| 564 |
.sg-tier-desc, .sg-tier-desc * {
|
|
@@ -603,17 +622,18 @@ def _header_html() -> str:
|
|
| 603 |
return f"""
|
| 604 |
<header class="sg-header">
|
| 605 |
<div class="sg-brand-block">
|
| 606 |
-
<div class="sg-brand-mark">
|
| 607 |
<div class="sg-brand-tagline">
|
| 608 |
<em>tier-escalating SRE RL env</em> ·
|
| 609 |
-
|
| 610 |
</div>
|
| 611 |
</div>
|
| 612 |
<nav class="sg-nav">
|
| 613 |
<span class="sg-status-dot">env online</span>
|
| 614 |
<a href="/docs" target="_blank" rel="noopener">api docs</a>
|
| 615 |
<a href="/mcp/tools" target="_blank" rel="noopener">mcp tools</a>
|
| 616 |
-
<a href="/
|
|
|
|
| 617 |
</nav>
|
| 618 |
</header>
|
| 619 |
"""
|
|
@@ -651,11 +671,11 @@ FOOTER_HTML = """
|
|
| 651 |
<div>
|
| 652 |
built for the openenv hackathon · india apr '26
|
| 653 |
·
|
| 654 |
-
<a href="https://github.com/Madhav-GPT/
|
| 655 |
·
|
| 656 |
-
<a href="https://huggingface.co/spaces/Madhav189/
|
| 657 |
·
|
| 658 |
-
<a href="https://github.com/Madhav-GPT/
|
| 659 |
</div>
|
| 660 |
<div>multi-rubric reward · RLVE procgen · MCP dual-route</div>
|
| 661 |
</footer>
|
|
@@ -1231,28 +1251,29 @@ def build_app() -> gr.Blocks:
|
|
| 1231 |
# ── terminal pane ──────────────────────────────────────────
|
| 1232 |
terminal = gr.HTML(_initial_terminal_html(), elem_id="sg-terminal-host")
|
| 1233 |
|
| 1234 |
-
# ── controls + metrics
|
| 1235 |
-
|
| 1236 |
-
|
| 1237 |
-
|
| 1238 |
-
|
| 1239 |
-
|
| 1240 |
-
|
| 1241 |
-
|
| 1242 |
-
|
| 1243 |
-
|
| 1244 |
-
|
| 1245 |
-
|
| 1246 |
-
|
| 1247 |
-
|
| 1248 |
-
"↻ RESET",
|
| 1249 |
-
elem_classes=["sg-btn-secondary"],
|
| 1250 |
-
)
|
| 1251 |
-
with gr.Column(scale=1):
|
| 1252 |
-
metrics = gr.HTML(
|
| 1253 |
-
_metric_bar_html(),
|
| 1254 |
-
elem_classes=["sg-metrics-host"],
|
| 1255 |
)
|
| 1256 |
|
| 1257 |
gr.HTML(FOOTER_HTML)
|
| 1258 |
|
|
|
|
| 84 |
|
| 85 |
|
| 86 |
TIER_DESCRIPTION: dict[str, str] = {
|
| 87 |
+
"basic": "Triage tier · escalates compute · 12 templates × 5 procgen variants · single bounded incident",
|
| 88 |
+
"advanced": "Strategy tier · escalates horizon · chained incidents · persistent state across episodes",
|
| 89 |
+
"max": "Operations tier · escalates realism · 22-service ecommerce sim · 11 chaos patterns",
|
| 90 |
}
|
| 91 |
|
| 92 |
|
|
|
|
| 334 |
}
|
| 335 |
.sg-panel-label::before { content: '▸'; color: var(--brand); }
|
| 336 |
|
| 337 |
+
/* ─── INPUTS — token / model / provider key (LIGHTER + BIGGER) ───────── */
|
| 338 |
.sg-panel-col .form, .sg-panel-col .block { background: transparent !important; }
|
| 339 |
.sg-panel-col input,
|
| 340 |
.sg-panel-col textarea,
|
| 341 |
.sg-panel-col select {
|
| 342 |
+
background: #1f2630 !important; /* lighter than the panel */
|
| 343 |
+
border: 1px solid var(--border-strong) !important;
|
| 344 |
color: var(--text-primary) !important;
|
| 345 |
+
font-family: var(--mono) !important;
|
| 346 |
+
font-size: 13px !important; /* was 12 */
|
| 347 |
+
padding: 12px 14px !important; /* was 8/10 */
|
| 348 |
+
border-radius: 4px !important; /* was 0 — softer, more usable */
|
| 349 |
box-shadow: none !important;
|
| 350 |
+
min-height: 42px !important; /* taller for usability */
|
| 351 |
}
|
| 352 |
.sg-panel-col input:focus,
|
| 353 |
.sg-panel-col textarea:focus,
|
| 354 |
.sg-panel-col select:focus {
|
| 355 |
+
border-color: var(--brand) !important; /* phosphor accent on focus */
|
| 356 |
+
outline: none !important;
|
| 357 |
+
box-shadow: 0 0 0 1px rgba(126, 231, 135, 0.25) !important;
|
| 358 |
}
|
| 359 |
.sg-panel-col input::placeholder, .sg-panel-col textarea::placeholder {
|
| 360 |
+
color: var(--text-dim) !important; /* was --text-faint */
|
| 361 |
}
|
| 362 |
/* Field labels — Gradio renders <label><span>LABEL</span> ...</label> */
|
| 363 |
.sg-panel-col label > span:first-child,
|
| 364 |
.sg-panel-col .label-wrap > span,
|
| 365 |
.sg-panel-col .label-wrap span {
|
| 366 |
color: var(--text-secondary) !important;
|
| 367 |
+
font-size: 11px !important;
|
| 368 |
+
letter-spacing: 0.14em !important;
|
| 369 |
text-transform: uppercase !important;
|
| 370 |
+
font-weight: 600 !important;
|
| 371 |
+
margin-bottom: 6px !important;
|
| 372 |
}
|
| 373 |
.sg-panel-col label { background: transparent !important; }
|
| 374 |
|
|
|
|
| 376 |
.sg-panel-col .dropdown,
|
| 377 |
.sg-panel-col .wrap-inner,
|
| 378 |
.sg-panel-col .options {
|
| 379 |
+
background: #1f2630 !important;
|
| 380 |
+
border: 1px solid var(--border-strong) !important;
|
| 381 |
color: var(--text-primary) !important;
|
| 382 |
+
border-radius: 4px !important;
|
| 383 |
}
|
| 384 |
.sg-panel-col .dropdown ul li:hover,
|
| 385 |
.sg-panel-col .options li:hover {
|
| 386 |
background: var(--bg-input-hover) !important;
|
| 387 |
}
|
| 388 |
|
| 389 |
+
/* ─── TIER CARDS — 3 styled buttons (theme-cohesive phosphor accent) ─── */
|
| 390 |
.sg-tier-list, .sg-tier-list .form, .sg-tier-list .gap {
|
| 391 |
display: flex !important; flex-direction: column !important; gap: 8px !important;
|
| 392 |
background: transparent !important;
|
|
|
|
| 395 |
.sg-tier-card button {
|
| 396 |
display: block !important;
|
| 397 |
padding: 14px 16px !important;
|
| 398 |
+
background: #1f2630 !important; /* match input bg */
|
| 399 |
+
border: 1px solid var(--border-strong) !important;
|
| 400 |
color: var(--text-secondary) !important;
|
| 401 |
font-family: var(--mono) !important; font-size: 11.5px !important;
|
| 402 |
font-weight: 400 !important;
|
| 403 |
text-align: left !important; cursor: pointer !important;
|
| 404 |
width: 100% !important; min-height: auto !important;
|
| 405 |
+
border-radius: 4px !important;
|
| 406 |
+
box-shadow: none !important;
|
| 407 |
transition: all 0.15s ease !important;
|
| 408 |
white-space: pre-line !important;
|
| 409 |
line-height: 1.55 !important;
|
|
|
|
| 419 |
line-height: 2 !important;
|
| 420 |
}
|
| 421 |
.sg-tier-card button:hover {
|
| 422 |
+
background: #252d38 !important;
|
| 423 |
+
border-color: var(--border-focus) !important;
|
| 424 |
}
|
| 425 |
.sg-tier-card-selected button {
|
| 426 |
+
background: rgba(126, 231, 135, 0.06) !important; /* phosphor wash */
|
| 427 |
+
border-color: var(--brand) !important;
|
| 428 |
+
box-shadow: inset 3px 0 0 var(--brand) !important;
|
| 429 |
}
|
| 430 |
.sg-tier-card-selected button::first-line {
|
| 431 |
+
color: var(--brand) !important; /* matches header brand */
|
| 432 |
}
|
| 433 |
|
| 434 |
/* ─── TERMINAL ────────────────────────────────────────────────────────── */
|
|
|
|
| 462 |
.sg-chrome-status .em { color: var(--text-primary); font-weight: 500; }
|
| 463 |
.sg-chrome-meta { color: var(--text-dim); font-size: 11px; }
|
| 464 |
.sg-terminal-body {
|
| 465 |
+
padding: 16px 20px 18px;
|
| 466 |
+
font-size: 12.5px; line-height: 1.65;
|
| 467 |
white-space: pre; overflow-x: auto;
|
| 468 |
background: var(--bg-panel);
|
| 469 |
background-image: linear-gradient(transparent 50%, rgba(255, 255, 255, 0.012) 50%);
|
| 470 |
background-size: 100% 3px;
|
| 471 |
+
min-height: 280px; /* was 480 — visible above the fold */
|
| 472 |
+
max-height: 56vh; /* still scrolls if a long run */
|
| 473 |
+
overflow-y: auto;
|
| 474 |
color: var(--text-primary);
|
| 475 |
}
|
| 476 |
.sg-terminal-body .ts { color: var(--timestamp); }
|
|
|
|
| 489 |
}
|
| 490 |
@keyframes sg-blink { 50% { opacity: 0; } }
|
| 491 |
|
| 492 |
+
/* ─── CONTROLS ROW — stacks vertically: buttons on top, metrics below ── */
|
| 493 |
+
/* Now a gr.Column wrapped with this class — Gradio gives us flex-direction:
|
| 494 |
+
column for free, but we still pin it for browsers that style differently. */
|
| 495 |
.sg-controls-row {
|
| 496 |
+
padding: 16px 18px !important;
|
| 497 |
background: var(--bg-panel) !important;
|
| 498 |
border: 1px solid var(--border) !important;
|
| 499 |
margin-bottom: 16px !important;
|
| 500 |
+
display: flex !important;
|
| 501 |
+
flex-direction: column !important;
|
| 502 |
+
gap: 14px !important;
|
| 503 |
+
align-items: stretch !important;
|
| 504 |
+
}
|
| 505 |
+
.sg-btn-group {
|
| 506 |
+
gap: 10px !important;
|
| 507 |
+
flex-wrap: wrap !important; /* on narrow screens buttons wrap rather than overflow */
|
| 508 |
+
justify-content: flex-start !important;
|
| 509 |
}
|
|
|
|
| 510 |
.sg-btn-primary, .sg-btn-secondary {
|
| 511 |
flex: 0 0 auto !important; min-width: auto !important;
|
| 512 |
}
|
| 513 |
.sg-btn-primary button, .sg-btn-secondary button {
|
| 514 |
font-family: var(--mono) !important; font-size: 12px !important;
|
| 515 |
+
font-weight: 700 !important; letter-spacing: 0.08em !important;
|
| 516 |
text-transform: uppercase !important;
|
| 517 |
+
padding: 11px 22px !important; /* a touch bigger so it stands alone on its row */
|
| 518 |
+
border-radius: 4px !important;
|
| 519 |
box-shadow: none !important; min-height: auto !important;
|
| 520 |
cursor: pointer !important; transition: all 0.15s ease !important;
|
| 521 |
}
|
| 522 |
.sg-btn-primary button {
|
| 523 |
+
background: rgba(126, 231, 135, 0.10) !important;
|
| 524 |
+
border: 1px solid var(--brand) !important;
|
| 525 |
+
color: var(--brand) !important;
|
| 526 |
}
|
| 527 |
+
.sg-btn-primary button:hover { background: rgba(126, 231, 135, 0.18) !important; }
|
| 528 |
.sg-btn-secondary button {
|
| 529 |
+
background: #1f2630 !important;
|
| 530 |
border: 1px solid var(--border-strong) !important;
|
| 531 |
color: var(--text-primary) !important;
|
| 532 |
}
|
| 533 |
.sg-btn-secondary button:hover {
|
| 534 |
+
background: #252d38 !important;
|
| 535 |
border-color: var(--border-focus) !important;
|
| 536 |
}
|
| 537 |
|
| 538 |
+
/* ─── METRICS BAR (now sits under the run buttons) ───────────────────── */
|
| 539 |
+
.sg-metrics-host { padding-top: 8px !important; border-top: 1px solid var(--border) !important; }
|
| 540 |
.sg-metrics-host > div, .sg-metrics-host .prose { background: transparent !important; }
|
| 541 |
.sg-metrics {
|
| 542 |
display: flex !important; align-items: center !important;
|
| 543 |
gap: 24px !important; flex-wrap: wrap !important;
|
| 544 |
color: var(--text-secondary) !important; font-size: 11px !important;
|
| 545 |
+
padding: 6px 0 0 !important;
|
| 546 |
}
|
| 547 |
.sg-metric {
|
| 548 |
display: flex !important; gap: 6px !important; align-items: center !important;
|
|
|
|
| 555 |
color: var(--text-primary) !important; font-weight: 600 !important;
|
| 556 |
}
|
| 557 |
.sg-metric .value.r { color: var(--reward) !important; }
|
| 558 |
+
.sg-metric .value.s { color: var(--brand) !important; } /* phosphor — theme cohesion */
|
| 559 |
.sg-rubric {
|
| 560 |
display: flex !important; align-items: center !important; gap: 14px !important;
|
| 561 |
padding-left: 18px !important; margin-left: 4px !important;
|
|
|
|
| 577 |
height: 3px !important; background: var(--bg-input) !important;
|
| 578 |
overflow: hidden !important; margin-top: 2px !important;
|
| 579 |
}
|
| 580 |
+
.sg-rubric-bar > div { height: 100% !important; background: var(--brand) !important; }
|
| 581 |
|
| 582 |
/* ─── TIER DESCRIPTION (under the cards) ──────────────────────────────── */
|
| 583 |
.sg-tier-desc, .sg-tier-desc * {
|
|
|
|
| 622 |
return f"""
|
| 623 |
<header class="sg-header">
|
| 624 |
<div class="sg-brand-block">
|
| 625 |
+
<div class="sg-brand-mark">SystemTruth<span>//</span></div>
|
| 626 |
<div class="sg-brand-tagline">
|
| 627 |
<em>tier-escalating SRE RL env</em> ·
|
| 628 |
+
Triage / Strategy / Operations · {THEME_TAGLINE}
|
| 629 |
</div>
|
| 630 |
</div>
|
| 631 |
<nav class="sg-nav">
|
| 632 |
<span class="sg-status-dot">env online</span>
|
| 633 |
<a href="/docs" target="_blank" rel="noopener">api docs</a>
|
| 634 |
<a href="/mcp/tools" target="_blank" rel="noopener">mcp tools</a>
|
| 635 |
+
<a href="https://github.com/Madhav-GPT/SystemTruth" target="_blank" rel="noopener">github</a>
|
| 636 |
+
<a href="https://github.com/Madhav-GPT/SystemTruth/blob/main/BLOG.md" target="_blank" rel="noopener">blog</a>
|
| 637 |
</nav>
|
| 638 |
</header>
|
| 639 |
"""
|
|
|
|
| 671 |
<div>
|
| 672 |
built for the openenv hackathon · india apr '26
|
| 673 |
·
|
| 674 |
+
<a href="https://github.com/Madhav-GPT/SystemTruth" target="_blank">github</a>
|
| 675 |
·
|
| 676 |
+
<a href="https://huggingface.co/spaces/Madhav189/SystemTruth" target="_blank">hf space</a>
|
| 677 |
·
|
| 678 |
+
<a href="https://github.com/Madhav-GPT/SystemTruth/blob/main/BLOG.md" target="_blank">blog</a>
|
| 679 |
</div>
|
| 680 |
<div>multi-rubric reward · RLVE procgen · MCP dual-route</div>
|
| 681 |
</footer>
|
|
|
|
| 1251 |
# ── terminal pane ──────────────────────────────────────────
|
| 1252 |
terminal = gr.HTML(_initial_terminal_html(), elem_id="sg-terminal-host")
|
| 1253 |
|
| 1254 |
+
# ── controls + metrics — stacked vertically (buttons on top, ──
|
| 1255 |
+
# metrics below). Using a single Column with two children means
|
| 1256 |
+
# the metrics bar gets the full width on its own row instead of
|
| 1257 |
+
# fighting the buttons for horizontal space.
|
| 1258 |
+
with gr.Column(elem_classes=["sg-controls-row"]):
|
| 1259 |
+
with gr.Row(elem_classes=["sg-btn-group"]):
|
| 1260 |
+
run_btn = gr.Button(
|
| 1261 |
+
"▶ RUN EVAL",
|
| 1262 |
+
variant="primary",
|
| 1263 |
+
elem_classes=["sg-btn-primary"],
|
| 1264 |
+
)
|
| 1265 |
+
stop_btn = gr.Button(
|
| 1266 |
+
"■ STOP",
|
| 1267 |
+
elem_classes=["sg-btn-secondary"],
|
| 1268 |
)
|
| 1269 |
+
reset_btn = gr.Button(
|
| 1270 |
+
"↻ RESET",
|
| 1271 |
+
elem_classes=["sg-btn-secondary"],
|
| 1272 |
+
)
|
| 1273 |
+
metrics = gr.HTML(
|
| 1274 |
+
_metric_bar_html(),
|
| 1275 |
+
elem_classes=["sg-metrics-host"],
|
| 1276 |
+
)
|
| 1277 |
|
| 1278 |
gr.HTML(FOOTER_HTML)
|
| 1279 |
|
|
@@ -179,7 +179,7 @@ End-to-end: ~2-3h on a single A100 80GB, ~$5–8 of HF compute credits.
|
|
| 179 |
- ✅ 12 templates × 5 procgen variants = 72 scenarios live
|
| 180 |
- ✅ Pytest suite green
|
| 181 |
- ✅ `openenv validate .` green
|
| 182 |
-
- ✅ HF Space deployed at `Madhav189-
|
| 183 |
- ✅ Training notebook runs end-to-end on A100 80GB
|
| 184 |
- ✅ Eval comparison cell produces hero bar + per-template chart
|
| 185 |
- ✅ Trained Qwen2.5-7B adapter saved to `outputs/qwen25_7b_grpo_final/`
|
|
|
|
| 179 |
- ✅ 12 templates × 5 procgen variants = 72 scenarios live
|
| 180 |
- ✅ Pytest suite green
|
| 181 |
- ✅ `openenv validate .` green
|
| 182 |
+
- ✅ HF Space deployed at `Madhav189-SystemTruth.hf.space`
|
| 183 |
- ✅ Training notebook runs end-to-end on A100 80GB
|
| 184 |
- ✅ Eval comparison cell produces hero bar + per-template chart
|
| 185 |
- ✅ Trained Qwen2.5-7B adapter saved to `outputs/qwen25_7b_grpo_final/`
|
|
Git LFS Details
|
|
@@ -67,7 +67,7 @@ The honest framing: **the env is the project, the rubric is the engineering crow
|
|
| 67 |
## 2. Local setup
|
| 68 |
|
| 69 |
```bash
|
| 70 |
-
git clone https://github.com/Madhav-GPT/
|
| 71 |
cd sre-env
|
| 72 |
|
| 73 |
python3 -m venv .venv
|
|
@@ -280,7 +280,7 @@ app_port: 7860
|
|
| 280 |
|
| 281 |
```bash
|
| 282 |
# One-time: add the HF Space as a git remote
|
| 283 |
-
git remote add hf https://huggingface.co/spaces/Madhav189/
|
| 284 |
|
| 285 |
# Push (HF prompts for token if not cached)
|
| 286 |
git push hf main
|
|
@@ -340,8 +340,8 @@ bash demo/run_demo.sh
|
|
| 340 |
|
| 341 |
## 12. Submission checklist
|
| 342 |
|
| 343 |
-
- [x] Repo public on GitHub: https://github.com/Madhav-GPT/
|
| 344 |
-
- [x] HF Space live: https://huggingface.co/spaces/Madhav189/
|
| 345 |
- [x] BLOG.md at repo root
|
| 346 |
- [x] 6 blog assets in `docs/blog/`
|
| 347 |
- [x] Training notebook executed end-to-end, results in `eval/results/`
|
|
|
|
| 67 |
## 2. Local setup
|
| 68 |
|
| 69 |
```bash
|
| 70 |
+
git clone https://github.com/Madhav-GPT/SystemTruth.git
|
| 71 |
cd SystemTruth  # the clone directory follows the renamed repo
|
| 72 |
|
| 73 |
python3 -m venv .venv
|
|
|
|
| 280 |
|
| 281 |
```bash
|
| 282 |
# One-time: add the HF Space as a git remote
|
| 283 |
+
git remote add hf https://huggingface.co/spaces/Madhav189/SystemTruth
|
| 284 |
|
| 285 |
# Push (HF prompts for token if not cached)
|
| 286 |
git push hf main
|
|
|
|
| 340 |
|
| 341 |
## 12. Submission checklist
|
| 342 |
|
| 343 |
+
- [x] Repo public on GitHub: https://github.com/Madhav-GPT/SystemTruth
|
| 344 |
+
- [x] HF Space live: https://huggingface.co/spaces/Madhav189/SystemTruth
|
| 345 |
- [x] BLOG.md at repo root
|
| 346 |
- [x] 6 blog assets in `docs/blog/`
|
| 347 |
- [x] Training notebook executed end-to-end, results in `eval/results/`
|
|
@@ -116,7 +116,7 @@ training:
|
|
| 116 |
templates_covered: 12 # all 12 Triage templates have ≥5 episodes
|
| 117 |
|
| 118 |
huggingface:
|
| 119 |
-
space_id: Madhav189/
|
| 120 |
-
github_repo: Madhav-GPT/
|
| 121 |
sdk: docker
|
| 122 |
hardware: cpu-basic
|
|
|
|
| 116 |
templates_covered: 12 # all 12 Triage templates have ≥5 episodes
|
| 117 |
|
| 118 |
huggingface:
|
| 119 |
+
space_id: Madhav189/SystemTruth # canonical HF Space
|
| 120 |
+
github_repo: Madhav-GPT/SystemTruth # canonical GitHub
|
| 121 |
sdk: docker
|
| 122 |
hardware: cpu-basic
|