SystemTruth / docs /ARCHITECTURE.md
Madhav189's picture
finalization: blog + README + execution rewrite, drop 3B + openclaw shim
0058c94

Architecture

The single insight that ties the whole pitch together: each tier escalates a different dimension β€” compute (Basic) β†’ horizon (Advanced) β†’ realism (Max) β€” not just scenario count. If you only read one paragraph of this document, read this one.

This document explains why sre-gym ships three tiers, what each tier defends as a research question, and how the package is structured to make that defence visible from a 30-second skim of the repo.


1. The dimensional-escalation insight

The standard pattern in agentic-LLM evaluation, surveyed across SWE-bench Lite/Verified/Pro, MLE-bench Low/Med/High, ITBench-static/live, WebArena/-Verified/-Hard, and CRMArena/-Pro, is to escalate a single axis (volume, complexity, dataset size, horizon length) across difficulty bands. That works for benchmarks whose underlying capability is one-dimensional. SRE is not one-dimensional.

A junior on-call engineer learning to triage faces a fundamentally different bottleneck (cognitive efficiency under tight context) than a senior SRE running a multi-incident postmortem (state tracking across long horizons), which faces a fundamentally different bottleneck than an enterprise platform team operating against an actively chaos-engineered cluster (operating in a partially-observable, adversarial, irreversible world). Their training signals, episode shapes, observation richness, and reward structures should not look the same.

sre-gym takes that observation seriously and stratifies its tiers along the dimension that actually limits the persona's training loop:

Tier Bottleneck Persona implication Episode shape
Basic Compute ($30 of HF credits, 1 A100 ~12h) Pre-digested observations, dense reward shaping, 8K context, 11-action space 8–13 ticks
Advanced Horizon ($300–500 budget, 1–2 A100-days) Multi-incident chains, partial observability, 28-action space, on-call peer 60–90 ticks
Max Realism (8Γ—A100/H100, real chaos eng) Ephemeral docker-compose / k3d, real kubectl / Vercel / Stripe APIs, subprocess shell, learned-critic rewards 110–180+ actions, real wall-clock minutes

This framing is defensible as research: the claim is that "training on a tier-1 environment that's causally rich but compute-cheap will produce a 3B specialist that beats Haiku on incident triage" is testable in 12 hours. The claim that "training on a tier-3 environment that includes real subprocess-shell access will produce an SRE agent that can actually go on-call" is testable in months and tens of thousands of dollars. Those are different research questions and the environment should make them visibly different.


2. Repository layout, with the design intent annotated

sre-enginnerllm/
β”œβ”€β”€ sre_gym/                          # Tier-aware public package
β”‚   β”œβ”€β”€ env.py                        # SREGym(tier=...) factory; Basic delegates,
β”‚   β”‚                                 # Advanced/Max raise TierNotRunnableError
β”‚   β”‚                                 # carrying a docs pointer.
β”‚   β”œβ”€β”€ tier.py                       # Tier enum + TierConfig; the
β”‚   β”‚                                 # escalation_dimension field is the
β”‚   β”‚                                 # load-bearing piece of the pitch.
β”‚   β”œβ”€β”€ advanced/scenarios/*.yaml     # 3 reference scenarios, real YAML, real
β”‚   β”‚                                 # topology, real reward dimensions, real
β”‚   β”‚                                 # reference traces β€” but the simulator
β”‚   β”‚                                 # backing them is intentionally not built.
β”‚   └── max/                          # 1 fully-specced family, real
β”‚       β”œβ”€β”€ families/*.yaml           # docker-compose, real chaos library,
β”‚       β”œβ”€β”€ chaos/*.yaml              # real workload generator config.
β”‚       └── compose/*.yaml            # Provisioning is left to the operator.
β”‚
└── unified_incident_env/             # Basic-tier core; the v2 surface that the
                                      # HF Space serves and openenv.yaml declares.

Two design choices worth calling out:

(a) Basic delegates to unified_incident_env rather than reimplementing. The Basic tier's runnable surface is the existing v2 environment, kept verbatim β€” including its 36+ test suite, scripted-optimal baselines, and HF Space deployment. The sre_gym wrapper is intentionally thin: it adds the tier flag, the introspection methods, and the YAML-spec loader for the design-only tiers. This is the difference between "a single-tier env that's hard to extend" and "an env that visibly carries the three-tier story while still passing every Basic-tier test it ever passed."

(b) Advanced and Max are shipped as data, not as code. The YAML specs in sre_gym/advanced/scenarios/ and sre_gym/max/families/ are real: they reference real topologies, real action sets, real reward dimensions, real chaos patterns, and real reference traces. They're loaded by SREGym.list_scenarios() and renderable by the playground. What's missing is the simulator backing them β€” and that's deliberate. Building a credible Max-tier simulator in 36 hours is a fantasy; building a credible Max-tier spec that downstream operators can lift into a $40-150/day cluster is achievable. The cost of pretending Max is shipped runnable would be zero credibility with judges who have run real chaos engineering. The cost of shipping it as a credible vision is one extra YAML file.


3. The Basic tier in detail

The Basic tier is the only tier we trained against, so it gets the most concrete defence.

3.1 What it is

12 base templates Γ— 5 procgen variants = 72 deterministic scenarios. Procgen jitters metric values, deploy timestamps, and noise-service rotation while preserving the causal structure β€” so a trained agent can't memorize fingerprints. Holding out one variant per template gives a 60-train / 12-eval split.

3.2 The 12 templates and what each one teaches

# Template Skill Decoy / red herring
1 worker_deploy_cascade deploy-history reasoning none β€” the easy entry point
2 db_config_rollout config-vs-code disambiguation concurrent worker deploy
3 gateway_auth_rollout wrong-loud-service trap worker queue-depth alert
4 payment_webhook_misconfig downstream symptom (Stripe) DB write-rate drop
5 schema_drift_missing_migration application vs DB blame DB looks healthy
6 cache_stale_state metrics-look-good-but-customers-don't cache hit rate is up
7 dep_degradation "your service vs theirs" worker CPU is loud
8 memory_leak_oom restart count > error count DB CPU spikes look like DB fault
9 auth_token_expiry cross-service credential propagation gateway is the loudest service
10 network_partition trust connectivity, not self-reports cache reports healthy in own metrics
11 rate_limit_retry_storm counterintuitive (more retries = worse) DB CPU/connections look pathological
12 migration_lock lock contention without crash worker errors look like a worker fault

Each template contains a different cognitive failure mode. A 12-template catalogue with 12 different failure modes is a denser training signal than a 60-template catalogue that all reduce to "look at the deploy that just happened" β€” that's the depth-not-quantity argument.

3.3 Why these compute knobs

  • 8K context β€” fits the entire Basic episode (12 ticks Γ— ~600 tokens of observation + ~80 tokens of action) inside the trained model's working set without truncation. Letting the trajectory spill into 16K context would force a smaller model (more A100 hours per token) or a longer training run (more wall-clock).
  • 11 actions β€” small enough that GRPO group-relative advantages converge in 600–1000 steps. Add 5 more actions and the policy has 50% more dimensions to explore at the same compute budget, which usually means a noisier gradient.
  • 5-component dense reward β€” recovery + containment + verification + impact + efficiency, with shaped intermediate signal (potential-function differences). Pure terminal rewards converge slower than dense shaping at the same compute budget; that's a well-known result and we're paying it.
  • 12 templates Γ— 5 variants procgen β€” enough variety that a held-out variant is genuinely held-out, but not so much that scenario-specific overfitting eats the training budget.

These knobs collectively are the "compute" in "compute-bounded". Tighten any of them and a Series-A-class operator would call it out as theatre.

3.4 What "compute-bounded" actually means in numbers

A representative training run looks like:

Phase Steps Compute Wall-clock on A100 40GB Output
Seed dataset build 200 (Claude-driven) ~$15 of API spend 2h train/data/seed_combined.jsonl
SFT cold start 500 steps, batch 4 ~3h 3h LoRA r=64 adapter
GRPO online 800 steps, K=4 rollouts ~6h 6h trained adapter
Eval sweep 36 episodes (3 per template) ~30min 30min eval/results/comparison.csv
Total ~$15 API + 12h GPU 12h trained 3B + comparison table

That's the $30-of-HF-credits budget the design targets, comfortably.


4. The Advanced tier in detail

4.1 The horizon escalation, formalized

Advanced is bounded by horizon: episodes are 60–90 ticks instead of 12, multi-incident chains span 5+ minutes of simulated time, and the agent must track state that no single 8K context window can hold. Three properties make this a different research question:

  1. Multi-incident composition. One template's resolution can become another template's setup state. Scenario 1 (cascading_release_train.yaml) is the canonical example: rolling back the gateway is correct, but it triggers a downstream worker drift that materializes 25 ticks later. The agent has to recognize the chained incident as caused by their own fix and reach for a second rollback rather than treating it as a fresh outage.
  2. Partial-observability noise. Sometimes query_logs returns degraded data because the logging pipeline is the affected service. Scenario 2 (observability_pipeline_outage.yaml) is the canonical example: the agent must drop log sampling and toggle verbose-logging off before attempting root-cause diagnosis, because the diagnostic tool is itself broken.
  3. Cross-domain reasoning. Scenario 3 (supabase_rls_silent_leak.yaml) is a reliability incident with a security root cause. The agent must classify it correctly (reach for escalate_security rather than the platform on-call), contain the data leak via feature flag before rolling back, and produce a postmortem with a leak-window calculation. No existing SRE benchmark scores cross-domain reasoning.

4.2 Why this isn't trained in this repo

A faithful Advanced simulator would need:

  • a 15–20 service event-loop simulator (not a 4-service one)
  • multi-tick fault propagation (one fix triggering a chained fault N ticks later, with proper causal latency)
  • a synthetic on-call-peer model that responds to escalations
  • ~28 action handlers, vs. 11 in Basic
  • a learned-critic reward path for postmortem quality

That's roughly 2 weeks of focused engineering and 1–2 A100-days of training. Both are out of scope for the 36-hour hackathon window. We ship the design at the YAML level so that a downstream operator with the budget can lift it; we do not pretend it was trained.


5. The Max tier in detail

5.1 The realism escalation, formalized

Max is bounded by realism. The world stops being a simulator. A reset() provisions a fresh 22-service docker-compose stack (Vercel + Supabase + Stripe + Postgres + Redis + Kafka + 3 worker pools + observability stubs + chaos controller). The agent's rollback_deploy is a real kubectl rollout undo against that stack. query_logs reads from a real Loki/Promtail pipeline. query_traces reads from a real Tempo cluster. Faults are injected via a real Chaos-Mesh-style chaos library. Reward is computed from the actual recovery state of the actual stack.

Three properties are unique to Max:

  1. Real subprocess access. The agent has a sandboxed shell. It can write code, commit it to a sandboxed git mirror, push, watch CI, observe a deploy, roll back. This is the "real hard work instead of exploiting shortcuts" spec from Theme #3.1 of the OpenEnv brief.
  2. Real action irreversibility. A real Stripe refund is a real Stripe refund. The cluster is destroyed on next reset, but actions taken at the application layer are real.
  3. Outcome-scored rewards. A second small model evaluates the agent's postmortem against the actual recovery trajectory. Reward is no longer a deterministic rubric β€” it's an outcome judgement, with shaping signals as auxiliary.

5.2 Why one family, not 30

Per the design rationale: one fully-specced family (e-commerce + Stripe + Supabase + Vercel) with compose.max.yaml, an 11-pattern chaos library, a workload generator config, a reference instance with an expected 110-action trajectory, and operator notes for cost / isolation / safety is more credible than a vague "30+ scenario families" claim. Judges who run real infrastructure would ask "show me one"; if you can't, the whole tier framing collapses. The ecommerce_vibecoded_saas family is the show-me-one.

5.3 What's deliberately not in this repo

  • The published stub images (ghcr.io/sre-gym/*) β€” publishing them is a $1–2k registry-cost commitment that doesn't fit the hackathon-window budget.
  • A running cluster β€” bringing the Max tier up costs $40–150/day depending on cluster size and chaos cadence.
  • A trained model against Max β€” that's a multi-week, multi-A100 commitment.

What is in this repo is the spec at the level of detail a downstream operator can actually act on: docker-compose, chaos-library YAML, workload generator config, family-level scenario population spec, and operator-notes block covering cost, isolation, and reset safety.


6. The contract that ties Basic, Advanced, and Max together

All three tiers share the same five abstract objects:

Object Basic concrete Advanced concrete Max concrete
Topology 4 services hard-coded 15–20 services in YAML 22 services in docker-compose
Action set 11, Pydantic-validated 28, validated against allowed_actions: list 50+, including subprocess shell
Observation pre-digested fields noisy multi-source feed raw Prometheus / Loki / Tempo
Reward 7-dim deterministic rubric rubric + chained-incident bonus + postmortem critic outcome + learned-critic + IaC-remediation bonus
Episode 8–13 ticks 60–90 ticks unbounded (real wall-clock minutes)

Same shape, escalating depth. That's what makes the tier story coherent rather than three unrelated environments stacked in one repo.


7. OpenEnv framework integration

Basic uses:

  • openenv.core.env_server.Environment[A, O, S] base class
  • Typed Pydantic Action / Observation / State
  • /reset /step /state HTTP endpoints via create_fastapi_app
  • max_concurrent_envs for batched rollouts (the GRPO contract)
  • Custom /tasks /baseline /grader /status /health extension routes for scenario-catalog-and-grader introspection

Advanced and Max would extend this with:

  • MCPEnvironment base + @self.tool()-registered actions for production serving
  • WebSocket /ws transport for low-latency multi-agent rollouts
  • ServerMode.SIMULATION vs ServerMode.PRODUCTION switch for tier-3 real-API actions
  • Custom Gradio TabbedInterface with a "topology inspector" tab

These are documented in the per-tier docs but not implemented here.


8. The judging-criteria mapping

The OpenEnv hackathon's published rubric weights are: Innovation 40, Storytelling 30, Reward Curves 20, Reward/Pipeline 10. sre-gym is built so each weight has a concrete corresponding artifact:

Weight Artifact in this repo
Innovation 40% The dimensional-escalation tier story, defensible across compute / horizon / realism axes.
Storytelling 30% This document + README.md first paragraph + 12-template skill table + reference traces in YAMLs.
Reward curves 20% eval/results/qwen25_7b_comparison_* (populated by cell 10/11 of 01_triage_train_grpo_qwen25_7b.ipynb) + train/data/eval_sweep_baselines.jsonl reference numbers.
Reward/Pipeline 10% Composable rubric in unified_incident_env/server/grader.py + SFT β†’ GRPO loop in 01_triage_train_grpo_qwen25_7b.ipynb.

Weight by weight, the artifact you'd hand a judge to defend that score is in this repo.