Spaces:

Madhav189
/

SystemTruth

Running

App Files Files Community

SystemTruth / openenv.yaml

Madhav189

SystemTruth rebrand: bigger UI, new diagrams, theme-cohesive HF Space

6583a07 about 1 month ago

raw

history blame contribute delete

5.46 kB

	name: sre-engineer-llm
	version: 3.1.0
	description: >
	Tier-escalating SRE training environment. Three tiers escalate three
	different dimensions: Triage (compute), Strategy (horizon), Operations
	(realism).

	Triage ships 12 templates x 6 entries each (1 base + 5 procgen variants) =
	72 deterministic scenarios over a 4-service topology. The 5-component
	rubric (outcome 0.45 + action_validity 0.20 + format 0.10 + anticheat
	0.15 + efficiency 0.10) sums to exactly 1.0 and pins a heuristic ceiling
	of [0.65, 0.80] with a scripted-optimal reference at >=0.90.

	Strategy runs as a Python orchestrator that chains Triage episodes with
	persistent horizon state (unresolved alerts, pending deploys, tech-debt
	counter, horizon-decay reward). It does not simulate a 15-20 service
	topology faithfully -- the YAMLs declare a richer action set as a design
	spec only.

	Operations runs as a Python state-machine simulator over a 22-node service
	graph using the same 11-action interface as Triage. The docker-compose
	stack under sre_gym/operations/families/ references stub images that are
	not published; it is shipped as design spec, not as runnable
	infrastructure.

	Training is not committed to this repo. See notebooks/01 for the GRPO
	pipeline that needs to be executed externally (Colab A100). The README
	baseline tables show frontier-LLM measurements only; the trained-model
	row is intentionally absent until a real run is committed.
	author: Daksh Verma
	license: Apache-2.0

	environment:
	action_type: UnifiedIncidentAction
	observation_type: UnifiedIncidentObservation
	state_type: UnifiedIncidentState
	max_steps: 13
	difficulties: [easy, medium, hard]
	reward_type: dense
	# GRPO/TRL training contract: parallel env instances per training step.
	# Required for the OpenEnv batched-rollout pattern documented at
	# https://huggingface.co/docs/trl/openenv.
	max_concurrent_envs: 64
	scenario_count: 72 # 12 templates x 6 entries (1 base + 5 procgen)
	scenario_templates: 12
	procgen_variants_per_template: 5 # 5 variants per template; 6 entries total each
	deterministic_seeded: true
	tier: triage # the runnable surface served by this Space
	tier_escalation_dimension: compute # see docs/ARCHITECTURE.md for full design

	tiers:
	triage:
	runnable: true
	runnable_kind: live_environment # real /reset + /step routes against the live env
	escalation_dimension: compute
	persona: "ML student / Kaggle, $30 of HF credits"
	scenario_count: 72
	docs: docs/TRIAGE_TIER.md
	notes: "12 base templates + 5 procgen variants each = 72 deterministic scenarios."
	strategy:
	runnable: true
	runnable_kind: python_orchestrator # chained Triage episodes with horizon state
	escalation_dimension: horizon
	persona: "seed/Series A startup, $300-500 budget"
	scenario_count: 3
	docs: docs/STRATEGY_TIER.md
	notes: >
	Strategy runs each scenario as a sequence of Triage episodes glued
	together by the horizon-state object (unresolved alerts, pending
	deploys, tech-debt counter, horizon-decay reward). It is NOT a
	simulator of a 15-20 service topology; the wider action universe
	declared in the YAMLs (~28 actions / scenario) is design spec
	only and is not implemented in the env.
	operations:
	runnable: true
	runnable_kind: python_simulator # graph state machine, not real cluster
	escalation_dimension: realism
	persona: "enterprise SRE platform, 8x A100/H100"
	scenario_count: 1 # one specced family
	chaos_pattern_count: 12 # includes one alias (payment_webhook_storm)
	docs: docs/OPERATIONS_TIER.md
	notes: >
	Operations runs as an in-memory 22-node graph mutator. Reuses the
	Triage 11-action interface; correct_action across patterns is heavily
	skewed toward rollback_deploy (11/12) so the patterns are separable
	on observation alone -- not a hidden-information benchmark. The
	compose file under sre_gym/operations/families/ references stub
	images that are NOT published; do not attempt `docker compose up`
	expecting it to pull successfully.

	reward:
	rubric_components:
	outcome: 0.45
	action_validity: 0.20
	format: 0.10
	anticheat: 0.15
	efficiency: 0.10
	composite_total: 1.0
	heuristic_ceiling_band: [0.65, 0.80]
	scripted_expert_floor: 0.90
	docs: docs/REWARD_DESIGN.md

	training:
	status: executed # SFT + GRPO end-to-end run completed
	base_model: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
	pipeline: notebooks/01_triage_train_grpo_qwen25_7b.ipynb
	comparison: notebooks/02_triage_eval_compare_all.ipynb
	pool_server: coliseum/ # parallel-rollout lease pool for GRPO trainers
	current_artifacts:
	- outputs/qwen25_7b_sft_final/ # SFT checkpoint, eval perplexity 1.755
	- outputs/qwen25_7b_grpo_final/ # GRPO checkpoint
	- eval/results/qwen25_7b_comparison_raw.csv
	- eval/results/qwen25_7b_comparison_hero.png
	trajectory_corpus:
	seed_v2_120_jsonl_rows: 120 # 72 expert + 24 mediocre + 24 failure
	templates_covered: 12 # all 12 Triage templates have ≥5 episodes

	huggingface:
	space_id: Madhav189/SystemTruth # canonical HF Space
	github_repo: Madhav-GPT/SystemTruth # canonical GitHub
	sdk: docker
	hardware: cpu-basic