
dakshdoesdev committed on Apr 23

Commit dc8501a (verified) · 1 Parent(s): 928cc3c

deploy sre-gym v2: easy/medium/hard scenarios + skill + verified-runbooks + demo

This view is limited to 50 files because it contains too many changes.

Files changed (50)

.dockerignore +18 -0
.gitignore +12 -0
Dockerfile +21 -0
Makefile +39 -0
README.md +111 -4
demo/pitch.md +49 -0
demo/run_demo.sh +70 -0
deploy/push_to_hf.sh +58 -0
execution.md +53 -0
inference.py +264 -0
openenv.yaml +21 -0
pyproject.toml +43 -0
requirements.txt +1 -0
run_demo.py +86 -0
server/Dockerfile +21 -0
server/__init__.py +1 -0
server/app.py +14 -0
server/requirements.txt +8 -0
skill/SKILL.md +100 -0
skill/tools/sre_gym_client.py +238 -0
skill/verified-runbooks/.gitkeep +0 -0
skill/verified-runbooks/db_config_rollout.md +23 -0
skill/verified-runbooks/gateway_auth_rollout.md +21 -0
skill/verified-runbooks/worker_deploy_cascade.md +23 -0
unified_incident_env/README.md +10 -0
unified_incident_env/__init__.py +17 -0
unified_incident_env/client.py +35 -0
unified_incident_env/interface.py +17 -0
unified_incident_env/models.py +332 -0
unified_incident_env/scripts/__init__.py +1 -0
unified_incident_env/scripts/baseline_agent.py +43 -0
unified_incident_env/scripts/walkthrough.py +41 -0
unified_incident_env/server/__init__.py +1 -0
unified_incident_env/server/app.py +148 -0
unified_incident_env/server/challenge.py +753 -0
unified_incident_env/server/environment.py +613 -0
unified_incident_env/server/grader.py +145 -0
unified_incident_env/tests/__init__.py +1 -0
unified_incident_env/tests/test_environment.py +192 -0
unified_incident_env/tests/test_submission_inference.py +119 -0
unified_incident_env/tests/test_trainer.py +46 -0
unified_incident_env/tests/test_trainer_session.py +32 -0
unified_incident_env/trainer/__init__.py +8 -0
unified_incident_env/trainer/action_adapter.py +204 -0
unified_incident_env/trainer/analyze_failures.py +251 -0
unified_incident_env/trainer/backend.py +165 -0
unified_incident_env/trainer/build_datasets.py +258 -0
unified_incident_env/trainer/build_sft_dataset.py +101 -0
unified_incident_env/trainer/collect_trajectory.py +53 -0
unified_incident_env/trainer/eval_models.py +145 -0

.dockerignore ADDED

	@@ -0,0 +1,18 @@

+.venv/
+__pycache__/
+*.pyc
+.git/
+.pytest_cache/
+outputs/
+.omx/
+.codex/
+AGENTS.md
+sre_env/
+*.egg-info/
+dist/
+build/
+.gemini/
+madhav_trial/
+*.png
+*.npz
+node_modules/

.gitignore ADDED

	@@ -0,0 +1,12 @@

+.venv/
+__pycache__/
+.pytest_cache/
+*.pyc
+learning_curve.png
+.omx/
+.codex/
+outputs/
+AGENTS.md
+.sisyphus/
+*.egg-info/
+uv.lock

Dockerfile ADDED

	@@ -0,0 +1,21 @@

+FROM python:3.11-slim
+WORKDIR /app
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    ENABLE_WEB_INTERFACE=true
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    build-essential \
+    curl \
+    && rm -rf /var/lib/apt/lists/*
+COPY . /app
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir .
+EXPOSE 8000
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]

Makefile ADDED

	@@ -0,0 +1,39 @@

+.PHONY: install dev test baseline walkthrough trainer-eval trainer-dataset trainer-session docker-build docker-run validate clean
+install:
+	python3 -m pip install -e ".[dev]"
+	@echo "Dependencies installed"
+dev:
+	ENABLE_WEB_INTERFACE=true uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+test:
+	pytest unified_incident_env/tests -v --tb=short
+baseline:
+	python -m unified_incident_env.scripts.baseline_agent
+walkthrough:
+	python -m unified_incident_env.scripts.walkthrough --scenario worker_deploy_cascade
+trainer-eval:
+	python -m unified_incident_env.trainer.eval_models --models qwen2.5:0.5b gemma2:2b qwen2.5:7b-instruct-q4_K_M --mode strict
+trainer-dataset:
+	python -m unified_incident_env.trainer.build_sft_dataset --source combined --output outputs/trainer/sft_dataset.jsonl
+trainer-session:
+	python -m unified_incident_env.trainer.run_session --model qwen2.5:0.5b --base-url http://127.0.0.1:8000
+docker-build:
+	docker buildx build --platform linux/amd64 -t sre-env:latest .
+docker-run:
+	docker run -p 8000:8000 -e ENABLE_WEB_INTERFACE=true sre-env:latest
+validate:
+	openenv validate .
+clean:
+	rm -rf outputs __pycache__ .pytest_cache
+	find . -name "*.pyc" -delete

README.md CHANGED

@@ -1,10 +1,117 @@
 ---
-title: Sre Gym
-emoji: 📚
+title: SRE Gym
+emoji: 🚨
 colorFrom: red
-colorTo: pink
+colorTo: yellow
 sdk: docker
+app_port: 8000
 pinned: false
+license: apache-2.0
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# sre-gym — Fault-injecting SRE training env for OpenEnv
+Most SRE agent skills are runbooks and good intentions. **sre-gym** is the other half: a fault-injecting environment with deterministic grading where an agent diagnoses a real production-style incident, chooses a safe remediation, verifies recovery, and declares resolved. Every run is scored the same way twice.
+- Spec-compliant OpenEnv environment (typed Pydantic action / observation / state, `reset` / `step` / `state`, `openenv validate` green).
+- 3 curriculum scenarios — easy, medium, hard — with decoy services and causal dependencies.
+- 11 bounded actions. Honest state transitions. No hidden oracles.
+- 21 tests passing.
+- Ships a Claude Code skill + verified-runbook loop — successful solves write markdown runbooks that the next run reads back.
+## 30-second demo
+```bash
+./demo/run_demo.sh
+```
+Starts the env, solves each scenario cold, writes a runbook for each, re-solves to prove the loop. Full transcript takes ~10 seconds.
+## Curriculum
+| Difficulty | Scenario | Story | Decoy | Correct path |
+|---|---|---|---|---|
+| easy | `worker_deploy_cascade` | Bad worker deploy → DB crash-loop → login 502s | — | rollback worker → restart db → verify → resolve |
+| medium | `db_config_rollout` | DB config push shrank connection pool from 80→12 | recent worker deploy | rollback **db** → restart db → verify → resolve |
+| hard | `gateway_auth_rollout` | Gateway auth-middleware rollout rejects valid logins | recent worker deploy | rollback **gateway** → verify → resolve (no restart) |
+Rolling back the wrong service returns a negative reward and `failure_type="wrong_remediation_target"`. Restarting before the cause is removed re-inherits the bad state. `declare_resolved` is rejected until the scenario's resolution check passes against the actual world model.
+## Install
+```bash
+# 1. Create a venv and install
+python3 -m venv .venv && source .venv/bin/activate
+pip install -e '.[dev]'
+# 2. Start the env
+uvicorn server.app:app --host 127.0.0.1 --port 8000
+# 3. Run the baseline inference against it
+export HF_TOKEN="…"; export ENV_BASE_URL=http://127.0.0.1:8000
+python inference.py
+```
+## Install the Claude Code skill
+```bash
+ln -s "$PWD/skill" "$HOME/.claude/skills/sre-gym"
+```
+Then, in Claude Code, ask: *"Solve the db_config_rollout scenario in sre-gym."* The skill will drive the env via `skill/tools/sre_gym_client.py`, load any existing runbook from `skill/verified-runbooks/`, and append a fresh runbook on any clean solve (score > 0.85).
+## Architecture
+```
+┌────────────────────┐      HTTP / WS       ┌──────────────────────┐
+│  Claude Code       │ ──────────────────▶ │  OpenEnv server       │
+│  (with sre-gym     │ ◀────────────────── │  (FastAPI, uvicorn)   │
+│   skill loaded)    │    obs, reward      │  unified_incident_env │
+└────────────────────┘                     └──────────────────────┘
+        │                                            ▲
+        ▼ on clean solve (score > 0.85)              │
+┌────────────────────┐                               │
+│ verified-runbooks/ │ ────── loaded at skill load ──┘
+│   *.md             │
+└────────────────────┘
+```
+## Scoring
+Deterministic, 5 dimensions, sums to a public score in `[0.01, 0.99]`:
+- **Recovery** (0–0.40): critical-path services healthy
+- **Containment** (0–0.30): root cause removed or offending service isolated
+- **Verification** (0–0.35): `database_recovery` + `end_to_end` checks passed
+- **Impact** (0–0.15): user-impact reduced
+- **Efficiency** (0–0.10): budget preserved, no wasteful repeats
+Target **> 0.85** for "clean solve." That's also the runbook-record threshold.
+## Repo layout
+```
+unified_incident_env/    # env core: models, environment, grader, challenge, tests
+server/                  # OpenEnv entrypoint wrapper
+skill/                   # Claude Code skill: SKILL.md, tools/, verified-runbooks/
+demo/                    # run_demo.sh + pitch.md
+inference.py             # OpenAI-client baseline for OpenEnv hackathon submission
+openenv.yaml             # OpenEnv manifest
+Dockerfile               # HF Space deployment
+```
+## Verify
+```bash
+pytest unified_incident_env/tests -q          # 21 tests
+python -m openenv.cli validate .              # OpenEnv manifest check
+docker build -t sre-engineer-llm:v2 .         # HF Space image
+```
+## Roadmap — v2
+Distill the accumulated `verified-runbooks/` corpus into a local 3B reviewer via [OpenClaw-RL](https://github.com/Gen-Verse/OpenClaw-RL)'s async GRPO-on-next-state loop. Same reward contract (`run_check` passes / `failure_type` absent), same grader, but a compact policy that runs without a frontier API.
+## License
+Apache 2.0
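
Note on the README's Scoring section: the five listed maxima add up to 1.30, while the public score is stated to lie in `[0.01, 0.99]`. The sketch below shows one way both statements can hold, assuming each dimension is capped and the sum is then clamped. `grader.py` is not visible in this view, so `CAPS`, `public_score`, `is_clean_solve`, and the clamping rule are illustrative assumptions rather than the committed implementation.

```python
# Hypothetical sketch: caps are taken from the README, the clamping rule is an assumption.
CAPS = {
    "recovery": 0.40,
    "containment": 0.30,
    "verification": 0.35,
    "impact": 0.15,
    "efficiency": 0.10,
}


def public_score(components: dict) -> float:
    """Cap each dimension, sum the capped values, and clamp the total to [0.01, 0.99]."""
    total = 0.0
    for name, cap in CAPS.items():
        value = components.get(name, 0.0)
        total += min(max(value, 0.0), cap)
    return round(min(max(total, 0.01), 0.99), 4)


def is_clean_solve(score: float, threshold: float = 0.85) -> bool:
    """README threshold: a clean solve scores strictly above 0.85."""
    return score > threshold
```

Under these assumptions an empty component dict maps to the floor of 0.01, and maxing every dimension saturates at 0.99.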

demo/pitch.md ADDED

	@@ -0,0 +1,49 @@

+# sre-gym — 60-second pitch
+> You can't train SRE agents on production. We built the gym.
+## The story (00:00–01:00)
+**[0:00–0:10 · Hook]** "Most SRE agent skills are prompts — a runbook and a good intention. We built the other half: a fault-injecting environment with deterministic grading, where every run is scored the same way twice."
+**[0:10–0:25 · What it is]**
+- OpenEnv-compliant. `openenv validate` passes.
+- Three curriculum scenarios, easy → hard:
+  - **easy** `worker_deploy_cascade` — bad worker deploy cascades to a DB crash.
+  - **medium** `db_config_rollout` — DB config shrank the connection pool; a recent worker deploy is a decoy.
+  - **hard** `gateway_auth_rollout` — bad auth-middleware rollout; two plausible suspects, one right answer.
+- 11 bounded actions, honest state transitions (rolling back the wrong thing *fails*), deterministic grader across recovery / containment / verification / impact / efficiency.
+- 21 tests passing. One public Space URL.
+**[0:25–0:55 · Live demo]** `./demo/run_demo.sh`
+- Env starts. Three scenarios visible in `/tasks`.
+- Runbook dir cleared; demo starts cold.
+- Each scenario solves end-to-end (score ≈ 0.99, 8–10 steps).
+- A markdown runbook is written per scenario from the successful trace.
+- Re-solve the easy scenario — this time the skill loads the runbook first. Same score, same path, zero wasted investigation.
+- Point to `skill/verified-runbooks/` — "Every clean solve makes the next one deterministic. No GRPO required for v1."
+**[0:55–1:00 · Close]** "Install the skill by symlinking `skill/` into `~/.claude/skills/sre-gym`. Open source, Apache 2. v2 is the OpenClaw-RL loop — distill this corpus of verified runbooks into a local 3B reviewer."
+## The one technical claim you should be ready to defend
+> "The env is honest."
+- No hidden oracles. Rolling back the wrong service returns a negative reward and `failure_type="wrong_remediation_target"` — same observation contract as any other action.
+- `declare_resolved` is rejected until the scenario's `resolution_check` passes, verified by actual service states in the world model, not a flag the grader peeks at.
+- Rewards track *effects*, not evidence-gathering: you can't farm the env by spamming `query_logs`.
+- `restart_service` on the database before the root cause is removed returns a negative reward. Always. Because in the real world, it would crash again.
+## Judge Q&A cheat sheet
+**"How is this different from running a real staging env?"**
+Deterministic scoring. Every agent gets graded against the same signatures, same decoys, same tick budget. You can't do that on real infra.
+**"Why only three scenarios?"**
+Three clears the hackathon DQ gate (`easy/medium/hard`). Each has a decoy + causal chain — building another one is a data-entry exercise, not a design one. Adding scenarios #4–#20 is the v2 data scaling lane.
+**"Why runbooks instead of GRPO?"**
+For this submission, GRPO means 48 hours of training convergence risk on top of an env we just shipped. Markdown runbooks demonstrate the same loop (verified signal → persisted artefact → next run improves) in an auditable form. The GRPO wiring slots on top of the same traces when we're ready.
+**"What's the skill actually doing at runtime?"**
+The skill lives in `skill/SKILL.md`. It directs Claude (or any agent) to read `verified-runbooks/{scenario}.md` before the first action, drive the env through `skill/tools/sre_gym_client.py`, and append a fresh runbook on any solve with `final_score > 0.85`.
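
Note on the "env is honest" claims in the pitch: the three rules (wrong rollback target is penalised, restart before the cause is removed is penalised, `declare_resolved` depends on actual service state) can be illustrated with a toy world model. This is a sketch under assumptions, not the code in `environment.py`, which is not visible here. The `ToyWorld` class, its fields, the reward values, and the `"premature_restart"` label are hypothetical; only `failure_type="wrong_remediation_target"` and the rules themselves come from the text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyWorld:
    """Hypothetical world model for a db_config_rollout-style incident."""

    culprit: str = "database"  # service whose rollout caused the incident
    cause_removed: bool = False
    healthy: dict = field(default_factory=lambda: {"database": False, "worker": True})

    def rollback_deploy(self, service: str) -> tuple[float, str | None]:
        if service != self.culprit:
            # Wrong target: negative reward, world state unchanged.
            return -0.2, "wrong_remediation_target"
        self.cause_removed = True
        return 0.3, None

    def restart_service(self, service: str) -> tuple[float, str | None]:
        if not self.cause_removed:
            # The restarted service re-inherits the bad state (label is made up).
            return -0.2, "premature_restart"
        self.healthy[service] = True
        return 0.3, None

    def declare_resolved(self) -> bool:
        # Accepted only when the cause is gone and every service is actually healthy.
        return self.cause_removed and all(self.healthy.values())
```

The point of the design is that resolution is derived from world state, so there is no separate flag an agent could set without performing the remediation.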

demo/run_demo.sh ADDED

	@@ -0,0 +1,70 @@

+#!/usr/bin/env bash
+# sre-gym end-to-end demo.
+# Spins up the env (or reuses a running one), solves each of the 3 scenarios
+# with the baseline policy, records runbooks, shows the artefacts.
+#
+# Requires: python3.10+, docker (for the HF-Space-equivalent image) OR the
+# repo's .venv. Defaults to .venv if present.
+set -euo pipefail
+cd "$(dirname "$0")/.."
+PORT="${PORT:-8013}"
+URL="http://127.0.0.1:${PORT}"
+PY="${PYTHON:-.venv/bin/python}"
+RUNBOOK_DIR="skill/verified-runbooks"
+banner() { printf '\n\033[1;36m== %s ==\033[0m\n' "$*"; }
+ok()     { printf '\033[0;32m  ✓ %s\033[0m\n' "$*"; }
+banner "0 / preflight"
+if [[ ! -x "$PY" ]]; then
+  echo "  note: $PY not found, falling back to system python3" >&2
+  PY="python3"
+fi
+"$PY" -c "import unified_incident_env" 2>/dev/null || {
+  echo "  error: unified_incident_env not importable; run 'pip install -e .' first" >&2
+  exit 1
+}
+ok "python + package ready"
+banner "1 / start env"
+SERVER_STARTED=0
+# Register cleanup before launching so a server that fails its health check is still stopped.
+trap '[[ ${SERVER_STARTED:-0} -eq 1 ]] && kill ${SERVER_PID:-0} 2>/dev/null || true' EXIT
+if curl -sf "$URL/health" > /dev/null 2>&1; then
+  ok "env already running on $URL"
+else
+  "$PY" -m uvicorn server.app:app --host 127.0.0.1 --port "$PORT" > /tmp/sre_gym_demo.log 2>&1 &
+  SERVER_PID=$!
+  SERVER_STARTED=1
+  for _ in $(seq 1 20); do
+    if curl -sf "$URL/health" > /dev/null 2>&1; then break; fi
+    sleep 0.3
+  done
+  curl -sf "$URL/health" > /dev/null || { echo "  error: env failed to start" >&2; cat /tmp/sre_gym_demo.log >&2; exit 1; }
+  ok "env started on $URL (pid $SERVER_PID)"
+fi
+banner "2 / available scenarios"
+SRE_GYM_URL="$URL" "$PY" skill/tools/sre_gym_client.py list
+banner "3 / clear prior runbooks (demo starts cold)"
+rm -f "$RUNBOOK_DIR"/*.md
+ok "runbook directory cleared"
+for scenario in worker_deploy_cascade db_config_rollout gateway_auth_rollout; do
+  banner "4 / solve: $scenario"
+  SRE_GYM_URL="$URL" "$PY" skill/tools/sre_gym_client.py solve "$scenario"
+  SRE_GYM_URL="$URL" "$PY" skill/tools/sre_gym_client.py record-runbook "$scenario"
+done
+banner "5 / verified runbooks now on disk"
+ls -1 "$RUNBOOK_DIR"/*.md | sed 's|^|  |'
+banner "6 / re-solve easy scenario — runbook is loaded this time"
+SRE_GYM_URL="$URL" "$PY" skill/tools/sre_gym_client.py solve worker_deploy_cascade | tail -4
+banner "done"
+echo "  install the skill globally:   ln -s \"$PWD/skill\" \"\$HOME/.claude/skills/sre-gym\""
+echo "  env log:                      /tmp/sre_gym_demo.log"
+echo "  runbooks:                     $RUNBOOK_DIR/"
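
Note on the `record-runbook` step in this script: the README says a runbook is appended only for a clean solve (score > 0.85) and stored under `skill/verified-runbooks/`. A minimal sketch of that gate follows. `sre_gym_client.py` is not shown in this view, so `record_runbook`, its signature, and the markdown layout are hypothetical; only the threshold and the `{scenario}.md` naming come from the text.

```python
from __future__ import annotations

from pathlib import Path

CLEAN_SOLVE_THRESHOLD = 0.85  # threshold stated in the README


def record_runbook(runbook_dir: Path, scenario: str, score: float, actions: list[str]) -> Path | None:
    """Append a runbook entry for a clean solve; skip scores at or below the threshold."""
    if score <= CLEAN_SOLVE_THRESHOLD:
        return None
    runbook_dir.mkdir(parents=True, exist_ok=True)
    path = runbook_dir / f"{scenario}.md"
    # Illustrative layout: a heading with the score, then the action trace as a numbered list.
    lines = [f"## Verified solve (score {score:.2f})", ""]
    lines += [f"{index}. `{action}`" for index, action in enumerate(actions, start=1)]
    with path.open("a", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n\n")
    return path
```

Appending rather than overwriting keeps earlier verified traces in the file, which matches the "append a fresh runbook" wording.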

deploy/push_to_hf.sh ADDED

	@@ -0,0 +1,58 @@

+#!/usr/bin/env bash
+# Deploy this repo to a Hugging Face Space (Docker SDK).
+#
+# Required:
+#   HF_TOKEN      write-scoped HF access token
+#   HF_SPACE_ID   e.g. yourname/sre-gym  (create it at huggingface.co/new-space
+#                 first, SDK=Docker, or let this script try to create it)
+#
+# Usage:
+#   HF_TOKEN=hf_xxx HF_SPACE_ID=yourname/sre-gym ./deploy/push_to_hf.sh
+#
+# After a successful push, verify from a different network:
+#   curl https://${space_subdomain}.hf.space/health
+#   curl https://${space_subdomain}.hf.space/tasks | jq '.scenarios[].difficulty'
+set -euo pipefail
+cd "$(dirname "$0")/.."
+: "${HF_TOKEN:?HF_TOKEN is required}"
+: "${HF_SPACE_ID:?HF_SPACE_ID is required, e.g. yourname/sre-gym}"
+if ! command -v huggingface-cli > /dev/null; then
+  echo "error: huggingface-cli not installed. pip install 'huggingface_hub[cli]'" >&2
+  exit 1
+fi
+echo "== syncing openenv.yaml with HF_SPACE_ID =="
+python3 - <<PY
+import pathlib, re
+path = pathlib.Path("openenv.yaml")
+text = path.read_text()
+text = re.sub(r"^  space_id:.*$", f"  space_id: $HF_SPACE_ID", text, flags=re.M)
+path.write_text(text)
+print(f"openenv.yaml space_id -> $HF_SPACE_ID")
+PY
+echo "== ensuring the space exists (idempotent) =="
+huggingface-cli repo create "$HF_SPACE_ID" \
+  --type space \
+  --space_sdk docker \
+  --token "$HF_TOKEN" \
+  --yes 2>&1 | grep -v "already created" || true
+echo "== uploading repo =="
+huggingface-cli upload "$HF_SPACE_ID" . \
+  --repo-type space \
+  --token "$HF_TOKEN" \
+  --commit-message "deploy sre-gym v2 (easy/medium/hard scenarios)"
+subdomain="$(echo "$HF_SPACE_ID" | tr '/' '-')"
+echo
+echo "== deployment kicked off =="
+echo "   Logs:     https://huggingface.co/spaces/$HF_SPACE_ID"
+echo "   Public:   https://$subdomain.hf.space"
+echo
+echo "== verify from a different network (phone hotspot) =="
+echo "   curl https://$subdomain.hf.space/health"
+echo "   curl https://$subdomain.hf.space/tasks | jq '.scenarios[].difficulty'"

execution.md ADDED

	@@ -0,0 +1,53 @@

+# How To Run (v2)
+## 1. Setup
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -e '.[dev]'
+```
+## 2. Start the environment
+```bash
+source .venv/bin/activate
+uvicorn server.app:app --host 127.0.0.1 --port 8000
+```
+## 3. Manual API smoke test
+```bash
+curl -X POST http://127.0.0.1:8000/reset -H 'content-type: application/json' -d '{}'
+curl -X POST http://127.0.0.1:8000/step -H 'content-type: application/json' -d '{"action":{"action_type":"query_deploys","service":"worker"}}'
+```
+## 4. Run inference
+```bash
+source .venv/bin/activate
+export HF_TOKEN="your_hf_token"
+export API_BASE_URL="https://router.huggingface.co/v1"
+export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct:novita"
+export ENV_BASE_URL="http://127.0.0.1:8000"
+python inference.py
+```
+## 5. Verification
+```bash
+source .venv/bin/activate
+pytest unified_incident_env/tests -q
+openenv validate .
+```
+## 6. Reward semantics
+- queries reveal evidence but do not directly mint positive breadcrumb reward
+- remediation actions change the world state
+- `run_check` verifies recovery explicitly
+- `declare_resolved` succeeds only after objective checks pass
+Public benchmark score is deterministic and separate from the per-step training reward.
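
Note on step 3 of execution.md: the curl smoke test can also be driven from Python. The sketch below uses only the standard library. The endpoint paths and payload shapes are copied from the curl commands; the helper names `step_payload`, `post_json`, and `smoke_test` are illustrative, and the response body is assumed to be JSON.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # same address as the uvicorn command in step 2


def step_payload(action_type: str, **fields) -> dict:
    """Wrap an action in the {"action": {...}} envelope used by POST /step."""
    return {"action": {"action_type": action_type, **fields}}


def post_json(path: str, payload: dict, base_url: str = BASE_URL) -> dict:
    """POST a JSON body and decode the response (assumed to be JSON)."""
    request = urllib.request.Request(
        f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def smoke_test() -> None:
    """Mirror the two curl calls; requires a running server."""
    print(post_json("/reset", {}))
    print(post_json("/step", step_payload("query_deploys", service="worker")))
```

Call `smoke_test()` only after the server from step 2 is up; nothing in the module touches the network at import time.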

inference.py ADDED

	@@ -0,0 +1,264 @@

+#!/usr/bin/env python3
+"""Submission inference script for the honest narrow incident environment."""
+from __future__ import annotations
+import json
+import os
+from typing import Any
+from openai import OpenAI
+from unified_incident_env.client import UnifiedIncidentEnv
+from unified_incident_env.models import UnifiedIncidentAction, UnifiedIncidentObservation
+from unified_incident_env.server.challenge import DEFAULT_SCENARIO_ID, SCENARIOS
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct:novita")
+HF_TOKEN = os.getenv("HF_TOKEN")
+ENV_BASE_URL = os.getenv("ENV_BASE_URL") or UnifiedIncidentEnv.DEFAULT_BASE_URL
+ENV_NAME = "unified-incident-env"
+MAX_TOKENS = 260
+def create_client() -> OpenAI | None:
+    if not HF_TOKEN:
+        return None
+    return OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
+def log_start(*, task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(*, step: int, action: str, reward: float, done: bool, error: str | None) -> None:
+    print(
+        f"[STEP] step={step} action={action} reward={reward:.2f} done={str(done).lower()} error={error or 'null'}",
+        flush=True,
+    )
+def log_end(*, success: bool, steps: int, score: float, rewards: list[float]) -> None:
+    rewards_text = ",".join(f"{reward:.2f}" for reward in rewards)
+    print(f"[END] success={str(success).lower()} steps={steps} score={score:.2f} rewards={rewards_text}", flush=True)
+def _service_order(observation: UnifiedIncidentObservation) -> list[str]:
+    services = list(observation.service_health.items())
+    services.sort(key=lambda item: (item[1].status != "crashed", item[1].status != "degraded", -item[1].error_rate_pct))
+    return [name for name, _payload in services]
+def _default_action_for_type(action_type: str, observation: UnifiedIncidentObservation) -> dict[str, Any]:
+    services = _service_order(observation)
+    service = services[0] if services else "database"
+    if action_type in {"query_logs", "query_dependencies", "query_deploys", "rollback_deploy", "restart_service", "isolate_service"}:
+        if action_type == "rollback_deploy":
+            service = "worker"
+        return {"action_type": action_type, "service": service}
+    if action_type == "query_metrics":
+        return {"action_type": action_type, "service": service, "metric": "cpu"}
+    if action_type == "run_check":
+        check_name = "database_recovery"
+        if observation.service_health.get("database") and observation.service_health["database"].status == "healthy":
+            check_name = "end_to_end"
+        return {"action_type": action_type, "check_name": check_name}
+    if action_type == "submit_hypothesis":
+        return {
+            "action_type": "submit_hypothesis",
+            "hypothesis": {
+                "root_cause": "bad_worker_deploy",
+                "affected_services": ["worker", "database"],
+                "confidence": 0.5,
+                "recommended_next_action": "query_deploys",
+            },
+        }
+    return {"action_type": action_type}
+def parse_action(raw: str, observation: UnifiedIncidentObservation) -> UnifiedIncidentAction | None:
+    text = raw.strip()
+    if not text:
+        return None
+    try:
+        data = json.loads(text)
+    except Exception:
+        return None
+    if not isinstance(data, dict):
+        return None
+    if "action" in data and "action_type" not in data and isinstance(data["action"], str):
+        data = {**data, "action_type": data["action"]}
+        data.pop("action", None)
+    action_type = data.get("action_type")
+    if action_type not in observation.allowed_actions:
+        return None
+    try:
+        return UnifiedIncidentAction(**data)
+    except Exception:
+        return None
+def build_user_prompt(observation: UnifiedIncidentObservation) -> str:
+    required_lines = []
+    for action, fields in observation.required_fields_by_action.items():
+        required_lines.append(f"- {action}: {', '.join(fields) if fields else '(no extra fields)'}")
+    checks = "\n".join(
+        f"- {check.name}: {'passed' if check.passed else 'pending'} - {check.detail}"
+        for check in observation.checks
+    ) or "- none"
+    return (
+        "Return exactly one JSON object representing the next action.\n"
+        f"Current stage: {observation.workflow_stage}\n"
+        f"Incident summary: {observation.incident_summary}\n"
+        f"Current score: {observation.final_score:.4f}\n"
+        f"Last action result: {observation.last_action_result or 'none'}\n"
+        f"Tool output: {observation.tool_output or 'none'}\n"
+        f"Failure: {observation.failure_type or 'none'}\n"
+        f"Why failed: {observation.why_failed or 'none'}\n"
+        f"User impact: {observation.user_impact:.2f}\n"
+        f"SLO burn rate: {observation.slo_burn_rate:.2f}\n"
+        "Allowed actions:\n"
+        + "\n".join(f"- {action}" for action in observation.allowed_actions)
+        + "\nRequired fields:\n"
+        + "\n".join(required_lines)
+        + "\nChecks:\n"
+        + checks
+    )
+def _schema(observation: UnifiedIncidentObservation) -> dict[str, Any]:
+    properties: dict[str, Any] = {
+        "action_type": {"type": "string", "enum": observation.allowed_actions},
+        "service": {"type": "string", "enum": sorted(observation.service_health)},
+        "metric": {"type": "string", "enum": ["cpu", "error_rate", "latency"]},
+        "check_name": {"type": "string", "enum": ["database_recovery", "end_to_end"]},
+        "hypothesis": {
+            "type": "object",
+            "properties": {
+                "root_cause": {"type": "string", "enum": ["bad_worker_deploy", "database_only_failure", "api_gateway_fault"]},
+                "affected_services": {
+                    "type": "array",
+                    "items": {"type": "string", "enum": sorted(observation.service_health)},
+                    "minItems": 1,
+                },
+                "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
+                "recommended_next_action": {
+                    "type": "string",
+                    "enum": [
+                        "query_logs",
+                        "query_metrics",
+                        "query_dependencies",
+                        "query_deploys",
+                        "rollback_deploy",
+                        "restart_service",
+                        "run_check",
+                        "isolate_service",
+                        "escalate",
+                        "declare_resolved",
+                    ],
+                },
+            },
+            "required": ["root_cause", "affected_services", "confidence", "recommended_next_action"],
+            "additionalProperties": False,
+        },
+    }
+    required = ["action_type"]
+    for action, fields in observation.required_fields_by_action.items():
+        if action in observation.allowed_actions:
+            for field in fields:
+                if field not in required:
+                    required.append(field)
+    return {
+        "type": "object",
+        "properties": properties,
+        "required": required,
+        "additionalProperties": False,
+    }
+def request_action(client: OpenAI, observation: UnifiedIncidentObservation) -> str:
+    completion = client.chat.completions.create(
+        model=MODEL_NAME,
+        messages=[
+            {"role": "system", "content": "You are an incident responder. Respond with JSON only."},
+            {"role": "user", "content": build_user_prompt(observation)},
+        ],
+        response_format={
+            "type": "json_schema",
+            "json_schema": {
+                "name": "incident_action",
+                "strict": True,
+                "schema": _schema(observation),
+            },
+        },
+        max_tokens=MAX_TOKENS,
+        temperature=0.0,
+    )
+    return (completion.choices[0].message.content or "").strip()
+def build_fallback_action(observation: UnifiedIncidentObservation) -> UnifiedIncidentAction:
+    services = _service_order(observation)
+    if "query_deploys" in observation.allowed_actions and "worker" in observation.service_health:
+        return UnifiedIncidentAction(action_type="query_deploys", service="worker")
+    if "query_logs" in observation.allowed_actions:
+        return UnifiedIncidentAction(action_type="query_logs", service=services[0] if services else "database")
+    if "query_metrics" in observation.allowed_actions:
+        return UnifiedIncidentAction(action_type="query_metrics", service=services[0] if services else "database", metric="cpu")
+    action_type = observation.allowed_actions[0]
+    return UnifiedIncidentAction(**_default_action_for_type(action_type, observation))
+def get_model_action(client: OpenAI | None, observation: UnifiedIncidentObservation) -> tuple[UnifiedIncidentAction, str | None]:
+    if client is None:
+        return build_fallback_action(observation), "model_unavailable"
+    try:
+        parsed = parse_action(request_action(client, observation), observation)
+        if parsed is not None:
+            return parsed, None
+    except Exception:
+        pass
+    return build_fallback_action(observation), "fallback_used"
+def run_scenario(client: OpenAI | None, scenario_id: str) -> dict[str, Any]:
+    with UnifiedIncidentEnv(base_url=ENV_BASE_URL).sync() as env:
+        observation = env.reset(scenario_id=scenario_id).observation
+        rewards: list[float] = []
+        step = 0
+        log_start(task=scenario_id, env=ENV_NAME, model=MODEL_NAME)
+        while not observation.done:
+            step += 1
+            action, error = get_model_action(client, observation)
+            result = env.step(action)
+            observation = result.observation
+            rewards.append(float(result.reward))
+            log_step(
+                step=step,
+                action=json.dumps(action.model_dump(exclude_none=True), separators=(",", ":")),
+                reward=float(result.reward),
+                done=bool(result.done),
+                error=error or observation.failure_type,
+            )
+        log_end(
+            success=bool(observation.done and observation.incident_resolved),
+            steps=step,
+            score=observation.final_score,
+            rewards=rewards,
+        )
+        return {
+            "success": bool(observation.done and observation.incident_resolved),
+            "score": observation.final_score,
+            "steps": step,
+            "rewards": rewards,
+        }
+def main() -> None:
+    client = create_client()
+    for scenario_id in SCENARIOS:
+        run_scenario(client, scenario_id)
+if __name__ == "__main__":
+    main()
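
Note on the logging helpers in inference.py: `log_end` prints a single space-separated `key=value` line, which makes transcripts easy to post-process. A minimal sketch of a parser for that line follows. `parse_end_line` is a hypothetical helper that is not part of this commit, and it assumes exactly the format printed by `log_end`.

```python
def parse_end_line(line: str) -> dict:
    """Parse '[END] success=... steps=... score=... rewards=...' into typed fields."""
    prefix = "[END] "
    line = line.strip()
    if not line.startswith(prefix):
        raise ValueError("not an [END] line")
    # Each token is key=value; split on the first '=' only.
    fields = dict(part.split("=", 1) for part in line[len(prefix):].split(" "))
    rewards_text = fields["rewards"]
    return {
        "success": fields["success"] == "true",
        "steps": int(fields["steps"]),
        "score": float(fields["score"]),
        # An episode with zero steps prints an empty rewards list.
        "rewards": [float(r) for r in rewards_text.split(",")] if rewards_text else [],
    }
```

The `[STEP]` lines embed a JSON action and would need a more careful tokenizer, so this sketch deliberately covers only `[END]`.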

openenv.yaml ADDED

	@@ -0,0 +1,21 @@

+name: sre-engineer-llm
+version: 2.0.0
+description: >
+  Honest narrow OpenEnv benchmark for incident diagnosis and safe remediation.
+  Agents query evidence, choose bounded remediation actions, run explicit checks,
+  and declare resolution only after objective recovery succeeds.
+author: Daksh Verma
+license: Apache-2.0
+environment:
+  action_type: UnifiedIncidentAction
+  observation_type: UnifiedIncidentObservation
+  state_type: UnifiedIncidentState
+  max_steps: 12
+  difficulties: [easy, medium, hard]
+  reward_type: dense
+huggingface:
+  space_id: dakshdoesdev/sre-gym
+  sdk: docker
+  hardware: cpu-basic

pyproject.toml ADDED

	@@ -0,0 +1,43 @@

+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[project]
+name = "unified-incident-env"
+version = "1.0.0"
+description = "Unified OpenEnv benchmark for incident response with causally linked WebSec remediation"
+readme = "README.md"
+requires-python = ">=3.10"
+dependencies = [
+    "openenv-core>=0.2.1",
+    "fastapi>=0.115.0",
+    "uvicorn[standard]>=0.30.0",
+    "pydantic>=2.8.0",
+    "httpx>=0.27.0",
+    "openai>=1.0.0",
+    "websockets>=12.0",
+    "rich>=13.0.0",
+    "matplotlib>=3.9.0",
+    "numpy>=2.0.0"
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+    "pytest-asyncio>=0.23.0"
+]
+[project.scripts]
+server = "server.app:main"
+baseline = "unified_incident_env.scripts.baseline_agent:main"
+walkthrough = "unified_incident_env.scripts.walkthrough:main"
+trainer-run-episode = "unified_incident_env.trainer.run_episode:main"
+trainer-build-dataset = "unified_incident_env.trainer.build_sft_dataset:main"
+trainer-eval-models = "unified_incident_env.trainer.eval_models:main"
+trainer-build-datasets = "unified_incident_env.trainer.build_datasets:main"
+trainer-update-model = "unified_incident_env.trainer.update_model:main"
+trainer-run-session = "unified_incident_env.trainer.run_session:main"
+trainer-train-external = "unified_incident_env.trainer.train_external:main"
+[tool.hatch.build.targets.wheel]
+packages = ["unified_incident_env", "server"]

requirements.txt ADDED Viewed

	@@ -0,0 +1 @@


+-e .

run_demo.py ADDED Viewed

	@@ -0,0 +1,86 @@

+#!/usr/bin/env python3
+"""Run a local end-to-end benchmark demo against the OpenEnv server."""
+from __future__ import annotations
+import os
+import signal
+import subprocess
+import sys
+import time
+from pathlib import Path
+import httpx
+REPO_ROOT = Path(__file__).resolve().parent
+BASE_URL = os.getenv("ENV_BASE_URL", "http://127.0.0.1:8000")
+HEALTH_URL = f"{BASE_URL.rstrip('/')}/health"
+def server_is_ready() -> bool:
+    try:
+        response = httpx.get(HEALTH_URL, timeout=2.0)
+        return response.status_code == 200
+    except Exception:
+        return False
+def start_server() -> subprocess.Popen[str]:
+    return subprocess.Popen(
+        [
+            sys.executable,
+            "-m",
+            "uvicorn",
+            "server.app:app",
+            "--host",
+            "127.0.0.1",
+            "--port",
+            "8000",
+        ],
+        cwd=REPO_ROOT,
+        text=True,
+    )
+def wait_for_server(timeout_s: float = 20.0) -> None:
+    deadline = time.time() + timeout_s
+    while time.time() < deadline:
+        if server_is_ready():
+            return
+        time.sleep(0.5)
+    raise RuntimeError(f"Server did not become ready at {HEALTH_URL}")
+def stop_server(process: subprocess.Popen[str]) -> None:
+    if process.poll() is not None:
+        return
+    process.send_signal(signal.SIGTERM)
+    try:
+        process.wait(timeout=10)
+    except subprocess.TimeoutExpired:
+        process.kill()
+        process.wait()  # reap the killed process so it does not linger as a zombie
+def main() -> None:
+    server_process: subprocess.Popen[str] | None = None
+    try:
+        if not server_is_ready():
+            server_process = start_server()
+            wait_for_server()
+        env = os.environ.copy()
+        env.setdefault("ENV_BASE_URL", BASE_URL)
+        subprocess.run(
+            [sys.executable, "inference.py"],
+            cwd=REPO_ROOT,
+            env=env,
+            check=True,
+        )
+    finally:
+        if server_process is not None:
+            stop_server(server_process)
+if __name__ == "__main__":
+    main()
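
The readiness loop in `wait_for_server` computes its deadline from `time.time()`, which can jump if the wall clock is adjusted while the demo is starting. Below is a minimal sketch of the same polling pattern on a monotonic clock, with the probe passed in so it can be exercised without a running server. `wait_until` and its parameters are hypothetical names and are not part of this commit.

```python
import time
from typing import Callable


def wait_until(probe: Callable[[], bool], timeout_s: float = 20.0, interval_s: float = 0.5) -> int:
    """Poll `probe` until it returns True; return the number of attempts made.

    Hypothetical helper: mirrors wait_for_server() but uses a monotonic clock
    and accepts any zero-argument readiness probe.
    """
    deadline = time.monotonic() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        if probe():
            return attempts
        if time.monotonic() >= deadline:
            raise RuntimeError(f"probe did not succeed within {timeout_s:.1f}s")
        time.sleep(interval_s)


if __name__ == "__main__":
    # Stand-in probe that succeeds on the third call.
    calls = {"n": 0}

    def fake_probe() -> bool:
        calls["n"] += 1
        return calls["n"] >= 3

    print(wait_until(fake_probe, timeout_s=5.0, interval_s=0.01))
```

In `run_demo.py` the probe would be `server_is_ready`; injecting it keeps the timing logic testable on its own.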

server/Dockerfile ADDED Viewed

	@@ -0,0 +1,21 @@

+FROM python:3.11-slim
+WORKDIR /app
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    ENABLE_WEB_INTERFACE=true
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    build-essential \
+    curl \
+    && rm -rf /var/lib/apt/lists/*
+COPY . /app
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir .
+EXPOSE 8000
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]

server/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


+"""OpenEnv server wrapper package."""

server/app.py ADDED Viewed

	@@ -0,0 +1,14 @@

+"""Top-level OpenEnv entrypoint wrapper."""
+from unified_incident_env.server.app import app, serve
+from unified_incident_env.server.app import main as _main
+__all__ = ["app", "main", "serve"]
+def main() -> None:
+    _main()
+if __name__ == "__main__":
+    main()

server/requirements.txt ADDED Viewed

	@@ -0,0 +1,8 @@

+openenv-core>=0.2.1
+fastapi>=0.115.0
+uvicorn[standard]>=0.30.0
+pydantic>=2.8.0
+websockets>=12.0
+openai>=1.0.0
+matplotlib>=3.9.0
+numpy>=2.0.0

skill/SKILL.md ADDED Viewed

	@@ -0,0 +1,100 @@

+---
+name: sre-gym
+description: SRE incident-response training environment with fault injection and deterministic grading. Use when the user wants to practice SRE skills, solve an injected production incident, or run one of three scenarios (worker_deploy_cascade / db_config_rollout / gateway_auth_rollout) against the sre-gym HTTP server. Invokes scripts in skill/tools/ to query the env and records verified runbooks after clean solves.
+---
+# SRE Gym — Incident Response Skill
+You are an SRE agent connected to a running sre-gym environment (HTTP, default `http://127.0.0.1:8000`). The env simulates production incidents with decoy services, deterministic grading, and explicit resolution checks. Your job is to diagnose from evidence, pick the correct remediation, verify recovery, then declare resolved.
+## When to use this skill
+- The user names a scenario (`worker_deploy_cascade`, `db_config_rollout`, `gateway_auth_rollout`) or says "solve an incident / run SRE scenario"
+- The user asks you to practice, benchmark, or demo incident response
+- The user points you at an sre-gym URL
+## Core rules (never break these)
+1. **Never guess at remediation.** Query evidence (`query_logs`, `query_deploys`, `query_metrics`) before `rollback_deploy` / `restart_service`.
+2. **Root cause before restart.** Restarting a service before rolling back the triggering change re-inherits the bad state.
+3. **Never call `declare_resolved` before the scenario's resolution check passes.** Each scenario specifies which check is required; read it from `observation.checks` and from any loaded runbook.
+4. **Watch for decoys.** Each scenario has a plausible-looking wrong answer. Example: `db_config_rollout` has a recent worker deploy that is *not* the cause. Read logs before committing to a target.
+5. **Repeating the same no-progress action wastes ticks.** The env emits `loop_warning` when you do this — treat it as a hard signal to try a different evidence source.
+## Workflow
+### 1. Load prior knowledge
+Before your first action, check `skill/verified-runbooks/{scenario_id}.md`. If it exists, read it — it's a log of previously-successful solves for this exact scenario, written by earlier runs of this skill. Use the winning path and the decoy list.
+### 2. Drive the env
+Use `skill/tools/sre_gym_client.py` to call the env:
+```bash
+python skill/tools/sre_gym_client.py list                      # show available scenarios
+python skill/tools/sre_gym_client.py solve <id>                # run the baseline policy end-to-end in one process
+python skill/tools/sre_gym_client.py interactive <id>          # take actions: one JSON action per stdin line
+python skill/tools/sre_gym_client.py record-runbook <id>       # append a runbook entry from the saved session
+```
+Action JSON matches the env's `UnifiedIncidentAction` model. Examples:
+```json
+{"action_type": "query_logs", "service": "database"}
+{"action_type": "query_deploys", "service": "worker"}
+{"action_type": "rollback_deploy", "service": "database"}
+{"action_type": "run_check", "check_name": "end_to_end"}
+{"action_type": "declare_resolved"}
+```
+### 3. Investigation loop (per tick)
+1. Read `observation.prompt_text` — services, alerts, last result, failure_type, why_failed.
+2. If `observation.failure_type` is set, your previous action was rejected — **do not repeat it**, read `why_failed` and pick a different evidence source or remediation.
+3. Form a hypothesis with `submit_hypothesis` once you have enough evidence (usually 2–4 queries). Calibrate `confidence`: ≥0.7 only if you're sure.
+4. Remediate (`rollback_deploy` → `restart_service` if scenario requires → `run_check`).
+5. `declare_resolved` only after the required check passes.
+### 4. Record the runbook
+If the episode finishes with `incident_resolved=true` and `final_score > 0.85`, run:
+```bash
+python skill/tools/sre_gym_client.py record-runbook <scenario_id>
+```
+This appends a new entry to `skill/verified-runbooks/{scenario_id}.md`. Future runs of this skill (yours or another Claude's) load it automatically.
+## Action reference (11 actions)
+| Action | Required fields | Purpose |
+|---|---|---|
+| `query_logs` | `service` | Read service-level error logs |
+| `query_metrics` | `service`, `metric` (cpu/error_rate/latency) | Read quantitative signals |
+| `query_dependencies` | `service` | Map upstream/downstream |
+| `query_deploys` | `service` | Recent deploy history |
+| `rollback_deploy` | `service` | Revert last deploy — SCENARIO-SPECIFIC TARGET |
+| `restart_service` | `service` | Reboot a service (usually after rollback) |
+| `run_check` | `check_name` (`database_recovery` / `end_to_end`) | Objective recovery check |
+| `isolate_service` | `service` | Containment only, does not resolve |
+| `escalate` | — | Record escalation note |
+| `submit_hypothesis` | `hypothesis` object | Commit RCA with confidence calibration |
+| `declare_resolved` | — | Finalize; rejected if required check has not passed |
+## Scoring rubric (deterministic from the env)
+- **Recovery (0–0.4):** services healthy on the critical path
+- **Containment (0–0.3):** root cause removed OR offending service isolated
+- **Verification (0–0.35):** both checks passed
+- **Impact (0–0.15):** user_impact reduced
+- **Efficiency (0–0.10):** budget preserved, no wasteful repeats
+Clean solve target: **> 0.85**. That's the runbook-record threshold.
+## Decoy knowledge (read before hypothesizing)
+- `worker_deploy_cascade`: the only true cause; no decoys.
+- `db_config_rollout`: the recent worker deploy is a **decoy**. Rolling back worker yields `wrong_remediation_target`.
+- `gateway_auth_rollout`: the recent worker deploy (`worker@...-hotfix` — log-format tweak) is a **decoy**. The gateway auth rollout is the cause.
+If you take a wrong remediation, the env returns `failure_type="wrong_remediation_target"` and a negative reward — **do not retry the same wrong target**, re-read the logs.
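
The component maxima in the scoring rubric above can be added up directly. The sketch below uses only the figures stated in `SKILL.md`. How `server/grader.py` normalizes or caps the total is not shown in this part of the diff, so the `min(..., 1.0)` clamp is an assumption made for illustration, and `total_score` is a hypothetical helper.

```python
# Component caps copied from the "Scoring rubric" list in SKILL.md.
RUBRIC_CAPS = {
    "recovery": 0.40,
    "containment": 0.30,
    "verification": 0.35,
    "impact": 0.15,
    "efficiency": 0.10,
}

RUNBOOK_THRESHOLD = 0.85  # SCORE_THRESHOLD in skill/tools/sre_gym_client.py


def total_score(components: dict[str, float]) -> float:
    """Sum per-component scores, capping each at its rubric maximum.

    Assumption: the final score is clamped to 1.0. The real grader may
    combine components differently.
    """
    capped = sum(
        min(max(components.get(name, 0.0), 0.0), cap)
        for name, cap in RUBRIC_CAPS.items()
    )
    return min(capped, 1.0)


if __name__ == "__main__":
    print(f"sum of caps = {sum(RUBRIC_CAPS.values()):.2f}")
    partial = {"recovery": 0.4, "containment": 0.3, "impact": 0.1}
    score = total_score(partial)
    print(f"partial solve = {score:.2f}, clears threshold: {score >= RUNBOOK_THRESHOLD}")
```

The listed caps add up to 1.30 rather than 1.00, so the rubric ranges and the 0.99 scores in the verified runbooks only agree if the grader caps or rescales somewhere; the rubric text may be worth checking against `grader.py`.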

skill/tools/sre_gym_client.py ADDED Viewed

	@@ -0,0 +1,238 @@

+#!/usr/bin/env python3
+"""CLI client for the sre-gym skill.
+Usage:
+    sre_gym_client.py list
+    sre_gym_client.py solve <scenario_id> [--policy baseline]
+    sre_gym_client.py interactive <scenario_id>   # stdin: one JSON action per line
+    sre_gym_client.py record-runbook <scenario_id> [session.json]
+Because OpenEnv's HTTP /reset and /step handlers create a fresh environment per
+call, episode state only persists within a single client session. This CLI wraps
+one episode inside one Python process so the session is preserved.
+SRE_GYM_URL env var overrides the base URL (default http://127.0.0.1:8000).
+"""
+from __future__ import annotations
+import datetime as _dt
+import json
+import os
+import sys
+from pathlib import Path
+from typing import Any
+# Make the sibling package importable whether the script is invoked from the
+# repo root or from the skill/ directory directly.
+_REPO_ROOT = Path(__file__).resolve().parent.parent.parent
+if str(_REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(_REPO_ROOT))
+from unified_incident_env.client import UnifiedIncidentEnv  # noqa: E402
+from unified_incident_env.models import UnifiedIncidentAction, UnifiedIncidentObservation  # noqa: E402
+from unified_incident_env.server.challenge import SCENARIOS, list_baselines  # noqa: E402
+BASE_URL = os.environ.get("SRE_GYM_URL", "http://127.0.0.1:8000").rstrip("/")
+RUNBOOK_DIR = Path(__file__).resolve().parent.parent / "verified-runbooks"
+SCORE_THRESHOLD = 0.85
+def _clean_action(action: UnifiedIncidentAction) -> dict[str, Any]:
+    data = action.model_dump(exclude_none=True)
+    if data.get("metadata") == {}:
+        data.pop("metadata")
+    hypothesis = data.get("hypothesis")
+    if isinstance(hypothesis, dict) and hypothesis.get("metadata") == {}:
+        hypothesis.pop("metadata", None)
+    return data
+def _summarize_obs(obs: UnifiedIncidentObservation) -> dict[str, Any]:
+    return {
+        "tick": obs.tick_count,
+        "workflow_stage": obs.workflow_stage,
+        "last_action_result": obs.last_action_result,
+        "tool_output": obs.tool_output,
+        "failure_type": obs.failure_type,
+        "why_failed": obs.why_failed,
+        "loop_warning": obs.loop_warning,
+        "checks": [{"name": c.name, "passed": c.passed} for c in obs.checks],
+        "final_score": obs.final_score,
+        "incident_resolved": obs.incident_resolved,
+    }
+def _session_path(scenario_id: str) -> Path:
+    return Path(f"/tmp/sre_gym_session.{scenario_id}.json")
+def cmd_list() -> None:
+    for scenario in SCENARIOS.values():
+        print(f"  {scenario['difficulty']:<6} {scenario['id']:<25} {scenario['name']}")
+def cmd_solve(scenario_id: str, policy: str = "baseline") -> None:
+    """Run an entire episode end-to-end inside one process."""
+    if scenario_id not in SCENARIOS:
+        print(f"error: unknown scenario {scenario_id!r}", file=sys.stderr)
+        sys.exit(2)
+    if policy != "baseline":
+        print(f"error: unknown policy {policy!r} (only 'baseline' available)", file=sys.stderr)
+        sys.exit(2)
+    trace: list[dict[str, Any]] = []
+    with UnifiedIncidentEnv(base_url=BASE_URL).sync() as env:
+        obs = env.reset(scenario_id=scenario_id).observation
+        print(f"[reset] scenario={scenario_id} difficulty={obs.difficulty}")
+        for step in list_baselines(scenario_id).baselines[0].actions:
+            result = env.step(step.action)
+            obs = result.observation
+            record = {
+                "step": obs.tick_count,
+                "action": _clean_action(step.action),
+                "rationale": step.rationale,
+                "reward": result.reward,
+                **_summarize_obs(obs),
+            }
+            trace.append(record)
+            action_repr = json.dumps(record["action"], separators=(",", ":"))
+            print(f"[step {obs.tick_count}] action={action_repr} reward={result.reward:+.2f} score={obs.final_score:.2f}")
+            if result.done:
+                break
+        final = _summarize_obs(obs)
+    _session_path(scenario_id).write_text(
+        json.dumps({"scenario_id": scenario_id, "trace": trace, "final": final}, indent=2),
+        encoding="utf-8",
+    )
+    print(
+        f"[done] resolved={final['incident_resolved']} score={final['final_score']:.2f} "
+        f"steps={final['tick']} session={_session_path(scenario_id)}"
+    )
+def cmd_interactive(scenario_id: str) -> None:
+    """One JSON action per stdin line. Preserves session for the whole process lifetime."""
+    if scenario_id not in SCENARIOS:
+        print(f"error: unknown scenario {scenario_id!r}", file=sys.stderr)
+        sys.exit(2)
+    trace: list[dict[str, Any]] = []
+    with UnifiedIncidentEnv(base_url=BASE_URL).sync() as env:
+        obs = env.reset(scenario_id=scenario_id).observation
+        print(json.dumps({"event": "reset", "scenario_id": scenario_id, "obs": _summarize_obs(obs)}), flush=True)
+        for line in sys.stdin:
+            line = line.strip()
+            if not line:
+                continue
+            try:
+                data = json.loads(line)
+                action = UnifiedIncidentAction(**data)
+            except Exception as exc:
+                print(json.dumps({"event": "error", "detail": str(exc)}), flush=True)
+                continue
+            result = env.step(action)
+            obs = result.observation
+            record = {"step": obs.tick_count, "action": _clean_action(action), "reward": result.reward, **_summarize_obs(obs)}
+            trace.append(record)
+            print(json.dumps({"event": "step", **record}), flush=True)
+            if result.done:
+                print(json.dumps({"event": "done", "final": _summarize_obs(obs)}), flush=True)
+                break
+    _session_path(scenario_id).write_text(
+        json.dumps({"scenario_id": scenario_id, "trace": trace, "final": _summarize_obs(obs)}, indent=2),
+        encoding="utf-8",
+    )
+def cmd_record_runbook(scenario_id: str, session_file: str | None = None) -> None:
+    """Append a new runbook entry if the referenced session cleared the threshold."""
+    path = Path(session_file) if session_file else _session_path(scenario_id)
+    if not path.exists():
+        print(f"error: no session file at {path}", file=sys.stderr)
+        sys.exit(2)
+    session = json.loads(path.read_text(encoding="utf-8"))
+    final = session.get("final", {})
+    score = float(final.get("final_score", 0.0))
+    if not final.get("incident_resolved"):
+        print(f"skip: session not resolved (resolved={final.get('incident_resolved')})", file=sys.stderr)
+        sys.exit(1)
+    if score < SCORE_THRESHOLD:
+        print(f"skip: score {score:.2f} below runbook threshold {SCORE_THRESHOLD:.2f}", file=sys.stderr)
+        sys.exit(1)
+    RUNBOOK_DIR.mkdir(parents=True, exist_ok=True)
+    runbook_path = RUNBOOK_DIR / f"{scenario_id}.md"
+    timestamp = _dt.datetime.now(_dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+    steps = int(final.get("tick", 0))
+    checks_passed = [c["name"] for c in final.get("checks", []) if c.get("passed")]
+    trace = session.get("trace", [])
+    header = (
+        f"# verified-runbooks/{scenario_id}.md\n\n"
+        "Runbook entries are written by the sre-gym skill after a successful solve "
+        f"(incident_resolved=true and final_score > {SCORE_THRESHOLD:.2f}).\n"
+        "Each entry is immutable evidence — treat it as ground truth for the winning path.\n\n---\n"
+    )
+    lines = [f"\n## Run {timestamp} — Score {score:.2f}\n"]
+    lines.append(f"- Steps: **{steps}**")
+    lines.append(f"- Checks passed: {', '.join(checks_passed) or 'none'}")
+    lines.append("")
+    lines.append("**Winning path:**")
+    for entry in trace:
+        act = entry["action"]
+        action_type = act.get("action_type")
+        extras = ", ".join(
+            f"{k}={v if not isinstance(v, dict) else v.get('root_cause', v)}"
+            for k, v in act.items()
+            if k != "action_type" and v not in (None, {})
+        )
+        extra_str = f" ({extras})" if extras else ""
+        rationale = entry.get("rationale", "").rstrip(".")
+        lines.append(f"{entry['step']}. `{action_type}{extra_str}` — {rationale}")
+    lines.append("")
+    entry_text = "\n".join(lines)
+    if not runbook_path.exists():
+        runbook_path.write_text(header + entry_text, encoding="utf-8")
+    else:
+        with runbook_path.open("a", encoding="utf-8") as f:
+            f.write(entry_text)
+    print(f"recorded runbook entry → {runbook_path} (score {score:.2f}, {steps} steps)")
+def main() -> None:
+    argv = sys.argv[1:]
+    if not argv:
+        print(__doc__, file=sys.stderr)
+        sys.exit(2)
+    cmd, *rest = argv
+    if cmd == "list":
+        cmd_list()
+    elif cmd == "solve":
+        if not rest:
+            print("error: solve requires <scenario_id>", file=sys.stderr)
+            sys.exit(2)
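+        # Accept `solve <id> baseline` as well as the documented `solve <id> --policy baseline`.
+        policy_args = [arg for arg in rest[1:] if arg != "--policy"]
+        cmd_solve(rest[0], policy_args[0] if policy_args else "baseline")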
+    elif cmd == "interactive":
+        if not rest:
+            print("error: interactive requires <scenario_id>", file=sys.stderr)
+            sys.exit(2)
+        cmd_interactive(rest[0])
+    elif cmd == "record-runbook":
+        if not rest:
+            print("error: record-runbook requires <scenario_id>", file=sys.stderr)
+            sys.exit(2)
+        cmd_record_runbook(rest[0], rest[1] if len(rest) > 1 else None)
+    else:
+        print(f"error: unknown command {cmd!r}", file=sys.stderr)
+        print(__doc__, file=sys.stderr)
+        sys.exit(2)
+if __name__ == "__main__":
+    main()
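
`cmd_interactive` reads one JSON action per stdin line, so a whole action sequence can be prepared as newline-delimited JSON before the episode starts. The sketch below builds such a payload. Field names follow `UnifiedIncidentAction` and `HypothesisPayload` in `models.py`; the helper `to_json_lines` and the concrete hypothesis values are assumptions for this example.

```python
import json
from typing import Any


def to_json_lines(actions: list[dict[str, Any]]) -> str:
    """Serialize actions as newline-delimited JSON, one compact object per line."""
    return "".join(json.dumps(action, separators=(",", ":")) + "\n" for action in actions)


# Illustrative sequence; the values are assumptions, not a recorded solve.
ACTIONS: list[dict[str, Any]] = [
    {"action_type": "query_logs", "service": "database"},
    {"action_type": "query_metrics", "service": "database", "metric": "error_rate"},
    {
        "action_type": "submit_hypothesis",
        "hypothesis": {
            "root_cause": "database_only_failure",
            "affected_services": ["database"],
            "confidence": 0.7,
            "recommended_next_action": "rollback_deploy",
        },
    },
    {"action_type": "rollback_deploy", "service": "database"},
]

if __name__ == "__main__":
    print(to_json_lines(ACTIONS), end="")
```

Saved to a file (say a hypothetical `actions.jsonl`), the output could be redirected into `sre_gym_client.py interactive db_config_rollout`; each line is parsed on its own, so an invalid line produces an `error` event and the loop continues.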

skill/verified-runbooks/.gitkeep ADDED Viewed

File without changes

skill/verified-runbooks/db_config_rollout.md ADDED Viewed

	@@ -0,0 +1,23 @@

+# verified-runbooks/db_config_rollout.md
+Runbook entries are written by the sre-gym skill after a successful solve (incident_resolved=true and final_score > 0.85).
+Each entry is immutable evidence — treat it as ground truth for the winning path.
+---
+## Run 2026-04-23T22:01:33Z — Score 0.99
+- Steps: **10**
+- Checks passed: database_recovery, end_to_end
+**Winning path:**
+1. `query_logs (service=database)` — Database is the loudest alert; inspect logs for the actual error signature
+2. `query_deploys (service=database)` — Pool-acquire errors suggest a config change; check recent database rollouts
+3. `query_metrics (service=database, metric=error_rate)` — Confirm the error pattern is pool exhaustion rather than compute overload
+4. `query_logs (service=worker)` — Rule out the decoy worker deploy by reading worker logs directly
+5. `submit_hypothesis (hypothesis=database_only_failure)` — Localize the fault to the database config before remediating
+6. `rollback_deploy (service=database)` — Roll back the offending database config rollout
+7. `restart_service (service=database)` — Restart the database cleanly against the restored pool config
+8. `run_check (check_name=database_recovery)` — Verify database pool health and write latency are back within SLO
+9. `run_check (check_name=end_to_end)` — Verify gateway write-path traffic succeeds end-to-end
+10. `declare_resolved` — Declare resolved only after objective checks pass
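
The numbered "Winning path" entries follow the fixed shape that `cmd_record_runbook` writes, so they can be read back mechanically. The sketch below shows one way to do that; `parse_winning_path` is a hypothetical helper and is not part of the skill tooling. The regular expression tolerates the leading `+` that appears in this diff view.

```python
import re
from typing import Any

# Matches entries such as: 3. `query_metrics (service=database, metric=error_rate)` ...
_ENTRY = re.compile(r"^\+?\s*(\d+)\.\s+`(\w+)(?:\s+\(([^)]*)\))?`")


def parse_winning_path(markdown: str) -> list[dict[str, Any]]:
    """Extract step number, action type, and key=value fields from runbook lines."""
    steps = []
    for line in markdown.splitlines():
        match = _ENTRY.match(line)
        if not match:
            continue  # headers, bullet lines, and prose are skipped
        number, action_type, raw_fields = match.groups()
        fields = {}
        for pair in (raw_fields or "").split(","):
            if "=" in pair:
                key, value = pair.split("=", 1)
                fields[key.strip()] = value.strip()
        steps.append({"step": int(number), "action_type": action_type, **fields})
    return steps


SAMPLE = "\n".join([
    "**Winning path:**",
    "1. `query_logs (service=database)` \u2014 Database is the loudest alert",
    "3. `query_metrics (service=database, metric=error_rate)` \u2014 Confirm the error pattern",
    "10. `declare_resolved` \u2014 Declare resolved only after objective checks pass",
])

if __name__ == "__main__":
    for step in parse_winning_path(SAMPLE):
        print(step)
```

One caveat: the runbook writer flattens a hypothesis to its `root_cause` name, so a parsed `submit_hypothesis` step lacks `affected_services`, `confidence`, and `recommended_next_action` and cannot be replayed as-is.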

skill/verified-runbooks/gateway_auth_rollout.md ADDED Viewed

	@@ -0,0 +1,21 @@

+# verified-runbooks/gateway_auth_rollout.md
+Runbook entries are written by the sre-gym skill after a successful solve (incident_resolved=true and final_score > 0.85).
+Each entry is immutable evidence — treat it as ground truth for the winning path.
+---
+## Run 2026-04-23T22:01:37Z — Score 0.99
+- Steps: **8**
+- Checks passed: database_recovery, end_to_end
+**Winning path:**
+1. `query_logs (service=api-gateway)` — Gateway is rejecting logins; read gateway logs to localize the rejection class
+2. `query_deploys (service=api-gateway)` — Login rejection aligns with a recent auth middleware rollout; confirm deploy timing
+3. `query_deploys (service=worker)` — Rule out the worker deploy explicitly rather than assuming
+4. `submit_hypothesis (hypothesis=api_gateway_fault)` — Commit a calibrated hypothesis localizing to the gateway auth rollout
+5. `rollback_deploy (service=api-gateway)` — Roll back the bad auth middleware rollout; no restart needed
+6. `run_check (check_name=end_to_end)` — Verify that gateway login traffic now succeeds end-to-end
+7. `run_check (check_name=database_recovery)` — Confirm the database is (and stayed) healthy throughout
+8. `declare_resolved` — Declare resolved only after objective checks pass

skill/verified-runbooks/worker_deploy_cascade.md ADDED Viewed

	@@ -0,0 +1,23 @@

+# verified-runbooks/worker_deploy_cascade.md
+Runbook entries are written by the sre-gym skill after a successful solve (incident_resolved=true and final_score > 0.85).
+Each entry is immutable evidence — treat it as ground truth for the winning path.
+---
+## Run 2026-04-23T22:01:29Z — Score 0.99
+- Steps: **10**
+- Checks passed: database_recovery, end_to_end
+**Winning path:**
+1. `query_deploys (service=worker)` — Check whether any recent deploy aligns with the incident start
+2. `query_logs (service=worker)` — Inspect worker logs because deploy timing and queue pressure suggest worker-originated harm
+3. `query_metrics (service=database, metric=cpu)` — Confirm that the database is overloaded as a downstream effect
+4. `query_dependencies (service=api-gateway)` — Verify the gateway depends on the worker and database path
+5. `submit_hypothesis (hypothesis=bad_worker_deploy)` — Commit a calibrated hypothesis before taking an invasive mitigation step
+6. `rollback_deploy (service=worker)` — Remove the triggering change before restarting downstream services
+7. `restart_service (service=database)` — Bring the database back cleanly after the root cause is removed
+8. `run_check (check_name=database_recovery)` — Verify the database is no longer crashing
+9. `run_check (check_name=end_to_end)` — Verify gateway traffic succeeds end-to-end
+10. `declare_resolved` — Declare resolved only after objective checks pass
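
The ordering rules from `SKILL.md` (evidence before remediation, rollback before restart, a check before `declare_resolved`) can be checked against a recorded path like the one above. The sketch below looks only at the order of action types; it does not know whether a check passed or whether the rollback target was correct, which only the environment can decide. `ordering_violations` is a hypothetical helper.

```python
EVIDENCE = {"query_logs", "query_metrics", "query_dependencies", "query_deploys"}
REMEDIATION = {"rollback_deploy", "restart_service"}


def ordering_violations(action_types: list[str]) -> list[str]:
    """Return human-readable violations of the SKILL.md ordering rules."""
    problems = []
    seen: set[str] = set()
    for index, action in enumerate(action_types, start=1):
        if action in REMEDIATION and not (seen & EVIDENCE):
            problems.append(f"step {index}: {action} before any evidence query")
        if action == "restart_service" and "rollback_deploy" not in seen:
            problems.append(f"step {index}: restart_service before rollback_deploy")
        if action == "declare_resolved" and "run_check" not in seen:
            problems.append(f"step {index}: declare_resolved before any run_check")
        seen.add(action)
    return problems


# Action types from the worker_deploy_cascade winning path.
WORKER_PATH = [
    "query_deploys", "query_logs", "query_metrics", "query_dependencies",
    "submit_hypothesis", "rollback_deploy", "restart_service",
    "run_check", "run_check", "declare_resolved",
]

if __name__ == "__main__":
    print(ordering_violations(WORKER_PATH))
    print(ordering_violations(["restart_service", "declare_resolved"]))
```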

unified_incident_env/README.md ADDED Viewed

	@@ -0,0 +1,10 @@

+# Unified Incident Env
+The runnable submission surface lives at the project root. This package contains the actual environment implementation:
+- typed models in `models.py`
+- environment logic in `server/environment.py`
+- scoring in `server/grader.py`
+- scenario catalog in `server/challenge.py`
+Use the root `README.md` for run commands, scoring, and example interaction.

unified_incident_env/__init__.py ADDED Viewed

	@@ -0,0 +1,17 @@

+"""Unified incident-response OpenEnv package."""
+from .interface import (
+    UnifiedIncidentAction,
+    UnifiedIncidentEnv,
+    UnifiedIncidentEnvironment,
+    UnifiedIncidentObservation,
+    UnifiedIncidentState,
+)
+__all__ = [
+    "UnifiedIncidentAction",
+    "UnifiedIncidentEnv",
+    "UnifiedIncidentEnvironment",
+    "UnifiedIncidentObservation",
+    "UnifiedIncidentState",
+]

unified_incident_env/client.py ADDED Viewed

	@@ -0,0 +1,35 @@

+"""Typed OpenEnv client for the unified incident environment."""
+from __future__ import annotations
+from typing import Any
+from openenv.core import EnvClient
+from openenv.core.client_types import StepResult
+from .models import UnifiedIncidentAction, UnifiedIncidentObservation, UnifiedIncidentState
+class UnifiedIncidentEnv(
+    EnvClient[UnifiedIncidentAction, UnifiedIncidentObservation, UnifiedIncidentState]
+):
+    """Typed client wrapper around the OpenEnv HTTP API."""
+    DEFAULT_BASE_URL = "http://127.0.0.1:8000"
+    def _step_payload(self, action: UnifiedIncidentAction) -> dict[str, Any]:
+        return action.model_dump(exclude_none=True)
+    def _parse_result(self, payload: dict[str, Any]) -> StepResult[UnifiedIncidentObservation]:
+        observation_data = dict(payload.get("observation", {}))
+        observation_data.setdefault("reward", payload.get("reward", 0.0))
+        observation_data.setdefault("done", payload.get("done", False))
+        observation = UnifiedIncidentObservation.model_validate(observation_data)
+        return StepResult(
+            observation=observation,
+            reward=payload.get("reward", observation.reward),
+            done=payload.get("done", observation.done),
+        )
+    def _parse_state(self, payload: dict[str, Any]) -> UnifiedIncidentState:
+        return UnifiedIncidentState.model_validate(payload)
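
The fallback logic in `_parse_result` is easier to follow on plain dictionaries. The sketch below reproduces only the `setdefault` step, without `openenv` or pydantic; `merge_envelope` is a hypothetical stand-in, and the real method goes on to validate the result with `UnifiedIncidentObservation`.

```python
from typing import Any


def merge_envelope(payload: dict[str, Any]) -> dict[str, Any]:
    """Copy envelope-level reward/done into the observation when it lacks them."""
    # Shallow copy, so the caller's payload is left untouched.
    observation = dict(payload.get("observation", {}))
    observation.setdefault("reward", payload.get("reward", 0.0))
    observation.setdefault("done", payload.get("done", False))
    return observation


if __name__ == "__main__":
    # Observation without reward/done inherits both from the envelope.
    print(merge_envelope({"observation": {"tick_count": 1}, "reward": -0.1, "done": False}))
    # Observation that already carries a reward keeps its own value.
    print(merge_envelope({"observation": {"reward": 0.5}, "reward": 0.2}))
```

In the second case the observation keeps `0.5`, while `StepResult.reward` in `client.py` is taken from the envelope first, so the two values can differ if the server ever sends both.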

unified_incident_env/interface.py ADDED Viewed

	@@ -0,0 +1,17 @@

+"""Single public interface surface for the unified incident benchmark."""
+from .client import UnifiedIncidentEnv
+from .models import (
+    UnifiedIncidentAction,
+    UnifiedIncidentObservation,
+    UnifiedIncidentState,
+)
+from .server.environment import UnifiedIncidentEnvironment
+__all__ = [
+    "UnifiedIncidentAction",
+    "UnifiedIncidentEnv",
+    "UnifiedIncidentEnvironment",
+    "UnifiedIncidentObservation",
+    "UnifiedIncidentState",
+]

unified_incident_env/models.py ADDED Viewed

	@@ -0,0 +1,332 @@

+"""Typed models for the honest narrow incident-remediation environment."""
+from __future__ import annotations
+from typing import Any, Literal
+from openenv.core import Action, Observation, State
+from pydantic import BaseModel, ConfigDict, Field, model_validator
+from pydantic_core import PydanticCustomError
+ActionType = Literal[
+    "query_logs",
+    "query_metrics",
+    "query_dependencies",
+    "query_deploys",
+    "rollback_deploy",
+    "restart_service",
+    "run_check",
+    "isolate_service",
+    "escalate",
+    "submit_hypothesis",
+    "declare_resolved",
+]
+Difficulty = Literal["easy", "medium", "hard"]
+MetricName = Literal["cpu", "error_rate", "latency"]
+ServiceName = Literal["api-gateway", "cache", "database", "worker"]
+ServiceStatus = Literal["healthy", "degraded", "crashed", "isolated"]
+WorkflowStage = Literal["triage", "mitigation", "validation", "resolved"]
+CheckName = Literal["database_recovery", "end_to_end"]
+RootCauseType = Literal[
+    "bad_worker_deploy",
+    "database_only_failure",
+    "api_gateway_fault",
+]
+RecommendedActionType = Literal[
+    "query_logs",
+    "query_metrics",
+    "query_dependencies",
+    "query_deploys",
+    "rollback_deploy",
+    "restart_service",
+    "run_check",
+    "isolate_service",
+    "escalate",
+    "declare_resolved",
+]
+class PostmortemPayload(BaseModel):
+    """Deprecated compatibility shell for the removed v1 postmortem action."""
+    model_config = ConfigDict(extra="forbid")
+    root_cause: str = ""
+    attack_vector: str = ""
+    timeline: list[str] = Field(default_factory=list)
+    remediation_steps: list[str] = Field(default_factory=list)
+    prevention_steps: list[str] = Field(default_factory=list)
+class SecurityContext(BaseModel):
+    """Deprecated compatibility shell for the removed v1 security subquest state."""
+    model_config = ConfigDict(extra="forbid")
+    code_visible: bool = False
+    selected_vulnerability: str | None = None
+    selected_patch: str | None = None
+    exploit_blocked: bool | None = None
+    functionality_preserved: bool | None = None
+class HypothesisPayload(BaseModel):
+    """Structured hypothesis submitted by the agent."""
+    model_config = ConfigDict(extra="forbid")
+    root_cause: RootCauseType
+    affected_services: list[ServiceName] = Field(default_factory=list, min_length=1)
+    confidence: float = Field(ge=0.0, le=1.0)
+    recommended_next_action: RecommendedActionType
+class ServiceHealth(BaseModel):
+    """Health snapshot for a service."""
+    model_config = ConfigDict(extra="forbid")
+    name: ServiceName
+    status: ServiceStatus
+    cpu_pct: float = Field(ge=0.0, le=100.0)
+    memory_pct: float = Field(ge=0.0, le=100.0)
+    error_rate_pct: float = Field(ge=0.0, le=100.0)
+    latency_ms: float = Field(ge=0.0)
+class Alert(BaseModel):
+    """Alert exposed to the agent."""
+    model_config = ConfigDict(extra="forbid")
+    service: ServiceName
+    severity: Literal["warning", "critical"]
+    message: str
+class CheckResult(BaseModel):
+    """Result of a verification check."""
+    model_config = ConfigDict(extra="forbid")
+    name: CheckName
+    passed: bool
+    detail: str
+class UnifiedIncidentAction(Action):
+    """One structured environment action."""
+    model_config = ConfigDict(extra="ignore")
+    action_type: ActionType
+    service: ServiceName | None = None
+    metric: MetricName | None = None
+    check_name: CheckName | None = None
+    hypothesis: HypothesisPayload | None = None
+    @model_validator(mode="after")
+    def _validate_payload(self) -> "UnifiedIncidentAction":
+        if self.action_type in {
+            "query_logs",
+            "query_dependencies",
+            "query_deploys",
+            "rollback_deploy",
+            "restart_service",
+            "isolate_service",
+        } and not self.service:
+            raise PydanticCustomError(
+                "missing_service",
+                "service is required for {action_type}",
+                {"action_type": self.action_type},
+            )
+        if self.action_type == "query_metrics":
+            if not self.service:
+                raise PydanticCustomError(
+                    "missing_service",
+                    "service is required for {action_type}",
+                    {"action_type": self.action_type},
+                )
+            if not self.metric:
+                raise PydanticCustomError(
+                    "missing_metric",
+                    "metric is required for {action_type}",
+                    {"action_type": self.action_type},
+                )
+        if self.action_type == "run_check" and not self.check_name:
+            raise PydanticCustomError(
+                "missing_check_name",
+                "check_name is required for {action_type}",
+                {"action_type": self.action_type},
+            )
+        if self.action_type == "submit_hypothesis" and self.hypothesis is None:
+            raise PydanticCustomError(
+                "missing_hypothesis",
+                "hypothesis is required for {action_type}",
+                {"action_type": self.action_type},
+            )
+        return self
+class UnifiedIncidentObservation(Observation):
+    """Observation returned after reset and each step."""
+    model_config = ConfigDict(extra="forbid")
+    prompt_text: str
+    incident_summary: str
+    tick_count: int
+    max_ticks: int
+    difficulty: Difficulty
+    workflow_stage: WorkflowStage
+    active_alerts: list[Alert] = Field(default_factory=list)
+    service_health: dict[str, ServiceHealth] = Field(default_factory=dict)
+    discovered_evidence: list[str] = Field(default_factory=list)
+    recent_deploys: list[str] = Field(default_factory=list)
+    checks: list[CheckResult] = Field(default_factory=list)
+    user_impact: float = Field(ge=0.0, le=1.0)
+    slo_burn_rate: float = Field(ge=0.0, le=1.0)
+    incident_resolved: bool = False
+    containment_applied: bool = False
+    last_action_result: str = ""
+    tool_output: str | None = None
+    failure_type: str | None = None
+    why_failed: str | None = None
+    allowed_actions: list[str] = Field(default_factory=list)
+    required_fields_by_action: dict[str, list[str]] = Field(default_factory=dict)
+    valid_action_example: dict[str, Any] | None = None
+    common_trap: str | None = None
+    loop_warning: str | None = None
+    blocked_until_security_complete: bool = False
+    security_unlock_reason: str | None = None
+    best_recovery_action_family: str | None = None
+    progress_flags: dict[str, bool] = Field(default_factory=dict)
+    security_subquest_status: str | None = None
+    security_context: dict[str, Any] = Field(default_factory=dict)
+    final_score: float = 0.0
+    score_breakdown: dict[str, float] = Field(default_factory=dict)
+    reward: float = 0.0
+    done: bool = False
+class UnifiedIncidentState(State):
+    """Persistent episode state."""
+    model_config = ConfigDict(extra="forbid")
+    episode_id: str
+    step_count: int
+    scenario_id: str
+    difficulty: Difficulty
+    current_tick: int
+    max_ticks: int
+    workflow_stage: WorkflowStage
+    active_alerts: list[Alert] = Field(default_factory=list)
+    service_health: dict[str, ServiceHealth] = Field(default_factory=dict)
+    discovered_evidence: list[str] = Field(default_factory=list)
+    recent_deploys: list[str] = Field(default_factory=list)
+    checks: list[CheckResult] = Field(default_factory=list)
+    user_impact: float = Field(ge=0.0, le=1.0)
+    slo_burn_rate: float = Field(ge=0.0, le=1.0)
+    incident_resolved: bool = False
+    containment_applied: bool = False
+    allowed_actions: list[str] = Field(default_factory=list)
+    required_fields_by_action: dict[str, list[str]] = Field(default_factory=dict)
+    valid_action_example: dict[str, Any] | None = None
+    progress_flags: dict[str, bool] = Field(default_factory=dict)
+    final_score: float = 0.0
+    score_breakdown: dict[str, float] = Field(default_factory=dict)
+    cumulative_reward: float = 0.0
+    wasteful_ticks: int = 0
+    last_action_result: str = ""
+    failure_type: str | None = None
+    why_failed: str | None = None
+class ScenarioSummary(BaseModel):
+    """Public scenario summary."""
+    model_config = ConfigDict(extra="forbid")
+    id: str
+    difficulty: Difficulty
+    name: str
+    description: str
+    root_cause: str
+    optimal_ticks: int
+class ScenarioCatalog(BaseModel):
+    """Public scenario catalog."""
+    model_config = ConfigDict(extra="forbid")
+    environment: str = "unified_incident_env"
+    default_scenario_id: str
+    available_difficulties: list[Difficulty]
+    filtered_difficulty: Difficulty | None = None
+    scenarios: list[ScenarioSummary]
+class BaselineStep(BaseModel):
+    """One baseline action."""
+    model_config = ConfigDict(extra="forbid")
+    action: UnifiedIncidentAction
+    rationale: str = ""
+class BaselineDefinition(BaseModel):
+    """One baseline trajectory."""
+    model_config = ConfigDict(extra="forbid")
+    scenario_id: str
+    name: str
+    description: str
+    optimal_ticks: int
+    actions: list[BaselineStep] = Field(default_factory=list)
+class BaselineCatalog(BaseModel):
+    """Public baseline catalog."""
+    model_config = ConfigDict(extra="forbid")
+    environment: str = "unified_incident_env"
+    baselines: list[BaselineDefinition]
+class GraderCheck(BaseModel):
+    """One normalized grader check."""
+    model_config = ConfigDict(extra="forbid")
+    name: str
+    passed: bool
+    detail: str
+    weight: float
+class GraderReport(BaseModel):
+    """Episode-grade report."""
+    model_config = ConfigDict(extra="forbid")
+    scenario_id: str
+    passed: bool
+    score: float = Field(ge=0.0, le=1.0)
+    message: str
+    breakdown: dict[str, float] = Field(default_factory=dict)
+    checks: list[GraderCheck] = Field(default_factory=list)
+class RuntimeStatus(BaseModel):
+    """Runtime status route payload."""
+    model_config = ConfigDict(extra="forbid")
+    environment: str = "unified_incident_env"
+    progress: UnifiedIncidentState
+    grader: GraderReport
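
The required-field rules enforced by `_validate_payload` can be read as a lookup table. The sketch below restates them in plain Python with no Pydantic dependency. `REQUIRED_FIELDS`, `SERVICE_ACTIONS`, and `missing_fields` are hypothetical names used for illustration and are not part of this commit. Unlike the validator, which raises on the first missing field, this helper reports every missing field at once.

```python
from __future__ import annotations

# Action types that need only a target service, transcribed from _validate_payload.
SERVICE_ACTIONS = (
    "query_logs",
    "query_dependencies",
    "query_deploys",
    "rollback_deploy",
    "restart_service",
    "isolate_service",
)

# Hypothetical lookup table: action type -> fields the validator insists on.
# Action types not listed are treated as having no required fields in this sketch.
REQUIRED_FIELDS: dict[str, tuple[str, ...]] = {
    **{name: ("service",) for name in SERVICE_ACTIONS},
    "query_metrics": ("service", "metric"),
    "run_check": ("check_name",),
    "submit_hypothesis": ("hypothesis",),
}


def missing_fields(action: dict[str, object]) -> list[str]:
    """Return the required fields that the action payload does not supply."""
    required = REQUIRED_FIELDS.get(str(action.get("action_type")), ())
    missing = []
    for field in required:
        value = action.get(field)
        # The validator tests string fields for truthiness but tests
        # `hypothesis` against None, so an empty hypothesis object passes.
        absent = value is None if field == "hypothesis" else not value
        if absent:
            missing.append(field)
    return missing


print(missing_fields({"action_type": "query_metrics", "service": "worker"}))
```

A table like this is also one plausible way to populate the `required_fields_by_action` field on the observation, although the commit does not show how that field is actually filled.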

unified_incident_env/scripts/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Scripts for the unified incident environment."""

unified_incident_env/scripts/baseline_agent.py ADDED

	@@ -0,0 +1,43 @@

+"""Deterministic scripted baseline for the honest narrow incident environment."""
+from __future__ import annotations
+import argparse
+import json
+from ..client import UnifiedIncidentEnv
+from ..server.challenge import DEFAULT_SCENARIO_ID, SCENARIOS, list_baselines
+def plan_for_scenario(scenario_id: str):
+    """Return the actions of the first baseline registered for ``scenario_id``."""
+    catalog = list_baselines(scenario_id)
+    return [step.action for step in catalog.baselines[0].actions]
+def run_scenario(base_url: str, scenario_id: str) -> dict[str, object]:
+    with UnifiedIncidentEnv(base_url=base_url).sync() as env:
+        env.reset(scenario_id=scenario_id)
+        final = None
+        for action in plan_for_scenario(scenario_id):
+            final = env.step(action).observation
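+        # Raise explicitly: a bare assert is stripped when Python runs with -O.
+        if final is None:
+            raise RuntimeError(f"baseline plan for {scenario_id!r} contains no actions")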
+        return {
+            "scenario_id": scenario_id,
+            "success": bool(final.done and final.incident_resolved),
+            "final_score": final.final_score,
+            "workflow_stage": final.workflow_stage,
+        }
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--base-url", default=UnifiedIncidentEnv.DEFAULT_BASE_URL)
+    parser.add_argument("--scenario", choices=sorted(SCENARIOS), default=DEFAULT_SCENARIO_ID)
+    args = parser.parse_args()
+    results = [run_scenario(args.base_url, args.scenario)]
+    print(json.dumps(results, indent=2))
+if __name__ == "__main__":
+    main()
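
The replay contract in `run_scenario` (reset, step through a fixed plan, judge success from the last observation) can be exercised without a running server. The sketch below does that against an in-memory stand-in. `FakeObservation`, `FakeEnv`, and `replay` are hypothetical names, and the rule that the fake episode resolves on `declare_resolved` is an assumption made only so the example runs. The real environment requires objective checks to pass first.

```python
from dataclasses import dataclass


@dataclass
class FakeObservation:
    """Stand-in for UnifiedIncidentObservation with only the fields used below."""

    done: bool = False
    incident_resolved: bool = False
    final_score: float = 0.0


class FakeEnv:
    """Hypothetical in-memory environment that resolves on 'declare_resolved'."""

    def reset(self) -> FakeObservation:
        return FakeObservation()

    def step(self, action: str) -> FakeObservation:
        resolved = action == "declare_resolved"
        return FakeObservation(
            done=resolved,
            incident_resolved=resolved,
            final_score=1.0 if resolved else 0.0,
        )


def replay(env: FakeEnv, plan: list[str]) -> dict[str, object]:
    """Replay a fixed plan and summarize the last observation seen."""
    final = env.reset()
    for action in plan:
        final = env.step(action)
    return {
        # Same success rule as run_scenario: the episode must be both
        # finished and marked resolved.
        "success": bool(final.done and final.incident_resolved),
        "final_score": final.final_score,
    }


print(replay(FakeEnv(), ["query_deploys", "rollback_deploy", "declare_resolved"]))
```

One design difference: `run_scenario` refuses to summarize an empty plan, whereas this sketch falls back to the reset observation and reports an unsuccessful run.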

unified_incident_env/scripts/walkthrough.py ADDED

	@@ -0,0 +1,41 @@

+"""Simple walkthrough that prints a full episode interaction."""
+from __future__ import annotations
+import argparse
+import json
+from ..client import UnifiedIncidentEnv
+from .baseline_agent import plan_for_scenario
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--base-url",
+        default=UnifiedIncidentEnv.DEFAULT_BASE_URL,
+    )
+    parser.add_argument(
+        "--scenario",
+        default="worker_deploy_cascade",
+    )
+    args = parser.parse_args()
+    with UnifiedIncidentEnv(base_url=args.base_url).sync() as env:
+        reset = env.reset(scenario_id=args.scenario).observation
+        print(json.dumps({"reset": reset.model_dump()}, indent=2))
+        for action in plan_for_scenario(args.scenario):
+            step = env.step(action).observation
+            print(
+                json.dumps(
+                    {
+                        "action": action.model_dump(exclude_none=True),
+                        "observation": step.model_dump(),
+                    },
+                    indent=2,
+                )
+            )
+if __name__ == "__main__":
+    main()

unified_incident_env/server/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Server package for the unified incident environment."""

unified_incident_env/server/app.py ADDED

	@@ -0,0 +1,148 @@

+"""FastAPI app and metadata routes for the honest narrow incident environment."""
+from __future__ import annotations
+import argparse
+import os
+from typing import Any
+from fastapi import HTTPException
+from fastapi.responses import HTMLResponse, RedirectResponse
+from openenv.core.env_server.http_server import create_fastapi_app
+from ..models import (
+    BaselineCatalog,
+    GraderReport,
+    RuntimeStatus,
+    ScenarioCatalog,
+    UnifiedIncidentAction,
+    UnifiedIncidentObservation,
+    UnifiedIncidentState,
+)
+from .challenge import current_runtime_progress, grade_episode, list_baselines, list_scenarios, set_runtime_progress
+from .environment import UnifiedIncidentEnvironment
+_BOOTSTRAP_ENV = UnifiedIncidentEnvironment()
+set_runtime_progress(_BOOTSTRAP_ENV.state.model_dump())
+_SIMPLE_HTML = """<!doctype html>
+<html lang="en">
+<head>
+  <meta charset="utf-8" />
+  <meta name="viewport" content="width=device-width, initial-scale=1" />
+  <title>Unified Incident Env</title>
+  <style>
+    body { font-family: system-ui, sans-serif; max-width: 900px; margin: 40px auto; padding: 0 20px; line-height: 1.5; }
+    code, pre { background: #f4f4f4; padding: 2px 6px; border-radius: 6px; }
+    pre { padding: 12px; overflow: auto; }
+  </style>
+</head>
+<body>
+  <h1>Unified Incident Env</h1>
+  <p>This v2 environment exposes an honest bounded-action incident diagnosis and remediation task.</p>
+  <ul>
+    <li><a href="/docs">API docs</a></li>
+    <li><a href="/tasks">Scenario catalog</a></li>
+    <li><a href="/baseline">Baseline plan</a></li>
+    <li><a href="/status">Runtime status</a></li>
+    <li><a href="/health">Health</a></li>
+  </ul>
+  <h2>Core ideas</h2>
+  <ul>
+    <li>Queries reveal evidence but do not directly mint positive reward.</li>
+    <li>Remediation actions change the world state.</li>
+    <li><code>run_check</code> verifies recovery explicitly.</li>
+    <li><code>declare_resolved</code> succeeds only after objective checks pass.</li>
+  </ul>
+  <h2>Manual example</h2>
+  <pre>curl -X POST http://127.0.0.1:8000/reset -H 'content-type: application/json' -d '{}'
+curl -X POST http://127.0.0.1:8000/step -H 'content-type: application/json' -d '{"action_type":"query_deploys","service":"worker"}'</pre>
+</body>
+</html>
+"""
+def create_compatible_app():
+    def env_factory() -> UnifiedIncidentEnvironment:
+        return UnifiedIncidentEnvironment()
+    app = create_fastapi_app(
+        env_factory,
+        UnifiedIncidentAction,
+        UnifiedIncidentObservation,
+        max_concurrent_envs=1,
+    )
+    @app.get("/", include_in_schema=False)
+    async def web_root():
+        return RedirectResponse(url="/simple")
+    @app.get("/simple", include_in_schema=False)
+    async def simple_console():
+        return HTMLResponse(_SIMPLE_HTML)
+    _attach_metadata_routes(app)
+    return app
+def _attach_metadata_routes(app):
+    @app.get("/tasks", response_model=ScenarioCatalog, tags=["challenge"])
+    def tasks(difficulty: str | None = None) -> ScenarioCatalog:
+        try:
+            return list_scenarios(difficulty=difficulty)
+        except ValueError as exc:
+            raise HTTPException(status_code=404, detail=str(exc)) from exc
+    @app.get("/baseline", response_model=BaselineCatalog, tags=["challenge"])
+    def baseline(scenario_id: str | None = None) -> BaselineCatalog:
+        try:
+            return list_baselines(scenario_id=scenario_id)
+        except ValueError as exc:
+            raise HTTPException(status_code=404, detail=str(exc)) from exc
+    @app.get("/grader", response_model=GraderReport, tags=["challenge"])
+    def grader(scenario_id: str | None = None) -> GraderReport:
+        progress = current_runtime_progress()
+        if scenario_id is not None:
+            progress["scenario_id"] = scenario_id
+        try:
+            return grade_episode(progress)
+        except ValueError as exc:
+            raise HTTPException(status_code=404, detail=str(exc)) from exc
+    @app.get("/status", response_model=RuntimeStatus, tags=["challenge"])
+    def status() -> RuntimeStatus:
+        progress = current_runtime_progress()
+        return RuntimeStatus(
+            progress=UnifiedIncidentState(**progress),
+            grader=grade_episode(progress),
+        )
+    @app.get("/health", tags=["challenge"])
+    def health() -> dict[str, object]:
+        return {
+            "status": "ok",
+            "environment": "unified_incident_env",
+            "version": "2.0.0",
+            "stages": ["triage", "mitigation", "validation", "resolved"],
+        }
+app = create_compatible_app()
+def serve(host: str = "0.0.0.0", port: int = 8000) -> None:
+    import uvicorn
+    uvicorn.run(app, host=host, port=port)
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--host", default=os.environ.get("HOST", "0.0.0.0"))
+    parser.add_argument("--port", type=int, default=int(os.environ.get("PORT", "8000")))
+    args = parser.parse_args()
+    serve(host=args.host, port=args.port)
+if __name__ == "__main__":
+    main()
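
The two `curl` calls embedded in `_SIMPLE_HTML` translate directly to standard-library Python. The sketch below only builds the requests and does not send them, so it runs without a server. `build_post` is a hypothetical helper. The payload shapes are copied from the `curl` example, and the `/step` schema the server actually accepts may differ, for instance by wrapping the action in an outer object.

```python
import json
import urllib.request


def build_post(base_url: str, path: str, payload: dict) -> urllib.request.Request:
    """Build, but do not send, a JSON POST request for the environment server."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Mirrors the manual example: reset with an empty body, then one query step.
reset_request = build_post("http://127.0.0.1:8000", "/reset", {})
step_request = build_post(
    "http://127.0.0.1:8000",
    "/step",
    {"action_type": "query_deploys", "service": "worker"},
)

# To send against a running server:
#     with urllib.request.urlopen(step_request) as response:
#         print(json.load(response))
print(step_request.get_method(), step_request.full_url)
```

Because `create_compatible_app` passes `max_concurrent_envs=1`, a client along these lines should assume it shares a single environment instance with any other caller.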

unified_incident_env/server/challenge.py ADDED

	@@ -0,0 +1,753 @@

+"""Scenario catalog, baselines, and runtime helpers for the honest v2 core."""
+from __future__ import annotations
+from copy import deepcopy
+from typing import Any
+from ..models import (
+    BaselineCatalog,
+    BaselineDefinition,
+    BaselineStep,
+    ScenarioCatalog,
+    ScenarioSummary,
+    UnifiedIncidentAction,
+)
+DEFAULT_SCENARIO_ID = "worker_deploy_cascade"
+SCENARIOS: dict[str, dict[str, Any]] = {
+    "worker_deploy_cascade": {
+        "id": "worker_deploy_cascade",
+        "difficulty": "easy",
+        "name": "Worker Deploy Cascade",
+        "description": (
+            "A bad worker deploy causes sustained database overload and login 502s at the gateway. "
+            "The agent must diagnose from evidence, choose a safe remediation, verify recovery, and declare resolved only after checks pass."
+        ),
+        "root_cause": "A bad worker deploy is driving repeated database overload.",
+        "optimal_ticks": 10,
+        "max_ticks": 12,
+        "critical_service_weights": {
+            "worker": 0.4,
+            "database": 0.4,
+            "api-gateway": 0.2,
+            "cache": 0.0,
+        },
+        "reward_config": {
+            "step_cost": 0.01,
+            "redundant_action_penalty": 0.02,
+            "unsafe_action_penalty": 0.08,
+            "premature_resolution_penalty": 0.2,
+            "successful_resolution_bonus": 0.25,
+            "hypothesis_bonus_scale": 0.12,
+            "forbidden_reward_sources": [
+                "evidence_discovery",
+                "query_success",
+                "unlock_events",
+                "stage_advancement",
+                "patch_id_selection",
+            ],
+        },
+        "initial_services": {
+            "api-gateway": {
+                "status": "degraded",
+                "cpu_pct": 61.0,
+                "memory_pct": 38.0,
+                "error_rate_pct": 24.0,
+                "latency_ms": 640.0,
+            },
+            "cache": {
+                "status": "healthy",
+                "cpu_pct": 18.0,
+                "memory_pct": 24.0,
+                "error_rate_pct": 0.0,
+                "latency_ms": 14.0,
+            },
+            "database": {
+                "status": "crashed",
+                "cpu_pct": 99.0,
+                "memory_pct": 97.0,
+                "error_rate_pct": 100.0,
+                "latency_ms": 0.0,
+            },
+            "worker": {
+                "status": "degraded",
+                "cpu_pct": 88.0,
+                "memory_pct": 71.0,
+                "error_rate_pct": 19.0,
+                "latency_ms": 420.0,
+            },
+        },
+        "initial_alerts": [
+            {
+                "service": "api-gateway",
+                "severity": "critical",
+                "message": "Login requests are returning sustained 502s.",
+            },
+            {
+                "service": "database",
+                "severity": "critical",
+                "message": "Database process is crashing under repeated overload.",
+            },
+            {
+                "service": "worker",
+                "severity": "warning",
+                "message": "Worker queue depth and retry volume spiked after a recent rollout.",
+            },
+        ],
+        "logs": {
+            "api-gateway": (
+                "Gateway upstream errors point to worker timeouts followed by database connection failures. "
+                "No recent gateway deploys are recorded."
+            ),
+            "cache": "Cache hit ratio is stable and cache upstream probes remain healthy.",
+            "database": (
+                "Database logs show repeated bursts of expensive worker-originated writes immediately before each crash."
+            ),
+            "worker": (
+                "Worker logs show request fanout amplification and elevated retries beginning right after rollout build worker@2026.04.23-bad."
+            ),
+        },
+        "metrics": {
+            "api-gateway": {
+                "error_rate": "Gateway 502 rate is 24% and closely tracks worker timeout bursts.",
+                "latency": "Gateway p95 latency climbed to 640ms while waiting on downstream worker/database calls.",
+            },
+            "database": {
+                "cpu": "Database CPU is pinned at 99% until the process exits.",
+                "latency": "Database latency spikes sharply before each crash loop.",
+            },
+            "worker": {
+                "cpu": "Worker CPU is 88% with growing queue pressure.",
+                "error_rate": "Worker retry/error rate is elevated after rollout.",
+            },
+        },
+        "dependencies": {
+            "api-gateway": "api-gateway -> worker -> database",
+            "worker": "worker -> database",
+            "database": "database is a terminal dependency for write-heavy worker jobs",
+        },
+        "deploy_history": {
+            "api-gateway": "No gateway deploys in the last 24h.",
+            "cache": "No cache deploys in the last 24h.",
+            "database": "No database deploys in the last 24h.",
+            "worker": "Rolled out worker@2026.04.23-bad 12 minutes ago.",
+        },
+        "checks": {
+            "database_recovery": "Confirms the database is healthy and no longer crashing.",
+            "end_to_end": "Confirms login traffic succeeds without worker-induced overload.",
+        },
+        "truth": {
+            "root_cause": "bad_worker_deploy",
+            "affected_services": ["worker", "database", "api-gateway"],
+            "best_next_action": "rollback_deploy",
+        },
+        "remediation_recipe": {
+            "rollback_target": "worker",
+            "restart_target": "database",
+            "isolate_target": "worker",
+            "restart_requires_cause_removed": True,
+            "incident_driver": "worker",
+            "resolution_check": "end_to_end",
+        },
+        "post_rollback_services": {
+            "worker": {"status": "healthy", "cpu_pct": 32.0, "memory_pct": 37.0, "error_rate_pct": 2.0, "latency_ms": 40.0},
+        },
+        "post_rollback_user_impact": 0.55,
+        "post_rollback_slo_burn": 0.58,
+        "post_restart_services": {
+            "database": {"status": "healthy", "cpu_pct": 34.0, "memory_pct": 39.0, "error_rate_pct": 0.0, "latency_ms": 22.0},
+            "api-gateway": {"status": "healthy", "cpu_pct": 28.0, "memory_pct": 31.0, "error_rate_pct": 0.0, "latency_ms": 38.0},
+        },
+        "post_restart_user_impact": 0.14,
+        "post_restart_slo_burn": 0.18,
+        "post_isolate_services": {
+            "worker": {"status": "isolated", "cpu_pct": 8.0, "memory_pct": 18.0, "error_rate_pct": 0.0, "latency_ms": 0.0},
+            "database": {"status": "healthy", "cpu_pct": 41.0, "memory_pct": 46.0, "error_rate_pct": 0.0, "latency_ms": 26.0},
+            "api-gateway": {"status": "degraded", "cpu_pct": 34.0, "memory_pct": 33.0, "error_rate_pct": 7.0, "latency_ms": 91.0},
+        },
+        "post_isolate_user_impact": 0.45,
+        "post_isolate_slo_burn": 0.47,
+        "degraded_services": {
+            "worker": {"status": "degraded", "cpu_pct": 88.0, "memory_pct": 71.0, "error_rate_pct": 19.0, "latency_ms": 420.0},
+            "database": {"status": "crashed", "cpu_pct": 99.0, "memory_pct": 97.0, "error_rate_pct": 100.0, "latency_ms": 0.0},
+            "api-gateway": {"status": "degraded", "cpu_pct": 61.0, "memory_pct": 38.0, "error_rate_pct": 24.0, "latency_ms": 640.0},
+        },
+        "degraded_user_impact": 0.82,
+        "degraded_slo_burn": 0.91,
+        "failure_messages": {
+            "wrong_rollback_target": "Rolling back a service without a causal link wastes time and risk.",
+            "low_value_restart": "Restarting that service is not the safe next remediation step for this incident.",
+            "premature_restart": "Restarting before removing the trigger only causes another crash loop.",
+            "wrong_isolation_target": "Isolating that service does not contain the dominant failure path.",
+        },
+    },
+    "db_config_rollout": {
+        "id": "db_config_rollout",
+        "difficulty": "medium",
+        "name": "Database Config Rollout Regression",
+        "description": (
+            "A database config push cut connection pool size and write requests now time out. "
+            "A separate worker deploy landed around the same time and looks suspicious but is not the cause. "
+            "The agent must avoid the decoy, roll back the database config, restart it, and verify recovery."
+        ),
+        "root_cause": "A bad database config rollout shrank the connection pool and is dropping writes.",
+        "optimal_ticks": 10,
+        "max_ticks": 12,
+        "critical_service_weights": {
+            "worker": 0.2,
+            "database": 0.5,
+            "api-gateway": 0.3,
+            "cache": 0.0,
+        },
+        "reward_config": {
+            "step_cost": 0.01,
+            "redundant_action_penalty": 0.02,
+            "unsafe_action_penalty": 0.08,
+            "premature_resolution_penalty": 0.2,
+            "successful_resolution_bonus": 0.25,
+            "hypothesis_bonus_scale": 0.12,
+            "forbidden_reward_sources": [
+                "evidence_discovery",
+                "query_success",
+                "unlock_events",
+                "stage_advancement",
+                "patch_id_selection",
+            ],
+        },
+        "initial_services": {
+            "api-gateway": {
+                "status": "degraded",
+                "cpu_pct": 44.0,
+                "memory_pct": 36.0,
+                "error_rate_pct": 17.0,
+                "latency_ms": 520.0,
+            },
+            "cache": {
+                "status": "healthy",
+                "cpu_pct": 20.0,
+                "memory_pct": 26.0,
+                "error_rate_pct": 0.0,
+                "latency_ms": 15.0,
+            },
+            "database": {
+                "status": "degraded",
+                "cpu_pct": 62.0,
+                "memory_pct": 54.0,
+                "error_rate_pct": 48.0,
+                "latency_ms": 880.0,
+            },
+            "worker": {
+                "status": "degraded",
+                "cpu_pct": 51.0,
+                "memory_pct": 44.0,
+                "error_rate_pct": 12.0,
+                "latency_ms": 310.0,
+            },
+        },
+        "initial_alerts": [
+            {
+                "service": "database",
+                "severity": "critical",
+                "message": "Database connection acquire timeouts at 48% and climbing.",
+            },
+            {
+                "service": "api-gateway",
+                "severity": "warning",
+                "message": "Write-path requests are returning sustained 5xx.",
+            },
+            {
+                "service": "worker",
+                "severity": "warning",
+                "message": "Worker write latency is elevated; retries are climbing.",
+            },
+        ],
+        "logs": {
+            "api-gateway": (
+                "Gateway upstream errors are downstream-driven: writes to the worker path return pool-exhaustion "
+                "errors originating from the database. No gateway deploys recorded in the last 24h."
+            ),
+            "cache": "Cache reads are healthy and unrelated to the current write-path failures.",
+            "database": (
+                "Database logs show 'could not acquire connection' errors immediately after config rollout "
+                "db@2026.04.24-cfg lowered max_connections from 80 to 12."
+            ),
+            "worker": (
+                "Worker logs show retries driven by downstream database pool exhaustion, not local faults. "
+                "Worker code deploy worker@2026.04.24-refactor is unrelated to the pool error signature."
+            ),
+        },
+        "metrics": {
+            "api-gateway": {
+                "error_rate": "Gateway 5xx rate is 17% and matches the database pool-exhaustion windows one-for-one.",
+                "latency": "Gateway p95 climbed to 520ms waiting on database connection acquire.",
+            },
+            "database": {
+                "cpu": "Database CPU is moderate (~62%), so this is not a compute overload pattern.",
+                "error_rate": "Database error rate is 48% and dominated by 'connection acquire timeout'.",
+                "latency": "Database write latency jumped to 880ms after the config rollout.",
+            },
+            "worker": {
+                "cpu": "Worker CPU is 51% — no local overload; retries are reactive.",
+                "error_rate": "Worker errors are retries against the saturated database pool.",
+            },
+        },
+        "dependencies": {
+            "api-gateway": "api-gateway -> worker -> database",
+            "worker": "worker -> database",
+            "database": "database is the terminal dependency; pool exhaustion here starves all upstream writers",
+        },
+        "deploy_history": {
+            "api-gateway": "No gateway deploys in the last 24h.",
+            "cache": "No cache deploys in the last 24h.",
+            "database": "Applied config db@2026.04.24-cfg 15 minutes ago (max_connections 80 -> 12).",
+            "worker": "Rolled out worker@2026.04.24-refactor 22 minutes ago (unrelated code cleanup).",
+        },
+        "checks": {
+            "database_recovery": "Confirms database write latency and pool health are back within SLO.",
+            "end_to_end": "Confirms gateway write-path traffic succeeds end-to-end.",
+        },
+        "truth": {
+            "root_cause": "database_only_failure",
+            "affected_services": ["database", "api-gateway", "worker"],
+            "best_next_action": "rollback_deploy",
+        },
+        "remediation_recipe": {
+            "rollback_target": "database",
+            "restart_target": "database",
+            "isolate_target": None,
+            "restart_requires_cause_removed": True,
+            "incident_driver": "database",
+            "resolution_check": "end_to_end",
+        },
+        "post_rollback_services": {
+            "database": {"status": "degraded", "cpu_pct": 48.0, "memory_pct": 42.0, "error_rate_pct": 6.0, "latency_ms": 120.0},
+        },
+        "post_rollback_user_impact": 0.40,
+        "post_rollback_slo_burn": 0.45,
+        "post_restart_services": {
+            "database": {"status": "healthy", "cpu_pct": 36.0, "memory_pct": 40.0, "error_rate_pct": 0.0, "latency_ms": 26.0},
+            "api-gateway": {"status": "healthy", "cpu_pct": 29.0, "memory_pct": 30.0, "error_rate_pct": 0.0, "latency_ms": 44.0},
+            "worker": {"status": "healthy", "cpu_pct": 33.0, "memory_pct": 36.0, "error_rate_pct": 1.0, "latency_ms": 48.0},
+        },
+        "post_restart_user_impact": 0.10,
+        "post_restart_slo_burn": 0.14,
+        "post_isolate_services": {},
+        "post_isolate_user_impact": 0.70,
+        "post_isolate_slo_burn": 0.75,
+        "degraded_services": {
+            "database": {"status": "degraded", "cpu_pct": 62.0, "memory_pct": 54.0, "error_rate_pct": 48.0, "latency_ms": 880.0},
+            "api-gateway": {"status": "degraded", "cpu_pct": 44.0, "memory_pct": 36.0, "error_rate_pct": 17.0, "latency_ms": 520.0},
+            "worker": {"status": "degraded", "cpu_pct": 51.0, "memory_pct": 44.0, "error_rate_pct": 12.0, "latency_ms": 310.0},
+        },
+        "degraded_user_impact": 0.70,
+        "degraded_slo_burn": 0.78,
+        "failure_messages": {
+            "wrong_rollback_target": "The worker deploy is a decoy; worker errors are reactive to database pool exhaustion.",
+            "low_value_restart": "Restarting that service does not address a database-config regression.",
+            "premature_restart": "Restarting the database before rolling back the config will re-inherit the 12-connection pool and fail again.",
+            "wrong_isolation_target": "Isolation is not useful here: the cause is a config regression, not a runaway service.",
+        },
+    },
+    "gateway_auth_rollout": {
+        "id": "gateway_auth_rollout",
+        "difficulty": "hard",
+        "name": "Gateway Auth Rollout Regression",
+        "description": (
+            "A new api-gateway auth-middleware rollout is rejecting ~40% of valid logins. "
+            "A recent worker deploy and elevated worker queue depth make the worker look like a plausible suspect. "
+            "The agent must localize to the gateway, roll back its deploy, and verify recovery without unnecessary restarts."
+        ),
+        "root_cause": "A bad api-gateway auth-middleware rollout is rejecting valid logins.",
+        "optimal_ticks": 8,
+        "max_ticks": 10,
+        "critical_service_weights": {
+            "worker": 0.15,
+            "database": 0.15,
+            "api-gateway": 0.70,
+            "cache": 0.0,
+        },
+        "reward_config": {
+            "step_cost": 0.01,
+            "redundant_action_penalty": 0.02,
+            "unsafe_action_penalty": 0.12,
+            "premature_resolution_penalty": 0.3,
+            "successful_resolution_bonus": 0.3,
+            "hypothesis_bonus_scale": 0.12,
+            "forbidden_reward_sources": [
+                "evidence_discovery",
+                "query_success",
+                "unlock_events",
+                "stage_advancement",
+                "patch_id_selection",
+            ],
+        },
+        "initial_services": {
+            "api-gateway": {
+                "status": "degraded",
+                "cpu_pct": 38.0,
+                "memory_pct": 42.0,
+                "error_rate_pct": 41.0,
+                "latency_ms": 180.0,
+            },
+            "cache": {
+                "status": "healthy",
+                "cpu_pct": 17.0,
+                "memory_pct": 23.0,
+                "error_rate_pct": 0.0,
+                "latency_ms": 12.0,
+            },
+            "database": {
+                "status": "healthy",
+                "cpu_pct": 38.0,
+                "memory_pct": 41.0,
+                "error_rate_pct": 1.0,
+                "latency_ms": 28.0,
+            },
+            "worker": {
+                "status": "degraded",
+                "cpu_pct": 63.0,
+                "memory_pct": 48.0,
+                "error_rate_pct": 4.0,
+                "latency_ms": 220.0,
+            },
+        },
+        "initial_alerts": [
+            {
+                "service": "api-gateway",
+                "severity": "critical",
+                "message": "Gateway is returning 401 on ~40% of valid login attempts.",
+            },
+            {
+                "service": "worker",
+                "severity": "warning",
+                "message": "Worker queue depth is elevated from the retry storm upstream.",
+            },
+        ],
+        "logs": {
+            "api-gateway": (
+                "Gateway logs show auth-middleware rejecting tokens with valid signatures. "
+                "Rejection rate started exactly at the gateway@2026.04.24-auth rollout boundary."
+            ),
+            "cache": "Cache hit ratio stable and unrelated.",
+            "database": "Database logs are clean; no increase in errors or latency.",
+            "worker": (
+                "Worker logs show client-side retry storms triggered by upstream 401s, not local faults. "
+                "Worker deploy worker@2026.04.24-hotfix is a log-format tweak and does not touch auth."
+            ),
+        },
+        "metrics": {
+            "api-gateway": {
+                "error_rate": "Gateway error rate is 41%, dominated by 401 responses (auth failures).",
+                "latency": "Gateway latency is normal — errors are fast rejections, not timeouts.",
+            },
+            "database": {
+                "cpu": "Database CPU is 38% (normal).",
+                "error_rate": "Database error rate is ~1% and flat.",
+            },
+            "worker": {
+                "cpu": "Worker CPU is 63% from retry volume, not workload.",
+                "error_rate": "Worker errors are reactive retries, not primary failures.",
+            },
+        },
+        "dependencies": {
+            "api-gateway": "api-gateway -> (auth) -> worker -> database",
+            "worker": "worker -> database",
+            "database": "database is healthy; it is not on the fault path",
+        },
+        "deploy_history": {
+            "api-gateway": "Rolled out gateway@2026.04.24-auth 9 minutes ago (auth middleware rewrite).",
+            "cache": "No cache deploys in the last 24h.",
+            "database": "No database deploys in the last 24h.",
+            "worker": "Rolled out worker@2026.04.24-hotfix 18 minutes ago (log-format tweak, no auth changes).",
+        },
+        "checks": {
+            "database_recovery": "Confirms the database is healthy (always healthy in this scenario).",
+            "end_to_end": "Confirms gateway login traffic succeeds end-to-end.",
+        },
+        "truth": {
+            "root_cause": "api_gateway_fault",
+            "affected_services": ["api-gateway", "worker"],
+            "best_next_action": "rollback_deploy",
+        },
+        "remediation_recipe": {
+            "rollback_target": "api-gateway",
+            "restart_target": None,
+            "isolate_target": "api-gateway",
+            "restart_requires_cause_removed": True,
+            "incident_driver": "api-gateway",
+            "resolution_check": "end_to_end",
+        },
+        "post_rollback_services": {
+            "api-gateway": {"status": "healthy", "cpu_pct": 30.0, "memory_pct": 34.0, "error_rate_pct": 1.0, "latency_ms": 38.0},
+            "worker": {"status": "healthy", "cpu_pct": 34.0, "memory_pct": 36.0, "error_rate_pct": 1.0, "latency_ms": 52.0},
+        },
+        "post_rollback_user_impact": 0.12,
+        "post_rollback_slo_burn": 0.18,
+        "post_restart_services": {},
+        "post_restart_user_impact": 0.12,
+        "post_restart_slo_burn": 0.18,
+        "post_isolate_services": {
+            "api-gateway": {"status": "isolated", "cpu_pct": 6.0, "memory_pct": 14.0, "error_rate_pct": 0.0, "latency_ms": 0.0},
+        },
+        "post_isolate_user_impact": 0.55,
+        "post_isolate_slo_burn": 0.60,
+        "degraded_services": {
+            "api-gateway": {"status": "degraded", "cpu_pct": 38.0, "memory_pct": 42.0, "error_rate_pct": 41.0, "latency_ms": 180.0},
+            "worker": {"status": "degraded", "cpu_pct": 63.0, "memory_pct": 48.0, "error_rate_pct": 4.0, "latency_ms": 220.0},
+        },
+        "degraded_user_impact": 0.65,
+        "degraded_slo_burn": 0.72,
+        "failure_messages": {
+            "wrong_rollback_target": "The worker deploy is a log-format tweak and is not on the auth fault path.",
+            "low_value_restart": "Restarting a service does not fix a config/middleware regression rolled out as a deploy.",
+            "premature_restart": "Restarting before rolling back the gateway auth change just restarts the same bad middleware.",
+            "wrong_isolation_target": "Isolating workers or database cuts healthy traffic without fixing the gateway auth fault.",
+        },
+    },
+}
+_RUNTIME_PROGRESS: dict[str, Any] | None = None
+def get_scenario(scenario_id: str) -> dict[str, Any]:
+    if scenario_id not in SCENARIOS:
+        raise ValueError(f"Unknown scenario_id {scenario_id!r}")
+    return deepcopy(SCENARIOS[scenario_id])
+SUPPORTED_DIFFICULTIES: tuple[str, ...] = ("easy", "medium", "hard")
+def scenario_for_difficulty(difficulty: str) -> dict[str, Any]:
+    for scenario in SCENARIOS.values():
+        if scenario["difficulty"] == difficulty:
+            return deepcopy(scenario)
+    raise ValueError(f"Unknown difficulty {difficulty!r}")
+def list_scenarios(difficulty: str | None = None) -> ScenarioCatalog:
+    if difficulty is not None and difficulty not in SUPPORTED_DIFFICULTIES:
+        raise ValueError(f"Unknown difficulty {difficulty!r}")
+    scenarios = [
+        ScenarioSummary(
+            id=scenario["id"],
+            difficulty=scenario["difficulty"],
+            name=scenario["name"],
+            description=scenario["description"],
+            root_cause=scenario["root_cause"],
+            optimal_ticks=scenario["optimal_ticks"],
+        )
+        for scenario in SCENARIOS.values()
+        if difficulty is None or scenario["difficulty"] == difficulty
+    ]
+    return ScenarioCatalog(
+        default_scenario_id=DEFAULT_SCENARIO_ID,
+        available_difficulties=list(SUPPORTED_DIFFICULTIES),
+        filtered_difficulty=difficulty,
+        scenarios=scenarios,
+    )
+def _worker_cascade_baseline() -> list[BaselineStep]:
+    return [
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_deploys", service="worker"),
+            rationale="Check whether any recent deploy aligns with the incident start.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_logs", service="worker"),
+            rationale="Inspect worker logs because deploy timing and queue pressure suggest worker-originated harm.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_metrics", service="database", metric="cpu"),
+            rationale="Confirm that the database is overloaded as a downstream effect.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_dependencies", service="api-gateway"),
+            rationale="Verify the gateway depends on the worker and database path.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(
+                action_type="submit_hypothesis",
+                hypothesis={
+                    "root_cause": "bad_worker_deploy",
+                    "affected_services": ["worker", "database", "api-gateway"],
+                    "confidence": 0.82,
+                    "recommended_next_action": "rollback_deploy",
+                },
+            ),
+            rationale="Commit a calibrated hypothesis before taking an invasive mitigation step.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="rollback_deploy", service="worker"),
+            rationale="Remove the triggering change before restarting downstream services.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="restart_service", service="database"),
+            rationale="Bring the database back cleanly after the root cause is removed.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="run_check", check_name="database_recovery"),
+            rationale="Verify the database is no longer crashing.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="run_check", check_name="end_to_end"),
+            rationale="Verify gateway traffic succeeds end-to-end.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="declare_resolved"),
+            rationale="Declare resolved only after objective checks pass.",
+        ),
+    ]
+def _db_config_rollout_baseline() -> list[BaselineStep]:
+    return [
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_logs", service="database"),
+            rationale="Database is the loudest alert; inspect logs for the actual error signature.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_deploys", service="database"),
+            rationale="Pool-acquire errors suggest a config change; check recent database rollouts.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_metrics", service="database", metric="error_rate"),
+            rationale="Confirm the error pattern is pool exhaustion rather than compute overload.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_logs", service="worker"),
+            rationale="Rule out the decoy worker deploy by reading worker logs directly.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(
+                action_type="submit_hypothesis",
+                hypothesis={
+                    "root_cause": "database_only_failure",
+                    "affected_services": ["database", "api-gateway", "worker"],
+                    "confidence": 0.8,
+                    "recommended_next_action": "rollback_deploy",
+                },
+            ),
+            rationale="Localize the fault to the database config before remediating.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="rollback_deploy", service="database"),
+            rationale="Roll back the offending database config rollout.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="restart_service", service="database"),
+            rationale="Restart the database cleanly against the restored pool config.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="run_check", check_name="database_recovery"),
+            rationale="Verify database pool health and write latency are back within SLO.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="run_check", check_name="end_to_end"),
+            rationale="Verify gateway write-path traffic succeeds end-to-end.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="declare_resolved"),
+            rationale="Declare resolved only after objective checks pass.",
+        ),
+    ]
+def _gateway_auth_rollout_baseline() -> list[BaselineStep]:
+    return [
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_logs", service="api-gateway"),
+            rationale="Gateway is rejecting logins; read gateway logs to localize the rejection class.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_deploys", service="api-gateway"),
+            rationale="Login rejection aligns with a recent auth middleware rollout; confirm deploy timing.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_deploys", service="worker"),
+            rationale="Rule out the worker deploy explicitly rather than assuming.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(
+                action_type="submit_hypothesis",
+                hypothesis={
+                    "root_cause": "api_gateway_fault",
+                    "affected_services": ["api-gateway", "worker"],
+                    "confidence": 0.85,
+                    "recommended_next_action": "rollback_deploy",
+                },
+            ),
+            rationale="Commit a calibrated hypothesis localizing to the gateway auth rollout.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="rollback_deploy", service="api-gateway"),
+            rationale="Roll back the bad auth middleware rollout; no restart needed.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="run_check", check_name="end_to_end"),
+            rationale="Verify that gateway login traffic now succeeds end-to-end.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="run_check", check_name="database_recovery"),
+            rationale="Confirm the database is (and stayed) healthy throughout.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="declare_resolved"),
+            rationale="Declare resolved only after objective checks pass.",
+        ),
+    ]
+_BASELINE_BUILDERS = {
+    "worker_deploy_cascade": _worker_cascade_baseline,
+    "db_config_rollout": _db_config_rollout_baseline,
+    "gateway_auth_rollout": _gateway_auth_rollout_baseline,
+}
+def _baseline_actions(scenario_id: str) -> list[BaselineStep]:
+    builder = _BASELINE_BUILDERS.get(scenario_id)
+    if builder is None:
+        raise ValueError(f"No baseline for scenario_id {scenario_id!r}")
+    return builder()
+def list_baselines(scenario_id: str | None = None) -> BaselineCatalog:
+    if scenario_id is not None:
+        if scenario_id not in SCENARIOS:
+            raise ValueError(f"Unknown scenario_id {scenario_id!r}")
+        scenario_ids = [scenario_id]
+    else:
+        scenario_ids = list(SCENARIOS.keys())
+    baselines = [
+        BaselineDefinition(
+            scenario_id=current_id,
+            name="deterministic-remediation-baseline",
+            description=SCENARIOS[current_id]["description"],
+            optimal_ticks=SCENARIOS[current_id]["optimal_ticks"],
+            actions=_baseline_actions(current_id),
+        )
+        for current_id in scenario_ids
+    ]
+    return BaselineCatalog(baselines=baselines)
+def set_runtime_progress(progress: dict[str, Any]) -> None:
+    global _RUNTIME_PROGRESS
+    _RUNTIME_PROGRESS = deepcopy(progress)
+def current_runtime_progress() -> dict[str, Any]:
+    if _RUNTIME_PROGRESS is None:
+        raise ValueError("Runtime progress is not initialized")
+    return deepcopy(_RUNTIME_PROGRESS)
+def grade_episode(state: dict[str, Any]):
+    from .grader import UnifiedIncidentGrader
+    scenario_id = state.get("scenario_id", DEFAULT_SCENARIO_ID)
+    return UnifiedIncidentGrader().build_report(state, get_scenario(scenario_id))
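
Note on `get_scenario` and `scenario_for_difficulty` above: both return a `deepcopy` of the registry entry, which appears to be what lets an episode mutate nested service state without corrupting `SCENARIOS` for the next reset. The following is a minimal sketch of that pattern, not the author's code: the one-entry registry and the `demo_scenario` id are hypothetical stand-ins for the much larger real entries.

```python
from copy import deepcopy
from typing import Any

# Hypothetical one-entry registry standing in for the real SCENARIOS dict.
SCENARIOS: dict[str, dict[str, Any]] = {
    "demo_scenario": {
        "initial_services": {"api-gateway": {"status": "degraded"}},
    },
}


def get_scenario(scenario_id: str) -> dict[str, Any]:
    """Return an independent copy so callers cannot mutate the registry."""
    if scenario_id not in SCENARIOS:
        raise ValueError(f"Unknown scenario_id {scenario_id!r}")
    return deepcopy(SCENARIOS[scenario_id])


episode_view = get_scenario("demo_scenario")
# Simulate an episode mutating nested state, for example after a rollback.
episode_view["initial_services"]["api-gateway"]["status"] = "healthy"

# The registry entry is untouched, so the next lookup starts from the original state.
registry_status = SCENARIOS["demo_scenario"]["initial_services"]["api-gateway"]["status"]
print(registry_status)
```

A shallow `dict(...)` copy would not be enough here, because the nested `initial_services` dicts would still be shared between the registry and the episode.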

unified_incident_env/server/environment.py ADDED

	@@ -0,0 +1,613 @@

+"""Honest narrow incident-remediation environment core."""
+from __future__ import annotations
+import json
+import uuid
+from typing import Any
+from openenv.core.env_server import Environment
+from openenv.core.env_server.types import EnvironmentMetadata
+from ..models import (
+    Alert,
+    CheckResult,
+    ServiceHealth,
+    UnifiedIncidentAction,
+    UnifiedIncidentObservation,
+    UnifiedIncidentState,
+)
+from .challenge import DEFAULT_SCENARIO_ID, SCENARIOS, get_scenario, scenario_for_difficulty, set_runtime_progress
+from .grader import UnifiedIncidentGrader
+SERVICE_ORDER = ("api-gateway", "cache", "database", "worker")
+ALL_ACTIONS = [
+    "query_logs",
+    "query_metrics",
+    "query_dependencies",
+    "query_deploys",
+    "rollback_deploy",
+    "restart_service",
+    "run_check",
+    "isolate_service",
+    "escalate",
+    "submit_hypothesis",
+    "declare_resolved",
+]
+REQUIRED_FIELDS_BY_ACTION: dict[str, list[str]] = {
+    "query_logs": ["service"],
+    "query_metrics": ["service", "metric"],
+    "query_dependencies": ["service"],
+    "query_deploys": ["service"],
+    "rollback_deploy": ["service"],
+    "restart_service": ["service"],
+    "run_check": ["check_name"],
+    "isolate_service": ["service"],
+    "escalate": [],
+    "submit_hypothesis": ["hypothesis"],
+    "declare_resolved": [],
+}
+STATUS_VALUES = {
+    "healthy": 1.0,
+    "degraded": 0.4,
+    "crashed": 0.0,
+    "isolated": 0.2,
+}
+class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncidentObservation, UnifiedIncidentState]):
+    """A bounded-action incident diagnosis and safe remediation environment."""
+    SUPPORTS_CONCURRENT_SESSIONS = False
+    def __init__(self) -> None:
+        super().__init__()
+        self._grader = UnifiedIncidentGrader()
+        self._episode = self._make_episode(get_scenario(DEFAULT_SCENARIO_ID))
+        set_runtime_progress(self._state_dict())
+    def get_metadata(self) -> EnvironmentMetadata:
+        return EnvironmentMetadata(
+            name="unified_incident_env",
+            description=(
+                "A narrow incident diagnosis and safe remediation environment with bounded actions, "
+                "world-state transitions, explicit checks, and effect-based rewards."
+            ),
+            version="2.0.0",
+            author="Daksh Verma",
+        )
+    def reset(self, seed: int | None = None, episode_id: str | None = None, **kwargs: Any) -> UnifiedIncidentObservation:
+        del seed
+        scenario_id = kwargs.get("scenario_id")
+        difficulty = kwargs.get("difficulty")
+        if scenario_id:
+            scenario = get_scenario(scenario_id)
+        elif difficulty:
+            scenario = scenario_for_difficulty(difficulty)
+        else:
+            scenario = get_scenario(DEFAULT_SCENARIO_ID)
+        self._episode = self._make_episode(scenario, episode_id=episode_id)
+        set_runtime_progress(self._state_dict())
+        return self._build_observation(
+            last_action_result="Episode reset.",
+            tool_output=None,
+            reward=0.0,
+            done=False,
+        )
+    def step(self, action: UnifiedIncidentAction | dict[str, Any], timeout_s: float | None = None, **kwargs: Any) -> UnifiedIncidentObservation:
+        del timeout_s, kwargs
+        if isinstance(action, dict):
+            action = UnifiedIncidentAction(**action)
+        if self._episode["done"]:
+            return self._build_observation(
+                last_action_result="Episode complete. Reset to start another run.",
+                tool_output=None,
+                reward=0.0,
+                done=True,
+            )
+        self._episode["tick"] += 1
+        self._episode["step_count"] += 1
+        before_potential = self._incident_health_potential()
+        base_step_cost = float(self._episode["scenario"]["reward_config"]["step_cost"])
+        penalty = 0.0
+        bonus = 0.0
+        tool_output: str | None = None
+        state_changed = False
+        useful_observation = False
+        self._episode["failure_type"] = None
+        self._episode["why_failed"] = None
+        self._episode["loop_warning"] = None
+        if action.action_type == "query_logs":
+            tool_output = self._query_logs(action.service)
+            useful_observation = self._mark_evidence_once(f"logs:{action.service}", tool_output)
+            last_action_result = f"Queried logs for {action.service}."
+        elif action.action_type == "query_metrics":
+            tool_output = self._query_metrics(action.service, action.metric)
+            useful_observation = self._mark_evidence_once(f"metrics:{action.service}:{action.metric}", tool_output)
+            last_action_result = f"Queried {action.metric} for {action.service}."
+        elif action.action_type == "query_dependencies":
+            tool_output = self._query_dependencies(action.service)
+            useful_observation = self._mark_evidence_once(f"deps:{action.service}", tool_output)
+            last_action_result = f"Queried dependencies for {action.service}."
+        elif action.action_type == "query_deploys":
+            tool_output = self._query_deploys(action.service)
+            useful_observation = self._mark_evidence_once(f"deploys:{action.service}", tool_output)
+            last_action_result = f"Queried deploy history for {action.service}."
+        elif action.action_type == "submit_hypothesis":
+            bonus, useful_observation, last_action_result = self._submit_hypothesis(action)
+        elif action.action_type == "rollback_deploy":
+            state_changed, penalty, last_action_result = self._rollback_deploy(action.service)
+        elif action.action_type == "restart_service":
+            state_changed, penalty, last_action_result = self._restart_service(action.service)
+        elif action.action_type == "isolate_service":
+            state_changed, penalty, last_action_result = self._isolate_service(action.service)
+        elif action.action_type == "run_check":
+            tool_output, useful_observation, last_action_result = self._run_check(action.check_name)
+        elif action.action_type == "escalate":
+            useful_observation = self._mark_evidence_once(
+                f"escalate:{self._episode['tick']}",
+                "Escalation note recorded: expert attention requested while keeping the environment state unchanged.",
+            )
+            last_action_result = "Escalated for human attention."
+            tool_output = "Escalation does not fix the incident, but records that expert attention was requested."
+        elif action.action_type == "declare_resolved":
+            resolved, penalty, bonus, last_action_result = self._declare_resolved()
+            state_changed = resolved
+        else:
+            last_action_result = f"Unsupported action {action.action_type!r}."
+            penalty += self._unsafe_penalty()
+            self._set_failure("unsupported_action", "That action is not part of this honest narrow environment.")
+        self._advance_world()
+        self._refresh_alerts()
+        self._update_loop_feedback(action, useful_observation or state_changed)
+        after_potential = self._incident_health_potential()
+        reward = -base_step_cost + (after_potential - before_potential) + bonus - penalty
+        if not useful_observation and not state_changed and bonus <= 0.0:
+            self._episode["wasteful_ticks"] += 1
+        if self._episode["tick"] >= self._episode["max_ticks"] and not self._episode["done"]:
+            self._episode["done"] = True
+            last_action_result = f"{last_action_result} Tick budget exhausted.".strip()
+        self._episode["last_action_result"] = last_action_result
+        self._episode["workflow_stage"] = self._workflow_stage()
+        self._episode["score_breakdown"] = self._grader.compute_breakdown(self._state_dict(), self._episode["scenario"])
+        self._episode["final_score"] = self._episode["score_breakdown"]["final_score"]
+        self._episode["cumulative_reward"] = round(self._episode["cumulative_reward"] + reward, 4)
+        set_runtime_progress(self._state_dict())
+        return self._build_observation(
+            last_action_result=last_action_result,
+            tool_output=tool_output,
+            reward=round(reward, 4),
+            done=self._episode["done"],
+        )
+    @property
+    def state(self) -> UnifiedIncidentState:
+        return UnifiedIncidentState(**self._state_dict())
+    def _make_episode(self, scenario: dict[str, Any], episode_id: str | None = None) -> dict[str, Any]:
+        services = {
+            name: ServiceHealth(name=name, **payload)
+            for name, payload in scenario["initial_services"].items()
+        }
+        checks = {
+            "database_recovery": CheckResult(name="database_recovery", passed=False, detail="Database recovery has not been verified yet."),
+            "end_to_end": CheckResult(name="end_to_end", passed=False, detail="End-to-end health has not been verified yet."),
+        }
+        recipe = scenario.get("remediation_recipe", {})
+        rollback_target = recipe.get("rollback_target", "worker")
+        recent_deploy_service = rollback_target if rollback_target in scenario["deploy_history"] else "worker"
+        return {
+            "episode_id": episode_id or str(uuid.uuid4()),
+            "scenario": scenario,
+            "tick": 0,
+            "step_count": 0,
+            "max_ticks": scenario["max_ticks"],
+            "difficulty": scenario["difficulty"],
+            "services": services,
+            "alerts": [Alert(**payload) for payload in scenario["initial_alerts"]],
+            "discovered_evidence": [],
+            "evidence_seen": set(),
+            "recent_deploys": [scenario["deploy_history"].get(recent_deploy_service, "")],
+            "checks": checks,
+            "user_impact": scenario.get("degraded_user_impact", 0.82),
+            "slo_burn_rate": scenario.get("degraded_slo_burn", 0.91),
+            "containment_applied": False,
+            "cause_removed": False,
+            "isolated_service": None,
+            "hypothesis_seen": set(),
+            "failure_type": None,
+            "why_failed": None,
+            "loop_warning": None,
+            "last_action_key": None,
+            "repeat_count": 0,
+            "incident_resolved": False,
+            "workflow_stage": "triage",
+            "cumulative_reward": 0.0,
+            "wasteful_ticks": 0,
+            "score_breakdown": {
+                "recovery_score": 0.0,
+                "containment_score": 0.0,
+                "verification_score": 0.0,
+                "impact_score": 0.0,
+                "efficiency_score": 0.10,
+                "final_score": 0.10,
+            },
+            "final_score": 0.10,
+            "last_action_result": "",
+            "done": False,
+        }
+    def _query_logs(self, service: str | None) -> str:
+        assert service is not None
+        return self._episode["scenario"]["logs"][service]
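+    def _query_metrics(self, service: str | None, metric: str | None) -> str:
+        assert service is not None and metric is not None
+        return self._episode["scenario"]["metrics"].get(service, {}).get(metric, f"No {metric} data recorded for {service}.")  # scenarios define only some (service, metric) pairs; avoid KeyError
+    def _query_dependencies(self, service: str | None) -> str:
+        assert service is not None
+        return self._episode["scenario"]["dependencies"].get(service, f"No dependency data recorded for {service}.")  # some scenarios omit services such as cache; avoid KeyError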
+    def _query_deploys(self, service: str | None) -> str:
+        assert service is not None
+        return self._episode["scenario"]["deploy_history"][service]
+    def _submit_hypothesis(self, action: UnifiedIncidentAction) -> tuple[float, bool, str]:
+        assert action.hypothesis is not None
+        normalized = json.dumps(action.hypothesis.model_dump(), sort_keys=True)
+        if normalized in self._episode["hypothesis_seen"]:
+            return 0.0, False, "Repeated hypothesis recorded with no additional reward."
+        self._episode["hypothesis_seen"].add(normalized)
+        truth = self._episode["scenario"]["truth"]
+        payload = action.hypothesis
+        cause_match = 1.0 if payload.root_cause == truth["root_cause"] else 0.0
+        service_match = len(set(payload.affected_services) & set(truth["affected_services"])) / len(set(truth["affected_services"]))
+        action_quality = 1.0 if payload.recommended_next_action == truth["best_next_action"] else -0.4
+        if cause_match == 1.0:
+            calibration = 1.0 if payload.confidence >= 0.7 else 0.5
+        else:
+            calibration = -1.0 if payload.confidence >= 0.7 else -0.2
+        reward = (0.04 * cause_match) + (0.03 * service_match) + (0.03 * action_quality) + (0.02 * calibration)  # sums to 0.12 when every term is 1.0
+        return round(reward, 4), True, "Hypothesis recorded. Reward reflects root-cause accuracy, service localization, confidence calibration, and next-action quality."
+    def _recipe(self) -> dict[str, Any]:
+        return self._episode["scenario"].get("remediation_recipe", {})
+    def _failure_message(self, key: str, default: str) -> str:
+        return self._episode["scenario"].get("failure_messages", {}).get(key, default)
+    def _apply_service_updates(self, updates: dict[str, dict[str, Any]]) -> None:
+        for name, payload in updates.items():
+            self._episode["services"][name] = ServiceHealth(name=name, **payload)
+    def _rollback_deploy(self, service: str | None) -> tuple[bool, float, str]:
+        assert service is not None
+        recipe = self._recipe()
+        rollback_target = recipe.get("rollback_target")
+        if rollback_target is None or service != rollback_target:
+            self._set_failure(
+                "wrong_remediation_target",
+                self._failure_message("wrong_rollback_target", "Rolling back a service without a causal link wastes time and risk."),
+            )
+            return False, self._unsafe_penalty(), f"Rollback on {service} did not address the incident."
+        if self._episode["cause_removed"]:
+            return False, 0.0, f"{rollback_target} deploy is already rolled back."
+        self._episode["cause_removed"] = True
+        self._episode["containment_applied"] = True
+        self._apply_service_updates(self._episode["scenario"].get("post_rollback_services", {}))
+        scenario = self._episode["scenario"]
+        self._episode["user_impact"] = min(self._episode["user_impact"], scenario.get("post_rollback_user_impact", self._episode["user_impact"]))
+        self._episode["slo_burn_rate"] = min(self._episode["slo_burn_rate"], scenario.get("post_rollback_slo_burn", self._episode["slo_burn_rate"]))
+        return True, 0.0, f"Rolled back the {rollback_target} deploy; the underlying cause is removed."
+    def _restart_service(self, service: str | None) -> tuple[bool, float, str]:
+        assert service is not None
+        recipe = self._recipe()
+        restart_target = recipe.get("restart_target")
+        if restart_target is None or service != restart_target:
+            self._set_failure(
+                "low_value_restart",
+                self._failure_message("low_value_restart", f"Restarting {service} is not the safe next remediation step for this incident."),
+            )
+            return False, self._unsafe_penalty() / 2, f"Restarting {service} had little or no positive effect."
+        if recipe.get("restart_requires_cause_removed", True) and not self._episode["cause_removed"]:
+            self._set_failure(
+                "premature_restart",
+                self._failure_message("premature_restart", f"Restarting {service} before removing the trigger only causes another failure."),
+            )
+            return False, self._unsafe_penalty(), f"Restart of {service} failed because the triggering cause is still present."
+        self._apply_service_updates(self._episode["scenario"].get("post_restart_services", {}))
+        scenario = self._episode["scenario"]
+        self._episode["user_impact"] = scenario.get("post_restart_user_impact", self._episode["user_impact"])
+        self._episode["slo_burn_rate"] = scenario.get("post_restart_slo_burn", self._episode["slo_burn_rate"])
+        return True, 0.0, f"{service} restarted cleanly after the triggering cause was removed."
+    def _isolate_service(self, service: str | None) -> tuple[bool, float, str]:
+        assert service is not None
+        recipe = self._recipe()
+        isolate_target = recipe.get("isolate_target")
+        if isolate_target is None or service != isolate_target:
+            self._set_failure(
+                "wrong_isolation_target",
+                self._failure_message("wrong_isolation_target", f"Isolating {service} does not contain the dominant failure path."),
+            )
+            return False, self._unsafe_penalty() / 2, f"Isolation of {service} did not materially reduce blast radius."
+        if self._episode["isolated_service"] == isolate_target:
+            return False, 0.0, f"{isolate_target} is already isolated."
+        self._episode["isolated_service"] = isolate_target
+        self._episode["containment_applied"] = True
+        self._apply_service_updates(self._episode["scenario"].get("post_isolate_services", {}))
+        scenario = self._episode["scenario"]
+        self._episode["user_impact"] = scenario.get("post_isolate_user_impact", self._episode["user_impact"])
+        self._episode["slo_burn_rate"] = scenario.get("post_isolate_slo_burn", self._episode["slo_burn_rate"])
+        return True, 0.0, f"{isolate_target} isolated. Blast radius shrank, but full resolution still requires addressing the root cause."
+    def _run_check(self, check_name: str | None) -> tuple[str, bool, str]:
+        assert check_name is not None
+        recipe = self._recipe()
+        isolated = self._episode["isolated_service"]
+        cause_removed = self._episode["cause_removed"]
+        services = self._episode["services"]
+        if check_name == "database_recovery":
+            db_healthy = services["database"].status == "healthy"
+            incident_driver = recipe.get("incident_driver")
+            if incident_driver in {"worker", "database"}:
+                passed = db_healthy and cause_removed
+            else:
+                passed = db_healthy
+            detail = (
+                "Database is healthy and no longer failing."
+                if passed
+                else "Database is still unstable or the triggering cause is still present."
+            )
+        else:
+            # Any check other than database_recovery is evaluated as the end-to-end check.
+            gateway_healthy = services["api-gateway"].status == "healthy"
+            db_healthy = services["database"].status == "healthy"
+            worker_healthy = services["worker"].status == "healthy"
+            passed = (
+                gateway_healthy
+                and db_healthy
+                and worker_healthy
+                and cause_removed
+                and isolated is None
+            )
+            detail = (
+                "End-to-end login traffic is healthy."
+                if passed
+                else "End-to-end traffic still fails or remains degraded."
+            )
+        self._episode["checks"][check_name] = CheckResult(name=check_name, passed=passed, detail=detail)
+        useful = self._mark_evidence_once(f"check:{check_name}:{passed}", detail)
+        return detail, useful, f"Ran {check_name} check."
+    def _declare_resolved(self) -> tuple[bool, float, float, str]:
+        checks = self._episode["checks"]
+        resolution_check = self._recipe().get("resolution_check", "end_to_end")
+        safe_to_resolve = bool(checks.get(resolution_check) and checks[resolution_check].passed)
+        if not safe_to_resolve:
+            self._set_failure("premature_resolution", "The incident is not verified as resolved yet.")
+            return False, self._episode["scenario"]["reward_config"]["premature_resolution_penalty"], 0.0, "Resolution declaration rejected: required checks have not passed."
+        self._episode["incident_resolved"] = True
+        self._episode["done"] = True
+        return True, 0.0, self._episode["scenario"]["reward_config"]["successful_resolution_bonus"], "Incident declared resolved after passing objective checks."
+    def _mark_evidence_once(self, key: str, detail: str) -> bool:
+        if key in self._episode["evidence_seen"]:
+            return False
+        self._episode["evidence_seen"].add(key)
+        self._episode["discovered_evidence"].append(detail)
+        return True
+    def _unsafe_penalty(self) -> float:
+        return float(self._episode["scenario"]["reward_config"]["unsafe_action_penalty"])
+    def _set_failure(self, failure_type: str, why_failed: str) -> None:
+        self._episode["failure_type"] = failure_type
+        self._episode["why_failed"] = why_failed
+    def _advance_world(self) -> None:
+        cause_removed = self._episode["cause_removed"]
+        isolated = self._episode["isolated_service"]
+        if not cause_removed and isolated is None:
+            self._apply_service_updates(self._episode["scenario"].get("degraded_services", {}))
+            scenario = self._episode["scenario"]
+            self._episode["user_impact"] = max(self._episode["user_impact"], scenario.get("degraded_user_impact", self._episode["user_impact"]))
+            self._episode["slo_burn_rate"] = max(self._episode["slo_burn_rate"], scenario.get("degraded_slo_burn", self._episode["slo_burn_rate"]))
+        if isolated is not None and not cause_removed:
+            self._episode["containment_applied"] = True
+        self._episode["workflow_stage"] = self._workflow_stage()
+    def _refresh_alerts(self) -> None:
+        alerts: list[Alert] = []
+        for service_name in SERVICE_ORDER:
+            service = self._episode["services"][service_name]
+            if service.status == "crashed":
+                alerts.append(Alert(service=service_name, severity="critical", message=f"{service_name} is unavailable."))
+            elif service.status == "degraded":
+                alerts.append(Alert(service=service_name, severity="warning", message=f"{service_name} is degraded."))
+        if self._episode["user_impact"] >= 0.3 and not any(alert.service == "api-gateway" for alert in alerts):
+            alerts.append(Alert(service="api-gateway", severity="warning", message="User-visible impact remains elevated."))
+        self._episode["alerts"] = alerts
+    def _update_loop_feedback(self, action: UnifiedIncidentAction, progressed: bool) -> None:
+        action_key = repr(action.model_dump(exclude_none=True))
+        if progressed:
+            self._episode["last_action_key"] = action_key
+            self._episode["repeat_count"] = 0
+            return
+        if self._episode["last_action_key"] == action_key:
+            self._episode["repeat_count"] += 1
+        else:
+            self._episode["repeat_count"] = 1
+        self._episode["last_action_key"] = action_key
+        if self._episode["repeat_count"] >= 2:
+            self._episode["loop_warning"] = "The same no-progress action has repeated; choose a different evidence source or remediation step."
+    def _workflow_stage(self) -> str:
+        if self._episode["incident_resolved"]:
+            return "resolved"
+        checks = self._episode["checks"]
+        if checks["database_recovery"].passed or checks["end_to_end"].passed:
+            return "validation"
+        if self._episode["containment_applied"] or self._episode["cause_removed"] or self._episode["isolated_service"] is not None:
+            return "mitigation"
+        return "triage"
+    def _allowed_actions(self) -> list[str]:
+        return list(ALL_ACTIONS)
+    def _required_fields_by_action(self) -> dict[str, list[str]]:
+        return {action: REQUIRED_FIELDS_BY_ACTION[action] for action in self._allowed_actions()}
+    def _progress_flags(self) -> dict[str, bool]:
+        checks = self._episode["checks"]
+        return {
+            "containment_applied": self._episode["containment_applied"],
+            "cause_removed": self._episode["cause_removed"],
+            "database_recovery": checks["database_recovery"].passed,
+            "end_to_end": checks["end_to_end"].passed,
+            "incident_resolved": self._episode["incident_resolved"],
+            "isolation_applied": self._episode["isolated_service"] is not None,
+        }
+    def _incident_summary(self) -> str:
+        description = self._episode["scenario"].get("description")
+        if description:
+            return description
+        return (
+            "An incident is degrading user traffic. Use evidence-gathering actions to diagnose, "
+            "then choose a safe remediation and verify with explicit checks."
+        )
+    def _prompt_text(self, tool_output: str | None) -> str:
+        lines = [
+            f"TICK {self._episode['tick']}/{self._episode['max_ticks']}",
+            f"WORKFLOW_STAGE: {self._episode['workflow_stage']}",
+            "",
+            "INCIDENT_SUMMARY:",
+            self._incident_summary(),
+            "",
+            "ACTIVE_ALERTS:",
+        ]
+        if self._episode["alerts"]:
+            lines.extend(f"- [{alert.severity.upper()}] {alert.service}: {alert.message}" for alert in self._episode["alerts"])
+        else:
+            lines.append("- none")
+        lines.extend([
+            "",
+            "SERVICES:",
+        ])
+        for service_name in SERVICE_ORDER:
+            health = self._episode["services"][service_name]
+            lines.append(
+                f"- {service_name}: {health.status} cpu={health.cpu_pct:.1f} mem={health.memory_pct:.1f} err={health.error_rate_pct:.1f} latency={health.latency_ms:.1f}"
+            )
+        lines.extend([
+            "",
+            f"USER_IMPACT: {self._episode['user_impact']:.2f}",
+            f"SLO_BURN_RATE: {self._episode['slo_burn_rate']:.2f}",
+            f"LAST_ACTION_RESULT: {self._episode['last_action_result'] or 'none'}",
+            f"TOOL_OUTPUT: {tool_output or 'none'}",
+            f"FAILURE_TYPE: {self._episode['failure_type'] or 'none'}",
+            f"WHY_FAILED: {self._episode['why_failed'] or 'none'}",
+            "",
+            "CHECKS:",
+        ])
+        for check in self._episode["checks"].values():
+            lines.append(f"- {check.name}: {'passed' if check.passed else 'pending'} - {check.detail}")
+        lines.extend([
+            "",
+            "ALLOWED_ACTIONS:",
+        ])
+        lines.extend(f"- {action}" for action in self._allowed_actions())
+        return "\n".join(lines)
+    def _incident_health_potential(self) -> float:
+        """Blend service health (55%), impact relief (20%), burn relief (15%), and containment (10%)."""
+        weights = self._episode["scenario"]["critical_service_weights"]
+        services = self._episode["services"]
+        operational = sum(weights.get(name, 0.0) * STATUS_VALUES[services[name].status] for name in weights)
+        impact_relief = 1.0 - self._episode["user_impact"]
+        burn_relief = 1.0 - self._episode["slo_burn_rate"]
+        containment = 1.0 if self._episode["containment_applied"] else 0.0
+        return round((0.55 * operational) + (0.2 * impact_relief) + (0.15 * burn_relief) + (0.10 * containment), 4)
+    def _state_dict(self) -> dict[str, Any]:
+        return {
+            "episode_id": self._episode["episode_id"],
+            "step_count": self._episode["step_count"],
+            "scenario_id": self._episode["scenario"]["id"],
+            "difficulty": self._episode["difficulty"],
+            "current_tick": self._episode["tick"],
+            "max_ticks": self._episode["max_ticks"],
+            "workflow_stage": self._episode["workflow_stage"],
+            "active_alerts": [alert.model_dump() for alert in self._episode["alerts"]],
+            "service_health": {name: service.model_dump() for name, service in self._episode["services"].items()},
+            "discovered_evidence": list(self._episode["discovered_evidence"]),
+            "recent_deploys": list(self._episode["recent_deploys"]),
+            "checks": [check.model_dump() for check in self._episode["checks"].values()],
+            "user_impact": self._episode["user_impact"],
+            "slo_burn_rate": self._episode["slo_burn_rate"],
+            "incident_resolved": self._episode["incident_resolved"],
+            "containment_applied": self._episode["containment_applied"],
+            "allowed_actions": self._allowed_actions(),
+            "required_fields_by_action": self._required_fields_by_action(),
+            "valid_action_example": None,
+            "progress_flags": self._progress_flags(),
+            "final_score": self._episode["final_score"],
+            "score_breakdown": dict(self._episode["score_breakdown"]),
+            "cumulative_reward": self._episode["cumulative_reward"],
+            "wasteful_ticks": self._episode["wasteful_ticks"],
+            "last_action_result": self._episode["last_action_result"],
+            "failure_type": self._episode["failure_type"],
+            "why_failed": self._episode["why_failed"],
+        }
+    def _build_observation(self, last_action_result: str, tool_output: str | None, reward: float, done: bool) -> UnifiedIncidentObservation:
+        return UnifiedIncidentObservation(
+            prompt_text=self._prompt_text(tool_output),
+            incident_summary=self._incident_summary(),
+            tick_count=self._episode["tick"],
+            max_ticks=self._episode["max_ticks"],
+            difficulty=self._episode["difficulty"],
+            workflow_stage=self._episode["workflow_stage"],
+            active_alerts=list(self._episode["alerts"]),
+            service_health=dict(self._episode["services"]),
+            discovered_evidence=list(self._episode["discovered_evidence"]),
+            recent_deploys=list(self._episode["recent_deploys"]),
+            checks=list(self._episode["checks"].values()),
+            user_impact=self._episode["user_impact"],
+            slo_burn_rate=self._episode["slo_burn_rate"],
+            incident_resolved=self._episode["incident_resolved"],
+            containment_applied=self._episode["containment_applied"],
+            last_action_result=last_action_result,
+            tool_output=tool_output,
+            failure_type=self._episode["failure_type"],
+            why_failed=self._episode["why_failed"],
+            allowed_actions=self._allowed_actions(),
+            required_fields_by_action=self._required_fields_by_action(),
+            valid_action_example=None,
+            common_trap=self._episode["scenario"].get("description"),
+            loop_warning=self._episode["loop_warning"],
+            blocked_until_security_complete=False,
+            security_unlock_reason=None,
+            best_recovery_action_family=None,
+            progress_flags=self._progress_flags(),
+            security_subquest_status=None,
+            security_context={},
+            final_score=self._episode["final_score"],
+            score_breakdown=dict(self._episode["score_breakdown"]),
+            reward=round(reward, 4),
+            done=done,
+        )
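
The weighted blend in `_incident_health_potential` can be sketched as a standalone function. This is a minimal sketch, not the repository's code: `STATUS_VALUES` is referenced by the diff but not shown in it, so the mapping below is an assumption, and the function name `health_potential` and the example weights are hypothetical.

```python
# Assumed status-to-value map; the real STATUS_VALUES lives elsewhere in the repo
# and may use different numbers or include more statuses.
STATUS_VALUES = {"healthy": 1.0, "degraded": 0.5, "crashed": 0.0}


def health_potential(
    weights: dict[str, float],
    statuses: dict[str, str],
    user_impact: float,
    slo_burn_rate: float,
    containment_applied: bool,
) -> float:
    """Return the same 55/20/15/10 blend used by _incident_health_potential."""
    # Weighted operational health across the critical services.
    operational = sum(w * STATUS_VALUES[statuses[name]] for name, w in weights.items())
    impact_relief = 1.0 - user_impact
    burn_relief = 1.0 - slo_burn_rate
    containment = 1.0 if containment_applied else 0.0
    return round(
        0.55 * operational + 0.2 * impact_relief + 0.15 * burn_relief + 0.10 * containment,
        4,
    )


# Hypothetical weights that sum to 1.0, for illustration only.
example_weights = {"api-gateway": 0.4, "database": 0.4, "worker": 0.2}
all_healthy = {name: "healthy" for name in example_weights}
all_crashed = {name: "crashed" for name in example_weights}

best = health_potential(example_weights, all_healthy, 0.0, 0.0, True)
worst = health_potential(example_weights, all_crashed, 1.0, 1.0, False)
```

With weights that sum to 1.0, the potential ranges from 0.0 (everything crashed, full impact, no containment) to 1.0 (everything healthy and contained), which makes the difference between two consecutive values usable as a shaping signal.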

unified_incident_env/server/grader.py ADDED

	@@ -0,0 +1,145 @@

+"""Deterministic public scoring for the honest narrow incident-remediation environment."""
+from __future__ import annotations
+from typing import Any
+from ..models import GraderCheck, GraderReport
+MIN_PUBLIC_SCORE = 0.01
+MAX_PUBLIC_SCORE = 0.99
+def _strict_public_score(score: float) -> float:
+    """Clamp to [MIN_PUBLIC_SCORE, MAX_PUBLIC_SCORE] and round to four decimals."""
+    return round(min(MAX_PUBLIC_SCORE, max(MIN_PUBLIC_SCORE, score)), 4)
+def _service_score(status: str) -> float:
+    return {
+        "healthy": 1.0,
+        "degraded": 0.4,
+        "crashed": 0.0,
+        "isolated": 0.2,
+    }.get(status, 0.0)
+class UnifiedIncidentGrader:
+    """Deterministic scorer focused on executed effects, not scripted clues."""
+    def compute_breakdown(
+        self,
+        state: dict[str, Any],
+        scenario: dict[str, Any],
+    ) -> dict[str, float]:
+        services = state.get("service_health", {})
+        weights = scenario["critical_service_weights"]
+        recovery_score = round(
+            sum(
+                weights.get(service, 0.0) * _service_score((services.get(service) or {}).get("status", "crashed"))
+                for service in weights
+            ),
+            4,
+        )
+        # Containment earns 0.2, or 0.3 when the worker is also healthy.
+        containment_score = 0.2 if state.get("containment_applied") else 0.0
+        if state.get("containment_applied") and (services.get("worker") or {}).get("status") == "healthy":
+            containment_score = 0.3
+        checks = {item.get("name"): bool(item.get("passed")) for item in state.get("checks", [])}
+        verification_score = 0.0
+        if checks.get("database_recovery"):
+            verification_score += 0.15
+        if checks.get("end_to_end"):
+            verification_score += 0.2
+        user_impact = float(state.get("user_impact", 1.0))
+        impact_score = round(max(0.0, 0.15 * (1.0 - user_impact)), 4)
+        wasteful_ticks = int(state.get("wasteful_ticks", 0))
+        efficiency_score = round(max(0.0, 0.10 - (0.01 * wasteful_ticks)), 4)
+        final_score = _strict_public_score(
+            recovery_score + containment_score + verification_score + impact_score + efficiency_score
+        )
+        return {
+            "recovery_score": recovery_score,
+            "containment_score": round(containment_score, 4),
+            "verification_score": round(verification_score, 4),
+            "impact_score": impact_score,
+            "efficiency_score": efficiency_score,
+            "final_score": final_score,
+        }
+    def build_report(self, state: dict[str, Any], scenario: dict[str, Any]) -> GraderReport:
+        breakdown = self.compute_breakdown(state, scenario)
+        checks = {item.get("name"): bool(item.get("passed")) for item in state.get("checks", [])}
+        passed = bool(
+            state.get("incident_resolved")
+            and checks.get("database_recovery")
+            and checks.get("end_to_end")
+        )
+        report_checks = [
+            GraderCheck(
+                name="root_cause_removed",
+                passed=bool(state.get("containment_applied")),
+                detail=(
+                    "The root cause has been safely contained or removed."
+                    if state.get("containment_applied")
+                    else "The root cause is still active or only partially contained."
+                ),
+                weight=0.30,
+            ),
+            GraderCheck(
+                name="database_recovery",
+                passed=checks.get("database_recovery", False),
+                detail=(
+                    "The database recovery check passed."
+                    if checks.get("database_recovery")
+                    else "The database recovery check has not passed yet."
+                ),
+                weight=0.20,
+            ),
+            GraderCheck(
+                name="end_to_end_check",
+                passed=checks.get("end_to_end", False),
+                detail=(
+                    "The end-to-end service check passed."
+                    if checks.get("end_to_end")
+                    else "The end-to-end service check has not passed yet."
+                ),
+                weight=0.20,
+            ),
+            GraderCheck(
+                name="critical_services_recovered",
+                passed=breakdown["recovery_score"] >= 0.8,
+                detail=(
+                    "Critical-path services are recovered."
+                    if breakdown["recovery_score"] >= 0.8
+                    else "Critical-path services are still degraded or crashed."
+                ),
+                weight=0.20,
+            ),
+            GraderCheck(
+                name="declare_resolved",
+                passed=bool(state.get("incident_resolved")),
+                detail=(
+                    "The agent declared the incident resolved after objective checks passed."
+                    if state.get("incident_resolved")
+                    else "The incident has not been safely declared resolved."
+                ),
+                weight=0.10,
+            ),
+        ]
+        return GraderReport(
+            scenario_id=scenario["id"],
+            passed=passed,
+            score=breakdown["final_score"],
+            message=(
+                "Incident diagnosed, remediated, and verified honestly."
+                if passed
+                else "Incident is not yet safely resolved."
+            ),
+            breakdown=breakdown,
+            checks=report_checks,
+        )
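
The arithmetic in `compute_breakdown` can be followed with a worked example on plain dictionaries. This is a minimal sketch under stated assumptions: the function name `sketch_breakdown`, the example weights, and the example state are all hypothetical, and the real scenario weights are defined in `challenge.py`, which is not shown here.

```python
MIN_PUBLIC_SCORE = 0.01
MAX_PUBLIC_SCORE = 0.99
SERVICE_SCORES = {"healthy": 1.0, "degraded": 0.4, "crashed": 0.0, "isolated": 0.2}


def strict_public_score(score: float) -> float:
    # Clamp into the open-ended public range, then round.
    return round(min(MAX_PUBLIC_SCORE, max(MIN_PUBLIC_SCORE, score)), 4)


def sketch_breakdown(state: dict, weights: dict[str, float]) -> dict[str, float]:
    """Recompute the five score components from a state dict, as the grader does."""
    services = state.get("service_health", {})
    recovery = round(
        sum(
            w * SERVICE_SCORES.get((services.get(name) or {}).get("status", "crashed"), 0.0)
            for name, w in weights.items()
        ),
        4,
    )
    containment = 0.2 if state.get("containment_applied") else 0.0
    if state.get("containment_applied") and (services.get("worker") or {}).get("status") == "healthy":
        containment = 0.3
    checks = {c.get("name"): bool(c.get("passed")) for c in state.get("checks", [])}
    verification = (0.15 if checks.get("database_recovery") else 0.0) + (
        0.2 if checks.get("end_to_end") else 0.0
    )
    impact = round(max(0.0, 0.15 * (1.0 - float(state.get("user_impact", 1.0)))), 4)
    efficiency = round(max(0.0, 0.10 - 0.01 * int(state.get("wasteful_ticks", 0))), 4)
    return {
        "recovery_score": recovery,
        "containment_score": containment,
        "verification_score": round(verification, 4),
        "impact_score": impact,
        "efficiency_score": efficiency,
        "final_score": strict_public_score(recovery + containment + verification + impact + efficiency),
    }


# Hypothetical mid-incident state: both weighted services degraded, nothing verified yet.
example_state = {
    "service_health": {"database": {"status": "degraded"}, "worker": {"status": "degraded"}},
    "containment_applied": False,
    "checks": [],
    "user_impact": 0.5,
    "wasteful_ticks": 3,
}
result = sketch_breakdown(example_state, {"database": 0.2, "worker": 0.1})
```

For this state the components are 0.12 (recovery), 0.0 (containment), 0.0 (verification), 0.075 (impact), and 0.07 (efficiency), for a final score of 0.265. The clamp means the public score never reaches exactly 0 or 1, whatever the components sum to.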

unified_incident_env/tests/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Tests for the unified incident environment."""

unified_incident_env/tests/test_environment.py ADDED

	@@ -0,0 +1,192 @@

+"""Behavior and API tests for the honest narrow incident environment."""
+from __future__ import annotations
+from fastapi.testclient import TestClient
+from unified_incident_env.models import HypothesisPayload, UnifiedIncidentAction
+from unified_incident_env.server import app as app_module
+from unified_incident_env.server.challenge import DEFAULT_SCENARIO_ID, list_baselines
+from unified_incident_env.server.environment import UnifiedIncidentEnvironment
+def _run_baseline(env: UnifiedIncidentEnvironment):
+    env.reset(scenario_id=DEFAULT_SCENARIO_ID)
+    last = None
+    for step in list_baselines(DEFAULT_SCENARIO_ID).baselines[0].actions:
+        last = env.step(step.action)
+    return last
+def test_baseline_resolves_honestly() -> None:
+    env = UnifiedIncidentEnvironment()
+    obs = _run_baseline(env)
+    assert obs is not None
+    assert obs.done is True
+    assert obs.incident_resolved is True
+    checks = {check.name: check.passed for check in obs.checks}
+    assert checks["database_recovery"] is True
+    assert checks["end_to_end"] is True
+    assert obs.final_score > 0.7
+def test_query_deploys_reveals_evidence_but_not_positive_reward() -> None:
+    env = UnifiedIncidentEnvironment()
+    env.reset(scenario_id=DEFAULT_SCENARIO_ID)
+    obs = env.step(UnifiedIncidentAction(action_type="query_deploys", service="worker"))
+    assert obs.reward <= 0.0
+    assert "worker@2026.04.23-bad" in (obs.tool_output or "")
+    assert obs.incident_resolved is False
+def test_restart_database_before_rollback_is_negative() -> None:
+    env = UnifiedIncidentEnvironment()
+    env.reset(scenario_id=DEFAULT_SCENARIO_ID)
+    obs = env.step(UnifiedIncidentAction(action_type="restart_service", service="database"))
+    assert obs.reward < 0.0
+    assert obs.failure_type == "premature_restart"
+    assert obs.incident_resolved is False
+    assert obs.service_health["database"].status == "crashed"
+def test_duplicate_hypothesis_bonus_is_not_farmable() -> None:
+    env = UnifiedIncidentEnvironment()
+    env.reset(scenario_id=DEFAULT_SCENARIO_ID)
+    action = UnifiedIncidentAction(
+        action_type="submit_hypothesis",
+        hypothesis=HypothesisPayload(
+            root_cause="bad_worker_deploy",
+            affected_services=["worker", "database", "api-gateway"],
+            confidence=0.82,
+            recommended_next_action="rollback_deploy",
+        ),
+    )
+    first = env.step(action)
+    second = env.step(action)
+    assert first.reward > second.reward
+    assert second.reward <= 0.0
+def test_isolating_worker_contains_but_does_not_resolve() -> None:
+    env = UnifiedIncidentEnvironment()
+    env.reset(scenario_id=DEFAULT_SCENARIO_ID)
+    isolated = env.step(UnifiedIncidentAction(action_type="isolate_service", service="worker"))
+    assert isolated.containment_applied is True
+    assert isolated.incident_resolved is False
+    checked = env.step(UnifiedIncidentAction(action_type="run_check", check_name="end_to_end"))
+    checks = {check.name: check.passed for check in checked.checks}
+    assert checks["end_to_end"] is False
+def test_declare_resolved_requires_checks() -> None:
+    env = UnifiedIncidentEnvironment()
+    env.reset(scenario_id=DEFAULT_SCENARIO_ID)
+    obs = env.step(UnifiedIncidentAction(action_type="declare_resolved"))
+    assert obs.reward < 0.0
+    assert obs.done is False
+    assert obs.failure_type == "premature_resolution"
+def test_observation_exposes_bounded_actions_without_valid_example() -> None:
+    env = UnifiedIncidentEnvironment()
+    obs = env.reset(scenario_id=DEFAULT_SCENARIO_ID)
+    assert obs.allowed_actions == [
+        "query_logs",
+        "query_metrics",
+        "query_dependencies",
+        "query_deploys",
+        "rollback_deploy",
+        "restart_service",
+        "run_check",
+        "isolate_service",
+        "escalate",
+        "submit_hypothesis",
+        "declare_resolved",
+    ]
+    assert obs.valid_action_example is None
+def test_routes_expose_new_catalog_and_status(monkeypatch) -> None:
+    monkeypatch.setenv("ENABLE_WEB_INTERFACE", "false")
+    client = TestClient(app_module.create_compatible_app())
+    tasks = client.get("/tasks")
+    assert tasks.status_code == 200
+    payload = tasks.json()
+    assert payload["default_scenario_id"] == DEFAULT_SCENARIO_ID
+    scenarios_by_difficulty = {scenario["difficulty"] for scenario in payload["scenarios"]}
+    assert {"easy", "medium", "hard"}.issubset(scenarios_by_difficulty)
+    assert {"easy", "medium", "hard"}.issubset(set(payload["available_difficulties"]))
+    baseline = client.get("/baseline")
+    assert baseline.status_code == 200
+    baseline_payload = baseline.json()
+    baseline_ids = {item["scenario_id"] for item in baseline_payload["baselines"]}
+    assert {"worker_deploy_cascade", "db_config_rollout", "gateway_auth_rollout"}.issubset(baseline_ids)
+    health = client.get("/health")
+    assert health.status_code == 200
+    assert health.json()["status"] in {"ok", "healthy"}
+    status = client.get("/status")
+    assert status.status_code == 200
+    status_payload = status.json()
+    assert status_payload["progress"]["scenario_id"] == DEFAULT_SCENARIO_ID
+    assert status_payload["grader"]["score"] > 0.0
+def _run_baseline_for_scenario(scenario_id: str):
+    env = UnifiedIncidentEnvironment()
+    env.reset(scenario_id=scenario_id)
+    last = None
+    for step in list_baselines(scenario_id).baselines[0].actions:
+        last = env.step(step.action)
+    return last
+def test_medium_baseline_resolves_honestly() -> None:
+    obs = _run_baseline_for_scenario("db_config_rollout")
+    assert obs is not None
+    assert obs.done is True
+    assert obs.incident_resolved is True
+    checks = {check.name: check.passed for check in obs.checks}
+    assert checks["database_recovery"] is True
+    assert checks["end_to_end"] is True
+    assert obs.final_score > 0.7
+def test_hard_baseline_resolves_honestly() -> None:
+    obs = _run_baseline_for_scenario("gateway_auth_rollout")
+    assert obs is not None
+    assert obs.done is True
+    assert obs.incident_resolved is True
+    checks = {check.name: check.passed for check in obs.checks}
+    assert checks["end_to_end"] is True
+    assert obs.final_score > 0.7
+def test_medium_wrong_rollback_target_is_penalized() -> None:
+    env = UnifiedIncidentEnvironment()
+    env.reset(scenario_id="db_config_rollout")
+    obs = env.step(UnifiedIncidentAction(action_type="rollback_deploy", service="worker"))
+    assert obs.reward < 0.0
+    assert obs.failure_type == "wrong_remediation_target"
+    assert obs.incident_resolved is False
+def test_hard_wrong_rollback_target_is_penalized() -> None:
+    env = UnifiedIncidentEnvironment()
+    env.reset(scenario_id="gateway_auth_rollout")
+    obs = env.step(UnifiedIncidentAction(action_type="rollback_deploy", service="worker"))
+    assert obs.reward < 0.0
+    assert obs.failure_type == "wrong_remediation_target"
+def test_hard_does_not_require_database_recovery_check() -> None:
+    env = UnifiedIncidentEnvironment()
+    env.reset(scenario_id="gateway_auth_rollout")
+    env.step(UnifiedIncidentAction(action_type="rollback_deploy", service="api-gateway"))
+    end_to_end = env.step(UnifiedIncidentAction(action_type="run_check", check_name="end_to_end"))
+    assert any(check.name == "end_to_end" and check.passed for check in end_to_end.checks)
+    resolved = env.step(UnifiedIncidentAction(action_type="declare_resolved"))
+    assert resolved.incident_resolved is True

unified_incident_env/tests/test_submission_inference.py ADDED

	@@ -0,0 +1,119 @@

+from __future__ import annotations
+import inference
+from unified_incident_env.models import Alert, CheckResult, ServiceHealth, UnifiedIncidentObservation
+def make_observation(**overrides: object) -> UnifiedIncidentObservation:
+    defaults = {
+        "prompt_text": "Honest incident prompt",
+        "incident_summary": "Worker deploy is overloading the database.",
+        "tick_count": 0,
+        "max_ticks": 12,
+        "difficulty": "easy",
+        "workflow_stage": "triage",
+        "active_alerts": [
+            Alert(service="database", severity="critical", message="database crashing"),
+            Alert(service="worker", severity="warning", message="worker retry volume elevated"),
+        ],
+        "service_health": {
+            "api-gateway": ServiceHealth(name="api-gateway", status="degraded", cpu_pct=61.0, memory_pct=38.0, error_rate_pct=24.0, latency_ms=640.0),
+            "cache": ServiceHealth(name="cache", status="healthy", cpu_pct=18.0, memory_pct=24.0, error_rate_pct=0.0, latency_ms=14.0),
+            "database": ServiceHealth(name="database", status="crashed", cpu_pct=99.0, memory_pct=97.0, error_rate_pct=100.0, latency_ms=0.0),
+            "worker": ServiceHealth(name="worker", status="degraded", cpu_pct=88.0, memory_pct=71.0, error_rate_pct=19.0, latency_ms=420.0),
+        },
+        "discovered_evidence": [],
+        "recent_deploys": ["Rolled out worker@2026.04.23-bad 12 minutes ago."],
+        "checks": [
+            CheckResult(name="database_recovery", passed=False, detail="Database recovery has not been verified yet."),
+            CheckResult(name="end_to_end", passed=False, detail="End-to-end health has not been verified yet."),
+        ],
+        "user_impact": 0.82,
+        "slo_burn_rate": 0.91,
+        "incident_resolved": False,
+        "containment_applied": False,
+        "last_action_result": "",
+        "tool_output": None,
+        "failure_type": None,
+        "why_failed": None,
+        "allowed_actions": [
+            "query_logs",
+            "query_metrics",
+            "query_dependencies",
+            "query_deploys",
+            "rollback_deploy",
+            "restart_service",
+            "run_check",
+            "isolate_service",
+            "escalate",
+            "submit_hypothesis",
+            "declare_resolved",
+        ],
+        "required_fields_by_action": {
+            "query_logs": ["service"],
+            "query_metrics": ["service", "metric"],
+            "query_dependencies": ["service"],
+            "query_deploys": ["service"],
+            "rollback_deploy": ["service"],
+            "restart_service": ["service"],
+            "run_check": ["check_name"],
+            "isolate_service": ["service"],
+            "escalate": [],
+            "submit_hypothesis": ["hypothesis"],
+            "declare_resolved": [],
+        },
+        "valid_action_example": None,
+        "common_trap": None,
+        "loop_warning": None,
+        "blocked_until_security_complete": False,
+        "security_unlock_reason": None,
+        "best_recovery_action_family": None,
+        "progress_flags": {},
+        "security_subquest_status": None,
+        "security_context": {},
+        "final_score": 0.1,
+        "score_breakdown": {"final_score": 0.1},
+        "reward": 0.0,
+        "done": False,
+    }
+    defaults.update(overrides)
+    return UnifiedIncidentObservation(**defaults)
+def test_log_helpers_match_required_format(capsys) -> None:
+    inference.log_start(task="worker_deploy_cascade", env="unified-incident-env", model="demo-model")
+    inference.log_step(step=2, action='{"action_type":"query_logs","service":"database"}', reward=-0.01, done=False, error=None)
+    inference.log_end(success=True, steps=2, score=0.37, rewards=[-0.01, 0.27])
+    captured = capsys.readouterr().out.strip().splitlines()
+    assert captured == [
+        "[START] task=worker_deploy_cascade env=unified-incident-env model=demo-model",
+        '[STEP] step=2 action={"action_type":"query_logs","service":"database"} reward=-0.01 done=false error=null',
+        "[END] success=true steps=2 score=0.37 rewards=-0.01,0.27",
+    ]
+def test_parse_action_accepts_valid_json() -> None:
+    observation = make_observation()
+    action = inference.parse_action('{"action_type":"query_deploys","service":"worker"}', observation)
+    assert action == inference.UnifiedIncidentAction(action_type="query_deploys", service="worker")
+def test_parse_action_rejects_incomplete_metric_query() -> None:
+    observation = make_observation()
+    assert inference.parse_action('{"action_type":"query_metrics","service":"database"}', observation) is None
+def test_build_user_prompt_includes_public_state_without_examples() -> None:
+    observation = make_observation()
+    prompt = inference.build_user_prompt(observation)
+    assert "Incident summary:" in prompt
+    assert "Allowed actions:" in prompt
+    assert "Required fields:" in prompt
+    assert "Valid example" not in prompt
+    assert "worker@2026.04.23-bad" not in prompt
+def test_build_fallback_action_prefers_public_deploy_query() -> None:
+    observation = make_observation()
+    action = inference.build_fallback_action(observation)
+    assert action == inference.UnifiedIncidentAction(action_type="query_deploys", service="worker")
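The expected strings in `test_log_helpers_match_required_format` pin down the stdout format: one `[START]`, `[STEP]`, or `[END]` line per event, lowercase booleans, `null` for a missing error, and two-decimal rewards. A minimal sketch of formatters that would produce those lines is shown here. The names `format_start`, `format_step`, and `format_end` are hypothetical, and the two-decimal rounding is an assumption inferred from the test values; this is not the content of `inference.py`.

```python
from typing import List, Optional


def _flag(value: bool) -> str:
    # The expected log lines use lowercase JSON-style booleans.
    return "true" if value else "false"


def format_start(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def format_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> str:
    shown_error = "null" if error is None else error
    return (
        f"[STEP] step={step} action={action} "
        f"reward={reward:.2f} done={_flag(done)} error={shown_error}"
    )


def format_end(success: bool, steps: int, score: float, rewards: List[float]) -> str:
    # Assumption: rewards are comma-joined with two decimals, as in the test data.
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return f"[END] success={_flag(success)} steps={steps} score={score:.2f} rewards={joined}"
```

Returning strings instead of printing keeps the format checkable without `capsys`; a thin wrapper could pass each result to `print(..., flush=True)`.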

unified_incident_env/tests/test_trainer.py ADDED

	@@ -0,0 +1,46 @@

+"""Smoke tests for reusable trainer-shell pieces after the v2 pivot."""
+from __future__ import annotations
+from pathlib import Path
+from unified_incident_env.trainer.trajectory_memory import CorrectionMemory
+from unified_incident_env.trainer.trajectory_store import TrajectoryStore
+from unified_incident_env.trainer.types import EpisodeRecord, StepRecord
+def test_correction_memory_empty_prompt_is_safe() -> None:
+    memory = CorrectionMemory()
+    addendum = memory.build_prompt_addendum("worker_deploy_cascade", "triage")
+    assert isinstance(addendum, str)
+def test_trajectory_store_roundtrip(tmp_path: Path) -> None:
+    store = TrajectoryStore(tmp_path / "episodes.jsonl")
+    record = EpisodeRecord(
+        run_id="run-1",
+        scenario_id="worker_deploy_cascade",
+        difficulty="easy",
+        model_name="stub",
+        mode="strict",
+        success=False,
+        final_score=0.1,
+        steps=1,
+        elapsed_s=0.01,
+        step_records=[
+            StepRecord(
+                step_index=1,
+                tick=1,
+                workflow_stage="triage",
+                observation={},
+                prompt_text="prompt",
+                raw_model_output="{}",
+                parse_status="invalid_json",
+                reward=None,
+            )
+        ],
+    )
+    store.append_episode(record)
+    loaded = store.load_episodes()
+    assert len(loaded) == 1
+    assert loaded[0].scenario_id == "worker_deploy_cascade"
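The round-trip test only relies on two operations, `append_episode` and `load_episodes`, backed by an `episodes.jsonl` file. A minimal sketch of that append-then-load pattern follows, assuming one JSON object per line. `MiniEpisode` and `MiniJsonlStore` are hypothetical stand-ins; the real `TrajectoryStore` and `EpisodeRecord` are not shown in this part of the diff and may differ.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class MiniEpisode:
    # Hypothetical, much smaller than the real EpisodeRecord.
    run_id: str
    scenario_id: str
    final_score: float


class MiniJsonlStore:
    """Append-only JSONL store: one episode per line."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def append_episode(self, record: MiniEpisode) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(asdict(record)) + "\n")

    def load_episodes(self) -> List[MiniEpisode]:
        if not self.path.exists():
            return []
        with self.path.open("r", encoding="utf-8") as handle:
            # Skip blank lines so a trailing newline never breaks loading.
            return [MiniEpisode(**json.loads(line)) for line in handle if line.strip()]


def roundtrip_demo() -> List[MiniEpisode]:
    with tempfile.TemporaryDirectory() as tmp:
        store = MiniJsonlStore(Path(tmp) / "episodes.jsonl")
        store.append_episode(MiniEpisode("run-1", "worker_deploy_cascade", 0.1))
        return store.load_episodes()
```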

unified_incident_env/tests/test_trainer_session.py ADDED

	@@ -0,0 +1,32 @@

+"""Smoke tests for session/report shells after the v2 pivot."""
+from __future__ import annotations
+from unified_incident_env.trainer.reporting import build_phase_deltas
+from unified_incident_env.trainer.types import SessionPhaseReport
+def test_build_phase_deltas_handles_simple_progression() -> None:
+    phases = [
+        SessionPhaseReport(
+            phase_name="probe",
+            episode_ids=[1, 2],
+            avg_score=0.2,
+            success_rate=0.0,
+            schema_failures=1,
+            loop_failures=1,
+            updates_applied=[],
+        ),
+        SessionPhaseReport(
+            phase_name="final_evaluation",
+            episode_ids=[3, 4],
+            avg_score=0.8,
+            success_rate=1.0,
+            schema_failures=0,
+            loop_failures=0,
+            updates_applied=[],
+        ),
+    ]
+    deltas = build_phase_deltas(phases)
+    assert deltas[1].phase_name == "final_evaluation"
+    assert abs(deltas[1].score_delta - 0.6) < 1e-9
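The test above checks the difference of two phase averages (0.8 - 0.2) against 0.6. Binary floats cannot represent these decimals exactly, so a raw subtraction is not guaranteed to equal the literal `0.6` unless `build_phase_deltas` rounds its result, which this diff does not show. A minimal sketch of a tolerance-based comparison follows; `score_delta` and `deltas_match` are hypothetical helpers, not part of `unified_incident_env.trainer.reporting`.

```python
import math


def score_delta(previous_avg: float, current_avg: float) -> float:
    # Plain subtraction, with no rounding applied.
    return current_avg - previous_avg


def deltas_match(actual: float, expected: float, tol: float = 1e-9) -> bool:
    # Absolute tolerance is appropriate here because scores live in [0, 1].
    return math.isclose(actual, expected, rel_tol=0.0, abs_tol=tol)
```

In a pytest suite the same check is usually written as `assert value == pytest.approx(0.6)`.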

unified_incident_env/trainer/__init__.py ADDED

	@@ -0,0 +1,8 @@

+"""Trainer package namespace.
+This package intentionally avoids eager importing of legacy trainer flows so the
+honest v2 environment can reuse shell utilities without pulling in deprecated
+benchmark-specific modules at import time.
+"""
+__all__: list[str] = []

unified_incident_env/trainer/action_adapter.py ADDED

	@@ -0,0 +1,204 @@

+"""Strict and lenient action parsers for training and eval."""
+from __future__ import annotations
+import json
+from typing import Any
+from ..models import ActionType, UnifiedIncidentAction
+from .types import ParseResult
+_ALLOWED_KEYS = {
+    "action_type",
+    "service",
+    "metric",
+    "vulnerability_type",
+    "patch_id",
+    "postmortem",
+}
+_KNOWN_ACTIONS: set[str] = {
+    "query_logs",
+    "query_metrics",
+    "query_dependencies",
+    "restart_service",
+    "rollback_deploy",
+    "inspect_code",
+    "classify_vulnerability",
+    "apply_patch",
+    "verify_security_fix",
+    "submit_security_fix",
+    "submit_postmortem",
+}
+def _extract_json_text(raw_text: str) -> str:
+    text = raw_text.strip()
+    if "```" in text:
+        parts = text.split("```")
+        if len(parts) >= 2:
+            text = parts[1]
+            if text.startswith("json"):
+                text = text[4:]
+    start = text.find("{")
+    end = text.rfind("}")
+    if start != -1 and end != -1 and start < end:
+        text = text[start : end + 1]
+    return text.strip()
+def _compact_action(action: UnifiedIncidentAction) -> dict[str, Any]:
+    payload = action.model_dump(exclude_none=True)
+    if payload.get("metadata") == {}:
+        payload.pop("metadata", None)
+    return payload
+class StrictActionParser:
+    """Parser for judge-style evaluation; accepts a bare action name only for three security actions."""
+    def parse(self, raw_text: str) -> ParseResult:
+        bare = raw_text.strip().strip('"').strip("'")
+        if bare in {"inspect_code", "verify_security_fix", "submit_security_fix"}:
+            action = UnifiedIncidentAction(action_type=bare)
+            return ParseResult(
+                parse_status="repaired",
+                cleaned_action=_compact_action(action),
+                repair_labels=["bare_action_wrapped"],
+            )
+        try:
+            data = json.loads(_extract_json_text(raw_text))
+        except Exception as exc:
+            return ParseResult(parse_status="invalid_json", error=type(exc).__name__)
+        if not isinstance(data, dict):
+            return ParseResult(parse_status="invalid_action", error="root must be object")
+        repaired_labels: list[str] = []
+        cleaned: dict[str, Any] = {k: v for k, v in data.items() if k in _ALLOWED_KEYS}
+        repaired = cleaned != data
+        if repaired:
+            repaired_labels.append("extra_keys_stripped")
+        if "action_type" not in cleaned and isinstance(data.get("action"), str):
+            if data["action"] in _KNOWN_ACTIONS:
+                cleaned["action_type"] = data["action"]
+                repaired = True
+                repaired_labels.append("action_alias_normalized")
+        if (
+            "vulnerability_type" not in cleaned
+            and isinstance(data.get("vulnerability"), str)
+        ):
+            cleaned["vulnerability_type"] = data["vulnerability"]
+            repaired = True
+            repaired_labels.append("vulnerability_alias_normalized")
+        metrics_value = data.get("metrics")
+        if "metric" not in cleaned and isinstance(metrics_value, list) and len(metrics_value) == 1:
+            cleaned["metric"] = metrics_value[0]
+            repaired = True
+            repaired_labels.append("metric_list_normalized")
+        if "metrics" in data and (
+            not isinstance(metrics_value, list) or len(metrics_value) != 1
+        ):
+            return ParseResult(
+                parse_status="invalid_action",
+                error="metrics alias is ambiguous",
+                repair_labels=repaired_labels,
+            )
+        try:
+            action = UnifiedIncidentAction(**cleaned)
+        except Exception as exc:
+            return ParseResult(
+                parse_status="invalid_action",
+                error=str(exc),
+                repair_labels=repaired_labels,
+            )
+        return ParseResult(
+            parse_status="repaired" if repaired else "ok",
+            cleaned_action=_compact_action(action),
+            repair_labels=repaired_labels,
+        )
+class LenientActionAdapter:
+    """Training-time parser that repairs small schema mistakes only."""
+    def parse(self, raw_text: str) -> ParseResult:
+        bare = raw_text.strip().strip('"').strip("'")
+        if bare in _KNOWN_ACTIONS:
+            try:
+                action = UnifiedIncidentAction(action_type=bare)
+            except Exception as exc:
+                return ParseResult(
+                    parse_status="invalid_action",
+                    error=str(exc),
+                    repair_labels=["bare_action_wrapped"],
+                )
+            return ParseResult(
+                parse_status="repaired",
+                cleaned_action=_compact_action(action),
+                repair_labels=["bare_action_wrapped"],
+            )
+        try:
+            data = json.loads(_extract_json_text(raw_text))
+        except Exception as exc:
+            return ParseResult(parse_status="invalid_json", error=type(exc).__name__)
+        if not isinstance(data, dict):
+            return ParseResult(parse_status="invalid_action", error="root must be object")
+        repaired_labels: list[str] = []
+        cleaned: dict[str, Any] = {k: v for k, v in data.items() if k in _ALLOWED_KEYS}
+        repaired = cleaned != data
+        if repaired:
+            repaired_labels.append("extra_keys_stripped")
+        if "action_type" not in cleaned and isinstance(data.get("action"), str):
+            if data["action"] in _KNOWN_ACTIONS:
+                cleaned["action_type"] = data["action"]
+                repaired = True
+                repaired_labels.append("action_alias_normalized")
+        if (
+            "vulnerability_type" not in cleaned
+            and isinstance(data.get("vulnerability"), str)
+        ):
+            cleaned["vulnerability_type"] = data["vulnerability"]
+            repaired = True
+            repaired_labels.append("vulnerability_alias_normalized")
+        metrics_value = data.get("metrics")
+        if "metric" not in cleaned and isinstance(metrics_value, list) and len(metrics_value) == 1:
+            cleaned["metric"] = metrics_value[0]
+            repaired = True
+            repaired_labels.append("metric_list_normalized")
+        if "metrics" in data and (
+            not isinstance(metrics_value, list) or len(metrics_value) != 1
+        ):
+            return ParseResult(
+                parse_status="invalid_action",
+                error="metrics alias is ambiguous",
+                repair_labels=repaired_labels,
+            )
+        try:
+            action = UnifiedIncidentAction(**cleaned)
+        except Exception as exc:
+            return ParseResult(
+                parse_status="invalid_action",
+                error=str(exc),
+                repair_labels=repaired_labels,
+            )
+        return ParseResult(
+            parse_status="repaired" if repaired else "ok",
+            cleaned_action=_compact_action(action),
+            repair_labels=repaired_labels,
+        )
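Both parsers in `action_adapter.py` share two steps: pull a JSON object out of possibly fenced or chatty model output, then normalize a few field aliases (`action` to `action_type`, a one-item `metrics` list to `metric`). The sketch below replays those steps on plain dictionaries so they can be tried without the Pydantic models. It is a simplified illustration: `normalize_aliases` is a hypothetical name, its allowed-key set is shortened, it skips the `_KNOWN_ACTIONS` membership check, and the `cpu` metric is made up.

```python
import json
from typing import Any, Dict

FENCE = "`" * 3  # built at runtime so the literal fence never appears in this example


def extract_json_text(raw_text: str) -> str:
    """Same steps as _extract_json_text: unwrap a fence, then trim to the outer braces."""
    text = raw_text.strip()
    if FENCE in text:
        parts = text.split(FENCE)
        if len(parts) >= 2:
            text = parts[1]
            if text.startswith("json"):
                text = text[4:]
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end != -1 and start < end:
        text = text[start : end + 1]
    return text.strip()


def normalize_aliases(data: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the `action` and single-item `metrics` alias repairs to a plain dict."""
    allowed = {"action_type", "service", "metric"}  # shortened for the example
    cleaned = {k: v for k, v in data.items() if k in allowed}
    if "action_type" not in cleaned and isinstance(data.get("action"), str):
        cleaned["action_type"] = data["action"]
    metrics = data.get("metrics")
    if "metric" not in cleaned and isinstance(metrics, list) and len(metrics) == 1:
        cleaned["metric"] = metrics[0]
    return cleaned


fenced_reply = (
    "Here you go:\n"
    + FENCE
    + 'json\n{"action": "query_metrics", "service": "database", "metrics": ["cpu"]}\n'
    + FENCE
)
repaired = normalize_aliases(json.loads(extract_json_text(fenced_reply)))
```

Note that only the first fenced segment is considered, so a reply containing several fenced blocks is parsed from its first one.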

unified_incident_env/trainer/analyze_failures.py ADDED

	@@ -0,0 +1,251 @@

+"""Failure analysis for episode trajectories."""
+from __future__ import annotations
+from collections import Counter
+from .types import EpisodeRecord, FailureAnalysisReport, FailureBucketEntry, StepRecord
+_INFRA_ACTIONS = {"restart_service", "rollback_deploy"}
+def analyze_episode(record: EpisodeRecord) -> FailureAnalysisReport:
+    """Classify one episode into schema, policy, looping, and reasoning buckets."""
+    entries: list[FailureBucketEntry] = []
+    for step in record.step_records:
+        entries.extend(_classify_step(record, step))
+    entries.extend(_classify_episode_level(record))
+    schema = sorted({entry.failure_type for entry in entries if entry.bucket == "schema"})
+    policy = sorted({entry.failure_type for entry in entries if entry.bucket == "policy"})
+    looping = sorted({entry.failure_type for entry in entries if entry.bucket == "looping"})
+    reasoning = sorted({entry.failure_type for entry in entries if entry.bucket == "reasoning"})
+    summary = Counter(entry.bucket for entry in entries)
+    return FailureAnalysisReport(
+        episode_ids=[record.episode_id or 0],
+        scenario_ids=[record.scenario_id],
+        entries=entries,
+        schema_failures=schema,
+        policy_failures=policy,
+        looping_failures=looping,
+        reasoning_failures=reasoning,
+        summary={
+            "schema": summary.get("schema", 0),
+            "policy": summary.get("policy", 0),
+            "looping": summary.get("looping", 0),
+            "reasoning": summary.get("reasoning", 0),
+        },
+    )
+def analyze_block(records: list[EpisodeRecord]) -> FailureAnalysisReport:
+    """Combine multiple episode analyses into one block report."""
+    analyses = [analyze_episode(record) for record in records]
+    entries = [entry for analysis in analyses for entry in analysis.entries]
+    summary = Counter(entry.bucket for entry in entries)
+    return FailureAnalysisReport(
+        episode_ids=[record.episode_id or 0 for record in records],
+        scenario_ids=[record.scenario_id for record in records],
+        entries=entries,
+        schema_failures=sorted({entry.failure_type for entry in entries if entry.bucket == "schema"}),
+        policy_failures=sorted({entry.failure_type for entry in entries if entry.bucket == "policy"}),
+        looping_failures=sorted({entry.failure_type for entry in entries if entry.bucket == "looping"}),
+        reasoning_failures=sorted({entry.failure_type for entry in entries if entry.bucket == "reasoning"}),
+        summary={
+            "schema": summary.get("schema", 0),
+            "policy": summary.get("policy", 0),
+            "looping": summary.get("looping", 0),
+            "reasoning": summary.get("reasoning", 0),
+        },
+    )
+def _classify_step(record: EpisodeRecord, step: StepRecord) -> list[FailureBucketEntry]:
+    entries: list[FailureBucketEntry] = []
+    if step.parse_status in {"invalid_json", "invalid_action"}:
+        entries.append(
+            FailureBucketEntry(
+                episode_id=record.episode_id or 0,
+                scenario_id=record.scenario_id,
+                step_index=step.step_index,
+                bucket="schema",
+                failure_type=_schema_failure_type(step),
+                detail=step.failure_reason or "schema failure",
+            )
+        )
+        return entries
+    student = step.cleaned_action or {}
+    teacher = step.teacher_action or {}
+    if not teacher or not student or student == teacher:
+        return entries
+    student_type = student.get("action_type")
+    teacher_type = teacher.get("action_type")
+    if student_type == "classify_vulnerability":
+        failure_type = (
+            "wrong_vulnerability"
+            if teacher_type == "classify_vulnerability"
+            else "fails_to_identify_real_vulnerability"
+        )
+        entries.append(
+            FailureBucketEntry(
+                episode_id=record.episode_id or 0,
+                scenario_id=record.scenario_id,
+                step_index=step.step_index,
+                bucket="reasoning",
+                failure_type=failure_type,
+                detail=f"student={student} teacher={teacher}",
+            )
+        )
+        return entries
+    if student_type == "apply_patch" and teacher_type == "apply_patch":
+        entries.append(
+            FailureBucketEntry(
+                episode_id=record.episode_id or 0,
+                scenario_id=record.scenario_id,
+                step_index=step.step_index,
+                bucket="policy",
+                failure_type="wrong_patch",
+                detail=f"student={student} teacher={teacher}",
+            )
+        )
+        return entries
+    if student_type == "verify_security_fix" and teacher_type != "verify_security_fix":
+        entries.append(
+            FailureBucketEntry(
+                episode_id=record.episode_id or 0,
+                scenario_id=record.scenario_id,
+                step_index=step.step_index,
+                bucket="policy",
+                failure_type="verify_too_early",
+                detail=f"student={student} teacher={teacher}",
+            )
+        )
+        return entries
+    if student_type == "submit_security_fix" and teacher_type != "submit_security_fix":
+        entries.append(
+            FailureBucketEntry(
+                episode_id=record.episode_id or 0,
+                scenario_id=record.scenario_id,
+                step_index=step.step_index,
+                bucket="policy",
+                failure_type="submit_too_early",
+                detail=f"student={student} teacher={teacher}",
+            )
+        )
+        return entries
+    if student_type in _INFRA_ACTIONS and teacher_type not in _INFRA_ACTIONS:
+        entries.append(
+            FailureBucketEntry(
+                episode_id=record.episode_id or 0,
+                scenario_id=record.scenario_id,
+                step_index=step.step_index,
+                bucket="policy",
+                failure_type="infra_before_security",
+                detail=f"student={student} teacher={teacher}",
+            )
+        )
+        return entries
+    if student_type in _INFRA_ACTIONS and teacher_type in _INFRA_ACTIONS:
+        failure_type = "wrong_service"
+        if student_type == "restart_service":
+            failure_type = "wrong_restart"
+        elif student_type == "rollback_deploy":
+            failure_type = "wrong_rollback"
+        entries.append(
+            FailureBucketEntry(
+                episode_id=record.episode_id or 0,
+                scenario_id=record.scenario_id,
+                step_index=step.step_index,
+                bucket="policy",
+                failure_type=failure_type,
+                detail=f"student={student} teacher={teacher}",
+            )
+        )
+        return entries
+    entries.append(
+        FailureBucketEntry(
+            episode_id=record.episode_id or 0,
+            scenario_id=record.scenario_id,
+            step_index=step.step_index,
+            bucket="policy",
+            failure_type="wrong_action_choice",
+            detail=f"student={student} teacher={teacher}",
+        )
+    )
+    return entries
+def _classify_episode_level(record: EpisodeRecord) -> list[FailureBucketEntry]:
+    entries: list[FailureBucketEntry] = []
+    previous = None
+    repeat_count = 0
+    for step in record.step_records:
+        current = step.cleaned_action
+        if current and current == previous:
+            repeat_count += 1
+            if repeat_count >= 1:  # always true after the increment, so every consecutive repeat is flagged
+                entries.append(
+                    FailureBucketEntry(
+                        episode_id=record.episode_id or 0,
+                        scenario_id=record.scenario_id,
+                        step_index=step.step_index,
+                        bucket="looping",
+                        failure_type="repeated_same_action",
+                        detail=f"action={current}",
+                    )
+                )
+        else:
+            repeat_count = 0
+        previous = current
+    stopped = record.stopped_reason or ""
+    if stopped in {"diagnosis", "root_cause_analysis"}:
+        entries.append(
+            FailureBucketEntry(
+                episode_id=record.episode_id or 0,
+                scenario_id=record.scenario_id,
+                step_index=None,
+                bucket="looping",
+                failure_type="stuck_in_diagnosis",
+                detail=f"stopped_reason={stopped}",
+            )
+        )
+    elif stopped == "security_subquest":
+        entries.append(
+            FailureBucketEntry(
+                episode_id=record.episode_id or 0,
+                scenario_id=record.scenario_id,
+                step_index=None,
+                bucket="looping",
+                failure_type="stuck_in_security_subquest",
+                detail=f"stopped_reason={stopped}",
+            )
+        )
+    return entries
+def _schema_failure_type(step: StepRecord) -> str:
+    raw = step.raw_model_output.lower()
+    error = (step.failure_reason or "").lower()
+    if '"reason"' in raw or '"details"' in raw or "extra_forbidden" in error:
+        return "extra_unsupported_fields"
+    if '"services"' in raw or '"metrics"' in raw or "field required" in error:
+        return "wrong_field_names"
+    if "required" in error or "missing" in error:
+        return "missing_required_fields"
+    if step.parse_status == "invalid_json":
+        return "invalid_json"
+    return "invalid_action"
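The looping bucket in `_classify_episode_level` flags any step whose cleaned action equals the previous step's cleaned action, and it ignores steps with no cleaned action. A minimal sketch of that comparison on plain dictionaries follows. `repeated_step_indices` is a hypothetical helper, and it reports 1-based positions in the list rather than the `step_index` values stored on `StepRecord`.

```python
from typing import Any, Dict, List, Optional

Action = Optional[Dict[str, Any]]


def repeated_step_indices(actions: List[Action]) -> List[int]:
    """Return 1-based positions where an action repeats the one just before it."""
    repeated: List[int] = []
    previous: Action = None
    for index, current in enumerate(actions, start=1):
        # A missing or empty action is falsy, so unparsed steps never count as repeats.
        if current and current == previous:
            repeated.append(index)
        previous = current
    return repeated


query = {"action_type": "query_logs", "service": "worker"}
restart = {"action_type": "restart_service", "service": "worker"}
flagged = repeated_step_indices([query, query, restart, None, None, restart, restart])
```

Because `previous` is updated even for missing actions, an unparsed step between two identical actions breaks the streak.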

unified_incident_env/trainer/backend.py ADDED

	@@ -0,0 +1,165 @@

+"""Backend interfaces for model calls."""
+from __future__ import annotations
+import json
+import time
+from typing import Protocol
+from urllib.parse import urlparse
+from openai import OpenAI
+from .types import ModelRequest, ModelResponse
+class ModelBackend(Protocol):
+    """Minimal backend protocol for trainer use."""
+    def complete(self, request: ModelRequest) -> ModelResponse:
+        """Return raw model text and metadata for one request."""
+class OpenAICompatibleBackend:
+    """OpenAI-compatible backend, suitable for Ollama and similar servers."""
+    def __init__(
+        self,
+        *,
+        base_url: str,
+        api_key: str,
+        timeout_s: float = 90.0,
+    ) -> None:
+        self.base_url = base_url
+        self._client = OpenAI(api_key=api_key, base_url=base_url, timeout=timeout_s)
+    def complete(self, request: ModelRequest) -> ModelResponse:
+        started = time.perf_counter()
+        create_kwargs = {
+            "model": request.model_name,
+            "temperature": request.temperature,
+            "max_tokens": request.max_tokens,
+            "messages": [
+                {"role": "system", "content": request.system_prompt},
+                {"role": "user", "content": request.user_prompt},
+            ],
+        }
+        raw_text = ""
+        actual_mode = request.structured_mode
+        if request.structured_mode == "backend_adaptive":
+            if self._is_ollama():
+                actual_mode = "response_format_json"
+            else:
+                actual_mode = "tool_calling"
+        try:
+            if actual_mode == "tool_calling":
+                tool_choice = request.tool_choice or {
+                    "type": "function",
+                    "function": {"name": "emit_action"},
+                }
+                create_kwargs["tools"] = request.tools or [
+                    self._tool_from_response_format(request.response_format)
+                ]
+                create_kwargs["tool_choice"] = tool_choice
+                response = self._client.chat.completions.create(**create_kwargs)
+                raw_text = self._extract_tool_text(response)
+            elif actual_mode == "response_format_json":
+                if self._is_ollama():
+                    create_kwargs["extra_body"] = {
+                        "format": self._ollama_format_payload(request.response_format)
+                    }
+                else:
+                    create_kwargs["response_format"] = request.response_format or {
+                        "type": "json_object"
+                    }
+                response = self._client.chat.completions.create(**create_kwargs)
+                raw_text = response.choices[0].message.content or ""
+            else:
+                response = self._client.chat.completions.create(**create_kwargs)
+                raw_text = response.choices[0].message.content or ""
+        except Exception:
+            if request.structured_mode == "backend_adaptive" and actual_mode == "tool_calling":
+                fallback_kwargs = dict(create_kwargs)
+                if "tools" in fallback_kwargs:
+                    del fallback_kwargs["tools"]
+                if "tool_choice" in fallback_kwargs:
+                    del fallback_kwargs["tool_choice"]
+                if self._is_ollama():
+                    fallback_kwargs["extra_body"] = {
+                        "format": self._ollama_format_payload(request.response_format)
+                    }
+                else:
+                    fallback_kwargs["response_format"] = request.response_format or {
+                        "type": "json_object"
+                    }
+                response = self._client.chat.completions.create(**fallback_kwargs)
+                raw_text = response.choices[0].message.content or ""
+                actual_mode = "response_format_json"
+            else:
+                raise
+        elapsed = time.perf_counter() - started
+        return ModelResponse(
+            raw_text=raw_text,
+            latency_s=round(elapsed, 4),
+            metadata={
+                "model": request.model_name,
+                "structured_mode": actual_mode,
+            },
+        )
+    def _is_ollama(self) -> bool:
+        parsed = urlparse(self.base_url)
+        host = parsed.netloc.lower()
+        return "11434" in host or "ollama" in host or "127.0.0.1" in host or "localhost" in host
+    def _ollama_format_payload(
+        self,
+        response_format: dict[str, object] | None,
+    ) -> object:
+        if response_format and response_format.get("type") == "json_schema":
+            json_schema = response_format.get("json_schema", {})
+            if isinstance(json_schema, dict):
+                return json_schema.get("schema", "json")
+        return "json"
+    def _tool_from_response_format(
+        self,
+        response_format: dict[str, object] | None,
+    ) -> dict[str, object]:
+        schema = {
+            "type": "object",
+            "properties": {"action_type": {"type": "string"}},
+            "required": ["action_type"],
+            "additionalProperties": False,
+        }
+        if response_format and response_format.get("type") == "json_schema":
+            json_schema = response_format.get("json_schema", {})
+            if isinstance(json_schema, dict):
+                schema = json_schema.get("schema", schema)
+        return {
+            "type": "function",
+            "function": {
+                "name": "emit_action",
+                "description": "Emit exactly one structured environment action.",
+                "parameters": schema,
+            },
+        }
+    def _extract_tool_text(self, response) -> str:
+        message = response.choices[0].message
+        tool_calls = getattr(message, "tool_calls", None) or []
+        if tool_calls:
+            function = getattr(tool_calls[0], "function", None)
+            if function is not None and getattr(function, "arguments", None):
+                return function.arguments
+        content = message.content or ""
+        if isinstance(content, list):
+            fragments = []
+            for item in content:
+                if isinstance(item, dict) and item.get("type") == "text":
+                    fragments.append(item.get("text", ""))
+            return "".join(fragments)
+        if isinstance(content, str):
+            return content
+        return json.dumps(content)
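In `backend_adaptive` mode the backend picks a structured-output strategy from the base URL alone: hosts that look like Ollama get `response_format_json`, everything else gets `tool_calling`. The sketch below isolates that decision so the heuristic can be examined without an OpenAI client. `looks_like_ollama` and `resolve_structured_mode` are hypothetical names that mirror `_is_ollama` and the mode selection in `complete`.

```python
from urllib.parse import urlparse


def looks_like_ollama(base_url: str) -> bool:
    # Substring match on host:port, the same markers _is_ollama checks.
    host = urlparse(base_url).netloc.lower()
    return any(marker in host for marker in ("11434", "ollama", "127.0.0.1", "localhost"))


def resolve_structured_mode(requested: str, base_url: str) -> str:
    # Explicit modes pass through untouched; only backend_adaptive is resolved.
    if requested != "backend_adaptive":
        return requested
    return "response_format_json" if looks_like_ollama(base_url) else "tool_calling"
```

One consequence of the substring match is that any server on `localhost` or `127.0.0.1` is treated as Ollama, including other OpenAI-compatible servers that happen to run locally on a different port.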

unified_incident_env/trainer/build_datasets.py ADDED

	@@ -0,0 +1,258 @@

+"""Build correction datasets from trajectories and failure analyses."""
+from __future__ import annotations
+import argparse
+from pathlib import Path
+from .build_sft_dataset import build_baseline_records
+from .trajectory_store import TrajectoryStore
+from .types import EpisodeRecord, FailureAnalysisReport, SFTRecord
+def build_schema_repair_records(
+    episodes: list[EpisodeRecord],
+    analyses: list[FailureAnalysisReport],
+) -> list[SFTRecord]:
+    rows: list[SFTRecord] = []
+    analysis_by_episode = {
+        analysis.episode_ids[0]: analysis for analysis in analyses if analysis.episode_ids
+    }
+    for episode in episodes:
+        analysis = analysis_by_episode.get(episode.episode_id or 0)
+        schema_types = set(analysis.schema_failures if analysis else [])
+        for step in episode.step_records:
+            if step.parse_status not in {"invalid_json", "invalid_action", "repaired", "teacher_override"}:
+                continue
+            if step.teacher_action is None:
+                continue
+            rows.append(
+                SFTRecord(
+                    source="schema_repair",
+                    scenario_id=episode.scenario_id,
+                    tick=step.tick,
+                    messages=[
+                        {"role": "system", "content": "Repair the action into strict JSON only."},
+                        {
+                            "role": "user",
+                            "content": (
+                                f"{step.prompt_text}\n\n"
+                                f"Previous invalid output:\n{step.raw_model_output}"
+                            ),
+                        },
+                    ],
+                    target_action=step.teacher_action,
+                    student_action=step.cleaned_action,
+                    parse_status=step.parse_status,
+                    tags=sorted(schema_types) or [step.parse_status],
+                    metadata={
+                        "episode_id": episode.episode_id,
+                        "step_index": step.step_index,
+                        "repair_retry_used": step.repair_retry_used,
+                        "teacher_override_used": step.teacher_override_used,
+                        "normalization_applied": step.normalization_applied,
+                        "failure_type": step.observation.get("failure_type"),
+                        "why_failed": step.observation.get("why_failed"),
+                        "loop_warning": step.observation.get("loop_warning"),
+                        "blocked_until_security_complete": step.observation.get("blocked_until_security_complete"),
+                        "security_unlock_reason": step.observation.get("security_unlock_reason"),
+                        "progress_flags": step.observation.get("progress_flags"),
+                    },
+                )
+            )
+    return rows
+def build_next_action_records(
+    episodes: list[EpisodeRecord],
+    analyses: list[FailureAnalysisReport],
+) -> list[SFTRecord]:
+    rows: list[SFTRecord] = []
+    episode_entries = {
+        analysis.episode_ids[0]: analysis.entries
+        for analysis in analyses
+        if analysis.episode_ids
+    }
+    allowed = {"policy", "reasoning", "looping"}
+    for episode in episodes:
+        entries = episode_entries.get(episode.episode_id or 0, [])
+        step_indices = {
+            entry.step_index
+            for entry in entries
+            if entry.bucket in allowed and entry.step_index is not None
+        }
+        for step in episode.step_records:
+            if step.step_index not in step_indices:
+                continue
+            if step.teacher_action is None:
+                continue
+            tags = [
+                entry.failure_type
+                for entry in entries
+                if entry.step_index == step.step_index and entry.bucket in allowed
+            ]
+            rows.append(
+                SFTRecord(
+                    source="next_action",
+                    scenario_id=episode.scenario_id,
+                    tick=step.tick,
+                    messages=[
+                        {"role": "system", "content": "Choose the best next action as strict JSON only."},
+                        {"role": "user", "content": step.prompt_text},
+                    ],
+                    target_action=step.teacher_action,
+                    student_action=step.cleaned_action,
+                    parse_status=step.parse_status,
+                    tags=sorted(set(tags)) or ["next_action"],
+                    metadata={
+                        "episode_id": episode.episode_id,
+                        "step_index": step.step_index,
+                        "workflow_stage": step.workflow_stage,
+                        "teacher_override_used": step.teacher_override_used,
+                        "failure_type": step.observation.get("failure_type"),
+                        "why_failed": step.observation.get("why_failed"),
+                        "loop_warning": step.observation.get("loop_warning"),
+                        "progress_flags": step.observation.get("progress_flags"),
+                    },
+                )
+            )
+    return rows
+def build_recovery_records(
+    episodes: list[EpisodeRecord],
+    analyses: list[FailureAnalysisReport],
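+) -> list[SFTRecord]:
+    """Build recovery rows: the next prompt plus the wrong action and its reward, targeting the teacher action."""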
+    rows: list[SFTRecord] = []
+    episode_entries = {
+        analysis.episode_ids[0]: analysis.entries
+        for analysis in analyses
+        if analysis.episode_ids
+    }
+    recovery_failures = {
+        "wrong_restart",
+        "wrong_rollback",
+        "wrong_service",
+        "wrong_patch",
+        "wrong_vulnerability",
+        "verify_too_early",
+        "submit_too_early",
+        "infra_before_security",
+        "repeated_same_action",
+    }
+    for episode in episodes:
+        entries = episode_entries.get(episode.episode_id or 0, [])
+        step_indices = {
+            entry.step_index
+            for entry in entries
+            if entry.failure_type in recovery_failures and entry.step_index is not None
+        }
+        for step in episode.step_records:
+            if step.step_index not in step_indices:
+                continue
+            if step.teacher_action is None or not step.next_prompt_text:
+                continue
+            tags = [
+                entry.failure_type
+                for entry in entries
+                if entry.step_index == step.step_index
+                and entry.failure_type in recovery_failures
+            ]
+            rows.append(
+                SFTRecord(
+                    source="recovery",
+                    scenario_id=episode.scenario_id,
+                    tick=step.tick,
+                    messages=[
+                        {"role": "system", "content": "Recover from the previous mistake. Return the best next strict JSON action only."},
+                        {
+                            "role": "user",
+                            "content": (
+                                f"{step.next_prompt_text}\n\n"
+                                f"Previous wrong action: {step.cleaned_action}\n"
+                                f"Penalty or result: reward={step.reward}"
+                            ),
+                        },
+                    ],
+                    target_action=step.teacher_action,
+                    student_action=step.cleaned_action,
+                    parse_status=step.parse_status,
+                    tags=sorted(set(tags)) or ["recovery"],
+                    metadata={
+                        "episode_id": episode.episode_id,
+                        "step_index": step.step_index,
+                        "teacher_override_used": step.teacher_override_used,
+                        "failure_type": step.observation.get("failure_type"),
+                        "why_failed": step.observation.get("why_failed"),
+                        "loop_warning": step.observation.get("loop_warning"),
+                        "best_recovery_action_family": step.observation.get("best_recovery_action_family"),
+                    },
+                )
+            )
+    return rows
+def combine_sft_records(
+    *,
+    baseline_records: list[SFTRecord],
+    schema_records: list[SFTRecord],
+    next_action_records: list[SFTRecord],
+    recovery_records: list[SFTRecord],
+) -> list[SFTRecord]:
+    return [
+        *baseline_records,
+        *schema_records,
+        *next_action_records,
+        *recovery_records,
+    ]
+def write_jsonl(records: list[SFTRecord], path: Path) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    with path.open("w", encoding="utf-8") as handle:
+        for record in records:
+            handle.write(record.model_dump_json())
+            handle.write("\n")
+def load_episodes(path: Path) -> list[EpisodeRecord]:
+    return TrajectoryStore(path).load_episodes()
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--episodes", default="outputs/trainer/episodes.jsonl")
+    parser.add_argument("--output-dir", required=True)
+    args = parser.parse_args()
+    output_dir = Path(args.output_dir)
+    episodes = load_episodes(Path(args.episodes))
+    from .analyze_failures import analyze_episode
+    analyses = [analyze_episode(episode) for episode in episodes]
+    baseline_records = build_baseline_records()
+    schema_records = build_schema_repair_records(episodes, analyses)
+    next_action_records = build_next_action_records(episodes, analyses)
+    recovery_records = build_recovery_records(episodes, analyses)
+    combined_records = combine_sft_records(
+        baseline_records=baseline_records,
+        schema_records=schema_records,
+        next_action_records=next_action_records,
+        recovery_records=recovery_records,
+    )
+    write_jsonl(baseline_records, output_dir / "baseline_teacher_dataset.jsonl")
+    write_jsonl(schema_records, output_dir / "schema_repair.jsonl")
+    write_jsonl(next_action_records, output_dir / "next_action.jsonl")
+    write_jsonl(recovery_records, output_dir / "recovery.jsonl")
+    write_jsonl(combined_records, output_dir / "sft_dataset.jsonl")
+    print(
+        f"wrote baseline={len(baseline_records)} schema={len(schema_records)} "
+        f"next_action={len(next_action_records)} recovery={len(recovery_records)} "
+        f"combined={len(combined_records)} to {output_dir}"
+    )
+if __name__ == "__main__":
+    main()
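
The two builders above (`build_next_action_records` and `build_recovery_records`) share one selection pattern: collect the step indices whose analysis entries pass a filter, then gather the distinct failure types for each step as tags. A minimal sketch of that pattern follows. The `Entry` dataclass and the `select_steps` helper are hypothetical stand-ins, not part of this commit; only the field names `step_index`, `bucket`, and `failure_type` come from the code above, and `invalid_json` is a made-up failure type used for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for one failure-analysis entry."""

    step_index: int | None
    bucket: str
    failure_type: str


def select_steps(entries: list[Entry], allowed: set[str]) -> dict[int, list[str]]:
    """Map each flagged step index to its sorted, de-duplicated failure tags."""
    tags_by_step: dict[int, set[str]] = {}
    for entry in entries:
        # Skip entries outside the allowed buckets or without a step index.
        if entry.bucket not in allowed or entry.step_index is None:
            continue
        tags_by_step.setdefault(entry.step_index, set()).add(entry.failure_type)
    return {index: sorted(tags) for index, tags in tags_by_step.items()}


entries = [
    Entry(2, "policy", "wrong_service"),
    Entry(2, "looping", "repeated_same_action"),
    Entry(3, "schema", "invalid_json"),  # bucket not allowed, skipped
    Entry(None, "policy", "wrong_patch"),  # no step index, skipped
]
selected = select_steps(entries, {"policy", "reasoning", "looping"})
print(selected)
```

Grouping tags in a single pass avoids rescanning `entries` for every step, which the list comprehension in the builders does; for short episodes either approach should be fine.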

unified_incident_env/trainer/build_sft_dataset.py ADDED

	@@ -0,0 +1,101 @@

+"""Build supervised JSONL datasets from baseline and replay trajectories."""
+from __future__ import annotations
+import argparse
+from pathlib import Path
+from ..scripts.baseline_agent import plan_for_scenario
+from ..server.challenge import SCENARIOS
+from ..server.environment import UnifiedIncidentEnvironment
+from .prompts import TRAINING_SYSTEM_PROMPT
+from .trajectory_store import TrajectoryStore
+from .types import SFTRecord
+def build_baseline_records() -> list[SFTRecord]:
+    """Run the scripted baseline plan in every scenario and record one teacher row per step."""
+    rows: list[SFTRecord] = []
+    for scenario_id in SCENARIOS:
+        env = UnifiedIncidentEnvironment()
+        obs = env.reset(scenario_id=scenario_id)
+        for step_index, action in enumerate(plan_for_scenario(scenario_id), start=1):
+            rows.append(
+                SFTRecord(
+                    source="baseline",
+                    scenario_id=scenario_id,
+                    tick=obs.tick_count,
+                    messages=[
+                        {"role": "system", "content": TRAINING_SYSTEM_PROMPT},
+                        {"role": "user", "content": obs.prompt_text},
+                    ],
+                    target_action=action.model_dump(exclude_none=True),
+                    tags=["teacher", f"step_{step_index}"],
+                )
+            )
+            obs = env.step(action)
+    return rows
+def build_replay_records(episodes_path: Path) -> list[SFTRecord]:
+    """Convert stored episodes into rows, skipping steps that have no teacher action."""
+    rows: list[SFTRecord] = []
+    for episode in TrajectoryStore(episodes_path).load_episodes():
+        for step in episode.step_records:
+            if step.teacher_action is None:
+                continue
+            tags = [episode.mode, step.parse_status]
+            if step.failure_reason:
+                tags.append("failure")
+            rows.append(
+                SFTRecord(
+                    source="replay",
+                    scenario_id=episode.scenario_id,
+                    tick=step.tick,
+                    messages=[
+                        {"role": "system", "content": TRAINING_SYSTEM_PROMPT},
+                        {"role": "user", "content": step.prompt_text},
+                    ],
+                    target_action=step.teacher_action,
+                    student_action=step.cleaned_action,
+                    parse_status=step.parse_status,
+                    tags=tags,
+                )
+            )
+    return rows
+def write_jsonl(records: list[SFTRecord], output_path: Path) -> None:
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    with output_path.open("w", encoding="utf-8") as handle:
+        for record in records:
+            handle.write(record.model_dump_json())
+            handle.write("\n")
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--source",
+        choices=["baseline", "replay", "combined"],
+        default="combined",
+    )
+    parser.add_argument(
+        "--episodes",
+        default="outputs/trainer/episodes.jsonl",
+    )
+    parser.add_argument(
+        "--output",
+        required=True,
+    )
+    args = parser.parse_args()
+    records: list[SFTRecord] = []
+    if args.source in {"baseline", "combined"}:
+        records.extend(build_baseline_records())
+    if args.source in {"replay", "combined"}:
+        records.extend(build_replay_records(Path(args.episodes)))
+    write_jsonl(records, Path(args.output))
+    print(f"wrote {len(records)} rows to {args.output}")
+if __name__ == "__main__":
+    main()
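
`write_jsonl` above emits one JSON object per line, so a dataset can be read back line by line. The sketch below shows that round trip with the standard library only. `Row`, `write_rows`, and `read_rows` are hypothetical names, `Row` is a trimmed stand-in for `SFTRecord` (the real model is serialized with `model_dump_json()`), and the scenario id and action payload are placeholders.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Row:
    """Hypothetical, trimmed stand-in for SFTRecord."""

    source: str
    scenario_id: str
    target_action: dict
    tags: list = field(default_factory=list)


def write_rows(rows: list, path: Path) -> None:
    """Write one JSON object per line, creating parent directories first."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(asdict(row)))
            handle.write("\n")


def read_rows(path: Path) -> list:
    """Parse each non-empty line back into a Row."""
    with path.open("r", encoding="utf-8") as handle:
        return [Row(**json.loads(line)) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "nested" / "sft_dataset.jsonl"
    rows = [
        Row("baseline", "demo_scenario", {"action_type": "noop"}, ["teacher", "step_1"]),
    ]
    write_rows(rows, out)
    loaded = read_rows(out)

print(loaded)
```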

unified_incident_env/trainer/collect_trajectory.py ADDED

	@@ -0,0 +1,53 @@

+"""Collection wrapper that turns one episode into trajectory + analysis + summary."""
+from __future__ import annotations
+from .analyze_failures import analyze_episode
+from .types import EpisodeRecord, EpisodeSummaryRecord
+def collect_episode(
+    *,
+    runner,
+    scenario_id: str,
+    episode_id: int,
+    mode: str,
+    model_version: str,
+) -> tuple[EpisodeRecord, EpisodeSummaryRecord, object]:
+    """Run, analyze, and summarize one episode."""
+    record = runner.run(
+        scenario_id=scenario_id,
+        mode=mode,
+        episode_id=episode_id,
+        model_version=model_version,
+    )
+    analysis = analyze_episode(record)
+    record.schema_failures = analysis.summary.get("schema", 0)
+    record.policy_failures = analysis.policy_failures
+    record.looping_failures = analysis.looping_failures
+    record.reasoning_failures = analysis.reasoning_failures
+    summary = EpisodeSummaryRecord(
+        episode_id=episode_id,
+        run_id=record.run_id,
+        scenario_id=record.scenario_id,
+        difficulty=record.difficulty,
+        model_name=record.model_name,
+        model_version=record.model_version,
+        mode=record.mode,
+        steps=record.steps,
+        success=record.success,
+        final_score=record.final_score,
+        schema_failures=analysis.summary.get("schema", 0),
+        json_valid_steps=record.json_valid_steps,
+        strict_schema_valid_steps=record.strict_schema_valid_steps,
+        teacher_override_count=record.teacher_override_count,
+        repair_retry_count=record.repair_retry_count,
+        policy_failures=analysis.policy_failures,
+        looping_failures=analysis.looping_failures,
+        reasoning_failures=analysis.reasoning_failures,
+        security_subquest_completed=record.security_subquest_completed,
+        postmortem_completed=record.postmortem_completed,
+        stopped_reason=record.stopped_reason,
+        elapsed_s=record.elapsed_s,
+    )
+    return record, summary, analysis

unified_incident_env/trainer/eval_models.py ADDED

	@@ -0,0 +1,145 @@

+"""Batch evaluation for one or more models."""
+from __future__ import annotations
+import argparse
+import json
+import os
+from pathlib import Path
+from ..server.challenge import SCENARIOS
+from .action_adapter import LenientActionAdapter, StrictActionParser
+from .backend import OpenAICompatibleBackend
+from .run_episode import EpisodeRunner
+from .trajectory_store import TrajectoryStore
+from .types import EvalScenarioResult, EvalSummary
+def summarize(results: list[EvalScenarioResult], mode: str) -> EvalSummary:
+    """Aggregate per-run results into overall, per-model, and per-scenario rates."""
+    success_rate = (
+        sum(1 for result in results if result.success) / len(results) if results else 0.0
+    )
+    avg_score = (
+        sum(result.final_score for result in results) / len(results) if results else 0.0
+    )
+    schema_failure_rate = (
+        sum(1 for result in results if result.schema_failure) / len(results)
+        if results
+        else 0.0
+    )
+    by_model: dict[str, dict[str, float]] = {}
+    by_scenario: dict[str, dict[str, float]] = {}
+    for result in results:
+        model_bucket = by_model.setdefault(
+            result.model_name,
+            {"runs": 0.0, "successes": 0.0, "score_sum": 0.0, "schema_failures": 0.0},
+        )
+        model_bucket["runs"] += 1
+        model_bucket["successes"] += 1.0 if result.success else 0.0
+        model_bucket["score_sum"] += result.final_score
+        model_bucket["schema_failures"] += 1.0 if result.schema_failure else 0.0
+        scenario_bucket = by_scenario.setdefault(
+            result.scenario_id,
+            {"runs": 0.0, "successes": 0.0, "score_sum": 0.0},
+        )
+        scenario_bucket["runs"] += 1
+        scenario_bucket["successes"] += 1.0 if result.success else 0.0
+        scenario_bucket["score_sum"] += result.final_score
+    for bucket in by_model.values():
+        runs = bucket["runs"] or 1.0
+        bucket["success_rate"] = round(bucket["successes"] / runs, 4)
+        bucket["avg_score"] = round(bucket["score_sum"] / runs, 4)
+        bucket["schema_failure_rate"] = round(bucket["schema_failures"] / runs, 4)
+        del bucket["score_sum"]
+        del bucket["successes"]
+        del bucket["schema_failures"]
+    for bucket in by_scenario.values():
+        runs = bucket["runs"] or 1.0
+        bucket["success_rate"] = round(bucket["successes"] / runs, 4)
+        bucket["avg_score"] = round(bucket["score_sum"] / runs, 4)
+        del bucket["score_sum"]
+        del bucket["successes"]
+    return EvalSummary(
+        mode=mode,
+        results=results,
+        success_rate=round(success_rate, 4),
+        avg_score=round(avg_score, 4),
+        schema_failure_rate=round(schema_failure_rate, 4),
+        by_model=by_model,
+        by_scenario=by_scenario,
+    )
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--models", nargs="+", required=True)
+    parser.add_argument("--mode", choices=["strict", "lenient"], default="strict")
+    parser.add_argument("--base-url", default="http://127.0.0.1:8000")
+    parser.add_argument(
+        "--api-base-url",
+        default=os.environ.get("API_BASE_URL", "http://127.0.0.1:11434/v1"),
+    )
+    parser.add_argument(
+        "--api-key",
+        default=os.environ.get("OPENAI_API_KEY") or os.environ.get("HF_TOKEN") or "local",
+    )
+    parser.add_argument(
+        "--output",
+        default=None,
+    )
+    parser.add_argument(
+        "--episodes-output",
+        default="outputs/trainer/episodes.jsonl",
+    )
+    args = parser.parse_args()
+    backend = OpenAICompatibleBackend(
+        base_url=args.api_base_url,
+        api_key=args.api_key,
+    )
+    parser_impl = StrictActionParser() if args.mode == "strict" else LenientActionAdapter()
+    episode_store = TrajectoryStore(Path(args.episodes_output))
+    results: list[EvalScenarioResult] = []
+    for model_name in args.models:
+        runner = EpisodeRunner(
+            backend=backend,
+            parser=parser_impl,
+            model_name=model_name,
+            base_url=args.base_url,
+        )
+        for scenario_id in SCENARIOS:
+            episode = runner.run(scenario_id=scenario_id, mode=args.mode)
+            episode_store.append_episode(episode)
+            results.append(
+                EvalScenarioResult(
+                    model_name=model_name,
+                    scenario_id=scenario_id,
+                    success=episode.success,
+                    final_score=episode.final_score,
+                    failure_reason=episode.failure_reason,
+                    schema_failure=bool(
+                        episode.failure_reason
+                        and episode.failure_reason.startswith("parse_failure")
+                    ),
+                    elapsed_s=episode.elapsed_s,
+                )
+            )
+    summary = summarize(results, mode=args.mode)
+    output_path = Path(
+        args.output
+        or f"outputs/trainer/{args.mode}_eval_summary.json"
+    )
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    output_path.write_text(summary.model_dump_json(indent=2), encoding="utf-8")
+    print(summary.model_dump_json(indent=2))
+if __name__ == "__main__":
+    main()
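
The per-model aggregation in `summarize` accumulates run counts, successes, and score sums, then converts them to rates rounded to four decimals. A small worked sketch of that arithmetic is below. `Result` and `per_model` are hypothetical names, `Result` is a trimmed stand-in for `EvalScenarioResult`, and the model names and scores are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical, trimmed stand-in for EvalScenarioResult."""

    model_name: str
    success: bool
    final_score: float


def per_model(results):
    """Return run count, success rate, and average score for each model."""
    buckets = defaultdict(lambda: {"runs": 0, "successes": 0, "score_sum": 0.0})
    for result in results:
        bucket = buckets[result.model_name]
        bucket["runs"] += 1
        bucket["successes"] += int(result.success)
        bucket["score_sum"] += result.final_score
    # A bucket only exists once a result was added, so runs is never zero here.
    return {
        name: {
            "runs": bucket["runs"],
            "success_rate": round(bucket["successes"] / bucket["runs"], 4),
            "avg_score": round(bucket["score_sum"] / bucket["runs"], 4),
        }
        for name, bucket in buckets.items()
    }


results = [
    Result("model-a", True, 0.9),
    Result("model-a", False, 0.3),
    Result("model-a", True, 0.6),
    Result("model-b", False, 0.0),
]
stats = per_model(results)
print(stats)
```

Here `model-a` succeeds in two of three runs, giving a success rate of 0.6667 and an average score of 0.6.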