Spaces:

Madhav189
/

SystemTruth

Running

dakshdoesdev Claude Opus 4.7 (1M context) commited on Apr 23

Commit

0bf41ea

1 Parent(s): c8bef53

Harden env + ship Claude skill, OpenClaw-RL shim, training pipeline

Env hardening (Stream 1):
- Rebalance grader into 7 dimensions with speed_bonus + noise_handling_score
so scripted baseline ceiling drops 0.99 -> 0.74, creating training headroom
- Surface noise_alerts, blast_radius, noise_queries on observation + state
- Extend ServiceName literal with the noise-service pool

Procedural scenarios:
- Each of the 3 hand-crafted templates spawns 4 jittered variants via seeded
RNG: metric noise, deploy timing, rotated noise-service selection
- 15 scenarios total (5 per difficulty), baseline-resolvable via template_id
dispatch in _baseline_actions

OpenClaw-RL integration shim:
- openclaw_integration/pool_server.py: FastAPI lease-based session server,
asyncio-locked per-lease, TTL reaper for idle cleanup
- openclaw_integration/sre_env_client.py: drop-in shape match with
terminal-rl/env_client.py
- README documents the one-line import patch for terminal-rl/generate.py

Claude Code skill (v0 pitch):
- skill/SKILL.md with investigation methodology + decoy ground truth
- skill/tools/sre_gym_client.py CLI: list / solve / interactive /
record-runbook
- skill/verified-runbooks/ seeded with clean traces of all 3 templates

Training pipeline:
- train/sanity_run.ipynb: Colab-ready Qwen3.5-4B (Qwen3-4B fallback) Unsloth
LoRA SFT dry-run, 200 toy steps, wandb
- train/collect_trajectories.py: parallel async harness with anthropic +
heuristic drivers, uses UnifiedIncidentEnv WebSocket client for state
persistence
- train/requirements-train.txt: pinned Unsloth + TRL + wandb + anthropic

Demo + deploy:
- demo/run_demo.sh + pitch.md: 60-second demo script, 3 solves + runbook
accumulation
- deploy/push_to_hf.sh: HF Space deploy helper (env vars: HF_TOKEN, HF_SPACE_ID)
- README rewritten to lead with the 30-second install + architecture diagram
- openenv.yaml: difficulties [easy, medium, hard]; space_id dakshdoesdev/sre-gym

Test suite: 21 -> 29 passing. openenv validate green. Live Space:
https://dakshdoesdev-sre-gym.hf.space

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (28) hide show

.gitignore +3 -0
.sisyphus/plans/reward-redesign.md +0 -609
README.md +95 -52
demo/pitch.md +49 -0
demo/run_demo.sh +70 -0
deploy/push_to_hf.sh +58 -0
openclaw_integration/README.md +80 -0
openclaw_integration/__init__.py +6 -0
openclaw_integration/generate_with_sre.py +27 -0
openclaw_integration/pool_server.py +273 -0
openclaw_integration/sre_env_client.py +159 -0
openenv.yaml +2 -2
skill/SKILL.md +100 -0
skill/tools/sre_gym_client.py +238 -0
skill/verified-runbooks/.gitkeep +0 -0
skill/verified-runbooks/db_config_rollout.md +23 -0
skill/verified-runbooks/gateway_auth_rollout.md +21 -0
skill/verified-runbooks/worker_deploy_cascade.md +23 -0
train/collect_trajectories.py +471 -0
train/requirements-train.txt +18 -0
train/sanity_run.ipynb +326 -0
unified_incident_env/models.py +22 -2
unified_incident_env/server/app.py +1 -1
unified_incident_env/server/challenge.py +670 -16
unified_incident_env/server/environment.py +148 -117
unified_incident_env/server/grader.py +88 -19
unified_incident_env/tests/test_environment.py +147 -4
uv.lock +0 -0

.gitignore CHANGED Viewed

@@ -7,3 +7,6 @@ learning_curve.png
 .codex/
 outputs/
 AGENTS.md

 .codex/
 outputs/
 AGENTS.md
+.sisyphus/
+*.egg-info/
+uv.lock

.sisyphus/plans/reward-redesign.md DELETED Viewed

@@ -1,609 +0,0 @@
-# Reward Redesign for Unified Incident Env
-## TL;DR
-> **Summary**: Replace breadcrumb-based rewards with a world-state-based reward system: normalized step cost, incident-health delta shaping, a tiny non-farmable hypothesis-quality bonus, and terminal bonuses/penalties tied to verified containment and recovery. Keep the public deterministic benchmark score separate from training reward, but remove breadcrumb terms from both.
-> **Deliverables**:
-> - Reworked training-time step reward in `unified_incident_env/server/environment.py`
-> - Reworked public deterministic score in `unified_incident_env/server/grader.py`
-> - Structured hypothesis payload on `classify_vulnerability`
-> - Scenario-authored critical-path service weights and reward config in `server/challenge.py`
-> - Updated prompts/inference/tests for the structured hypothesis contract
-> - Regression tests proving breadcrumb rewards are gone and world-improving actions dominate
-> **Effort**: Large
-> **Parallel**: YES - 4 waves
-> **Critical Path**: Task 1 → Task 2 → Task 3 → Task 6
-## Context
-### Original Request
-Redesign the reward system so points come from world improvement, step cost, and small calibrated hypothesis quality rather than from the environment revealing the “correct branch.” Keep the env compatible with Gymnasium/OpenEnv-style RL where every `step(action)` returns reward, but tie reward to state-transition quality rather than clue clicks.
-### Interview Summary
-- Reward must still be emitted on every step.
-- Investigation actions should mostly cost time and should not directly reward clue discovery.
-- Hypothesis actions can receive a small score for decision quality: root-cause accuracy, service localization, confidence calibration, and recommended next action quality.
-- Big rewards should remain tied to actual containment, verified recovery, and correct final resolution.
-- Reward shaping should follow the spirit of potential-based shaping: dense guidance via better state, not better clue collection.
-- Training can run on Colab/Kaggle; environment logic remains local.
-### Metis Review (gaps addressed)
-- Added a strict **reward whitelist** and **forbidden-source blacklist**.
-- Made hypothesis reward explicitly one-time and non-farmable.
-- Separated training reward from public deterministic benchmark score.
-- Normalized step costs by scenario budget to avoid punishing longer scenarios unfairly.
-- Added explicit regression checks for reward/public-score drift.
-- Resolved hidden ambiguity: reuse `classify_vulnerability` instead of introducing a new `submit_hypothesis` action.
-## Work Objectives
-### Core Objective
-Refactor the benchmark so the agent learns from state improvement and decision quality, not from authored breadcrumb rewards, while preserving a deterministic public evaluation contract.
-### Deliverables
-- `server/environment.py` returns step rewards based on:
-  - normalized step cost
-  - delta incident-health potential
-  - one-time hypothesis bonus/penalty
-  - terminal outcome bonus/penalty
-  - explicit unsafe/redundant action penalties
-- `server/grader.py` computes public `final_score` without rewarding evidence discovery, patch-id guessing, or stage progression by itself.
-- `server/challenge.py` contains per-scenario critical-path service weights and reward-config metadata.
-- `models.py` extends `classify_vulnerability` payload to carry hypothesis scoring fields.
-- `trainer/prompts.py` and `inference.py` understand the structured hypothesis payload.
-- Tests cover reward decomposition, non-farmable hypothesis scoring, and terminal correctness.
-### Definition of Done (verifiable conditions with commands)
-- `./.venv/bin/pytest unified_incident_env/tests -q` exits 0.
-- For a fixed scenario, a pure query action yields only step cost / redundancy effects, not positive breadcrumb reward.
-- For a fixed scenario, verified containment/recovery yields positive reward deltas.
-- Repeating the same hypothesis does not mint additional bonus.
-- Public deterministic score no longer uses `relevant_investigations` or any direct clue-count term.
-### Must Have
-- No direct positive reward for evidence discovery, unlock events, query success, patch-id selection, or stage advancement.
-- Incident-health potential derived only from verified/public world state.
-- `classify_vulnerability` supports structured hypothesis scoring with cause, services, confidence, and next action.
-- Training reward and public score are both documented and distinguishable.
-### Must NOT Have
-- No new `submit_hypothesis` action unless the existing `classify_vulnerability` path proves insufficient during implementation review.
-- No hidden proxy breadcrumb reward through internal fields like `matched_evidence_ids`, `unlock_threshold`, or `infra_progress`.
-- No reward mutation outside the actual returned `reward` from `step()`.
-- No acceptance criteria that depend on human eyeballing logs.
-## Verification Strategy
-> ZERO HUMAN INTERVENTION - all verification is agent-executed.
-- Test decision: tests-after with existing `pytest` suite plus new deterministic reward regression tests.
-- QA policy: every implementation task includes agent-executed assertions on reward sign/magnitude and action/schema behavior.
-- Evidence: `.sisyphus/evidence/task-{N}-{slug}.{ext}`
-## Execution Strategy
-### Parallel Execution Waves
-Wave 1: reward-model foundation and schema decisions
-- Task 1: define allowed/forbidden reward sources and scenario reward config
-- Task 2: extend action/state schema for structured hypotheses
-- Task 3: implement incident-health potential helpers
-Wave 2: core scoring rewrite
-- Task 4: replace step reward logic in environment
-- Task 5: replace public deterministic score breakdown
-- Task 6: update scenario metadata and authored weights
-Wave 3: contract consumers
-- Task 7: update prompts, response schema, and parser expectations
-- Task 8: update inference fallback/hypothesis generation
-- Task 9: update baseline/walkthrough/tests for new hypothesis payload
-Wave 4: regression and training-path hardening
-- Task 10: add reward decomposition/regression tests
-- Task 11: add reward/public-score drift checks for fixed scenarios
-- Task 12: document Colab/Kaggle GRPO usage against the new reward semantics
-### Dependency Matrix (full, all tasks)
-- Task 1 blocks Tasks 3, 4, 5, 6.
-- Task 2 blocks Tasks 7, 8, 9.
-- Task 3 blocks Task 4.
-- Task 4 blocks Tasks 10 and 11.
-- Task 5 blocks Task 11.
-- Task 6 blocks Task 4 and Task 5.
-- Task 7 blocks Task 8 and Task 9.
-- Task 8 blocks Task 12.
-- Task 9 blocks Task 10.
-- Tasks 10 and 11 block final verification wave.
-### Agent Dispatch Summary
-- Wave 1 → 3 tasks → deep / oracle-consulted / quick
-- Wave 2 → 3 tasks → deep / unspecified-high
-- Wave 3 → 3 tasks → quick / unspecified-high
-- Wave 4 → 3 tasks → quick / writing / unspecified-high
-## TODOs
-> Implementation + Test = ONE task. Never separate.
-> EVERY task MUST have: Agent Profile + Parallelization + QA Scenarios.
-- [ ] 1. Define reward whitelist, blacklist, and config schema
-  **What to do**: Add a single source of truth for reward terms in `server/challenge.py` or a nearby reward-config module. Define which signals are allowed to contribute to training reward and which are forbidden. Add per-scenario `critical_service_weights`, `step_cost_scale`, and hypothesis-bonus constants. Remove authored dependence on clue/evidence counts from the new reward path.
-  **Must NOT do**: Do not yet rewrite reward logic in `environment.py`; do not add a new action type.
-  **Recommended Agent Profile**:
-  - Category: `deep` - Reason: this is the architecture lock for all later reward logic.
-  - Skills: `[]` - no special skill required.
-  - Omitted: `[omarchy]` - unrelated domain.
-  **Parallelization**: Can Parallel: NO | Wave 1 | Blocks: 3,4,5,6 | Blocked By: none
-  **References**:
-  - Pattern: `unified_incident_env/server/challenge.py:96-156,284-345,486-546` - current evidence/unlock/verify metadata to replace or augment.
-  - Pattern: `unified_incident_env/server/environment.py:263-323` - current breadcrumb reward path.
-  - Pattern: `unified_incident_env/server/grader.py:73-128` - current public score terms.
-  - External: `https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf` - shaping must preserve the right objective.
-  - External: `https://github.com/Farama-Foundation/Gymnasium` - step rewards should reflect environment transition quality.
-  **Acceptance Criteria**:
-  - [ ] Reward config defines `critical_service_weights` summing to 1.0 for every scenario.
-  - [ ] Reward config explicitly lists forbidden reward sources: evidence discovery, clue unlock, patch-id correctness, stage advancement, query success.
-  - [ ] Existing scenario fixtures still load successfully.
-  **QA Scenarios**:
-  ```
-  Scenario: Reward config loads for all scenarios
-    Tool: Bash
-    Steps: Run a Python one-liner importing all scenarios and validating weight sums and required keys.
-    Expected: Exit 0; every scenario has complete reward config and valid normalized weights.
-    Evidence: .sisyphus/evidence/task-1-reward-config.txt
-  Scenario: Forbidden-source list is complete
-    Tool: Bash
-    Steps: Grep config and associated tests for all banned terms.
-    Expected: Forbidden-source entries exist and are asserted in tests.
-    Evidence: .sisyphus/evidence/task-1-reward-config-grep.txt
-  ```
-  **Commit**: YES | Message: `refactor(rewards): define shaping config and forbidden reward sources` | Files: `unified_incident_env/server/challenge.py`, nearby config module, tests
-- [ ] 2. Extend `classify_vulnerability` into a structured hypothesis commit
-  **What to do**: Modify `UnifiedIncidentAction` so `classify_vulnerability` carries a structured hypothesis payload: `vulnerability_type`, `affected_services`, `confidence`, and `recommended_next_action`. Update validators, observation/state mirrors if needed, and any schema-generation logic that relies on action fields.
-  **Must NOT do**: Do not add `submit_hypothesis`; do not break existing parsing for valid old payloads without an explicit migration path.
-  **Recommended Agent Profile**:
-  - Category: `unspecified-high` - Reason: touches schema, parser expectations, and compatibility.
-  - Skills: `[]`
-  - Omitted: `[omarchy]`
-  **Parallelization**: Can Parallel: YES | Wave 1 | Blocks: 7,8,9 | Blocked By: none
-  **References**:
-  - Pattern: `unified_incident_env/models.py:11-67` - current action schema.
-  - Pattern: `unified_incident_env/trainer/prompts.py:216-230,385-405` - required-field and example generation.
-  - Pattern: `unified_incident_env/tests/test_environment.py:333-345` - public action schema lock.
-  - Pattern: `unified_incident_env/tests/test_trainer.py:45-107` - parser behavior expectations.
-  **Acceptance Criteria**:
-  - [ ] `classify_vulnerability` requires the new structured fields.
-  - [ ] Existing explicit valid actions with complete fields parse successfully.
-  - [ ] Tests cover missing `confidence`, malformed `affected_services`, and invalid recommended action values.
-  **QA Scenarios**:
-  ```
-  Scenario: Structured hypothesis validates
-    Tool: Bash
-    Steps: Construct a valid classify_vulnerability action via Python and print model_dump.
-    Expected: Exit 0; payload includes all structured hypothesis fields.
-    Evidence: .sisyphus/evidence/task-2-hypothesis-valid.txt
-  Scenario: Invalid hypothesis is rejected
-    Tool: Bash
-    Steps: Construct invalid actions missing required hypothesis fields.
-    Expected: Validation raises deterministic errors.
-    Evidence: .sisyphus/evidence/task-2-hypothesis-invalid.txt
-  ```
-  **Commit**: YES | Message: `feat(schema): structure vulnerability classification as scored hypothesis` | Files: `unified_incident_env/models.py`, parsers, tests
-- [ ] 3. Implement incident-health potential helpers
-  **What to do**: Add helper functions in `server/environment.py` (or a sibling reward helper module) to compute `operational_health`, `security_health`, and `incident_health_potential` from public/verified state only. Use service-status values `healthy=1.0`, `degraded=0.4`, `crashed=0.0`, weighted by scenario-authored critical-path weights.
-  **Must NOT do**: Do not compute potential from evidence counters, stage names, recovery index, or hidden authored truth labels.
-  **Recommended Agent Profile**:
-  - Category: `quick` - Reason: local pure-function implementation once config is fixed.
-  - Skills: `[]`
-  - Omitted: `[omarchy]`
-  **Parallelization**: Can Parallel: NO | Wave 1 | Blocks: 4 | Blocked By: 1
-  **References**:
-  - Pattern: `unified_incident_env/server/environment.py:501-516,556-560,692-700,856-947`
-  - Pattern: `unified_incident_env/models.py:132-164,176-250`
-  - External: Ng/Harada/Russell shaping paper above.
-  **Acceptance Criteria**:
-  - [ ] Potential helpers are pure and deterministic.
-  - [ ] Potential increases when critical-path services improve.
-  - [ ] Potential does not change from evidence-only discoveries when service/security health stays the same.
-  **QA Scenarios**:
-  ```
-  Scenario: Potential rises on service recovery
-    Tool: Bash
-    Steps: Create before/after state fixtures with one critical service moving crashed -> healthy.
-    Expected: after_potential > before_potential.
-    Evidence: .sisyphus/evidence/task-3-potential-rise.txt
-  Scenario: Evidence-only change has no positive shaping
-    Tool: Bash
-    Steps: Compare states that differ only by evidence counters/unlock flags.
-    Expected: potential delta == 0.
-    Evidence: .sisyphus/evidence/task-3-potential-no-breadcrumb.txt
-  ```
-  **Commit**: YES | Message: `refactor(rewards): add incident-health potential helpers` | Files: `unified_incident_env/server/environment.py`, tests
-- [ ] 4. Rewrite environment step rewards around delta health + cost + penalties
-  **What to do**: Replace per-handler positive breadcrumb rewards with a single post-transition reward computation based on `gamma * Φ(s') - Φ(s)`, normalized step cost, tiny hypothesis bonus/penalty, and explicit unsafe/redundant-action surcharges. Ensure repeated-action penalties flow through returned `reward`, not hidden cumulative mutations.
-  **Must NOT do**: Do not keep direct `+0.05` query rewards, direct patch-id credit, or verify-button credit.
-  **Recommended Agent Profile**:
-  - Category: `deep` - Reason: central behavior change with many edge cases.
-  - Skills: `[]`
-  - Omitted: `[omarchy]`
-  **Parallelization**: Can Parallel: NO | Wave 2 | Blocks: 10,11 | Blocked By: 1,3,6
-  **References**:
-  - Pattern: `unified_incident_env/server/environment.py:103-177,263-323,325-554,569-601`
-  - Pattern: `unified_incident_env/tests/test_environment.py:205-232,307-330`
-  - Pattern: `unified_incident_env/server/challenge.py` reward-relevant scenario metadata after Task 1.
-  **Acceptance Criteria**:
-  - [ ] Query/evidence actions emit only step cost or redundancy penalty unless the underlying world state improves.
-  - [ ] Wrong/harmful actions emit negative reward.
-  - [ ] Verified service recovery and exploit containment emit positive reward due to state improvement.
-  - [ ] No hidden mutation adjusts cumulative reward independently of returned reward.
-  **QA Scenarios**:
-  ```
-  Scenario: Investigation no longer gives breadcrumb reward
-    Tool: Bash
-    Steps: Run a fixed scenario reset then a single query action that only reveals evidence.
-    Expected: reward <= 0, with no positive breadcrumb term.
-    Evidence: .sisyphus/evidence/task-4-no-query-reward.txt
-  Scenario: Verified recovery yields positive reward
-    Tool: Bash
-    Steps: Execute a known-good mitigation step that improves critical service health.
-    Expected: reward > 0 and health potential increases.
-    Evidence: .sisyphus/evidence/task-4-recovery-positive.txt
-  ```
-  **Commit**: YES | Message: `refactor(rewards): score steps by health delta and normalized costs` | Files: `unified_incident_env/server/environment.py`, tests
-- [ ] 5. Rewrite public deterministic score to remove breadcrumb terms
-  **What to do**: Update `server/grader.py` so `final_score` reflects verified operational recovery, verified security completion, efficiency, and postmortem quality without direct investigation-count or patch-id-guess terms. Preserve deterministic scoring/report shape.
-  **Must NOT do**: Do not make public score depend on hidden health potential internals or trainer-specific gamma.
-  **Recommended Agent Profile**:
-  - Category: `unspecified-high` - Reason: public benchmark semantics change.
-  - Skills: `[]`
-  - Omitted: `[omarchy]`
-  **Parallelization**: Can Parallel: YES | Wave 2 | Blocks: 11 | Blocked By: 1,6
-  **References**:
-  - Pattern: `unified_incident_env/server/grader.py:68-201`
-  - Pattern: `unified_incident_env/tests/test_environment.py:349-388`
-  **Acceptance Criteria**:
-  - [ ] `relevant_investigations` is no longer part of `infrastructure_score`.
-  - [ ] `selected_patch` or `selected_vulnerability` alone do not award public score before verification/completion.
-  - [ ] Existing report/check structure remains deterministic.
-  **QA Scenarios**:
-  ```
-  Scenario: Breadcrumb-only progress does not lift public score
-    Tool: Bash
-    Steps: Build a grader state with evidence collected but no verified containment/recovery.
-    Expected: score remains low and below resolved benchmark thresholds.
-    Evidence: .sisyphus/evidence/task-5-no-breadcrumb-public-score.txt
-  Scenario: Verified containment and recovery dominate score
-    Tool: Bash
-    Steps: Compare partial state vs fully recovered/verified state in grader.
-    Expected: fully recovered score > partial score.
-    Evidence: .sisyphus/evidence/task-5-public-score-compare.txt
-  ```
-  **Commit**: YES | Message: `refactor(grader): remove breadcrumb terms from public score` | Files: `unified_incident_env/server/grader.py`, tests
-- [ ] 6. Add scenario-authored reward metadata and critical-path weights
-  **What to do**: Extend each scenario in `server/challenge.py` with deterministic critical-path service weights and reward metadata used by Tasks 3–5. Ensure these weights are scenario-local and normalized.
-  **Must NOT do**: Do not infer weights dynamically from evidence or runtime guesses.
-  **Recommended Agent Profile**:
-  - Category: `quick`
-  - Skills: `[]`
-  - Omitted: `[omarchy]`
-  **Parallelization**: Can Parallel: YES | Wave 2 | Blocks: 4,5 | Blocked By: 1
-  **References**:
-  - Pattern: `unified_incident_env/server/challenge.py:96-199,284-403,486-610`
-  **Acceptance Criteria**:
-  - [ ] Every scenario includes valid reward metadata.
-  - [ ] Hard scenario weights emphasize worker/database path appropriately.
-  - [ ] Tests verify normalization and required keys.
-  **QA Scenarios**:
-  ```
-  Scenario: Scenario reward metadata validates
-    Tool: Bash
-    Steps: Import all scenarios and validate reward metadata shape.
-    Expected: Exit 0; all scenarios satisfy schema.
-    Evidence: .sisyphus/evidence/task-6-scenario-metadata.txt
-  Scenario: Weight normalization is enforced
-    Tool: Bash
-    Steps: Sum critical_service_weights for each scenario.
-    Expected: Each sum == 1.0 within tolerance.
-    Evidence: .sisyphus/evidence/task-6-weight-sums.txt
-  ```
-  **Commit**: YES | Message: `feat(challenge): add critical-path service weights for reward shaping` | Files: `unified_incident_env/server/challenge.py`, tests
-- [ ] 7. Update trainer prompt/schema generation for structured hypotheses
-  **What to do**: Update `trainer/prompts.py` and parser-adjacent tests so `classify_vulnerability` examples and required fields include `affected_services`, `confidence`, and `recommended_next_action`. Fix the verification-stage mismatch explicitly if still present after schema changes.
-  **Must NOT do**: Do not leak teacher actions into runtime prompts.
-  **Recommended Agent Profile**:
-  - Category: `quick`
-  - Skills: `[]`
-  - Omitted: `[omarchy]`
-  **Parallelization**: Can Parallel: YES | Wave 3 | Blocks: 8,9 | Blocked By: 2
-  **References**:
-  - Pattern: `unified_incident_env/trainer/prompts.py:96-148,385-434`
-  - Pattern: `unified_incident_env/tests/test_trainer.py:229-253`
-  **Acceptance Criteria**:
-  - [ ] Runtime prompt examples for `classify_vulnerability` include the structured hypothesis payload.
-  - [ ] `strict` and `lenient` behavior remain meaningfully distinct.
-  - [ ] Verification-stage action table is internally consistent across environment and prompt schema.
-  **QA Scenarios**:
-  ```
-  Scenario: Prompt shows structured hypothesis example
-    Tool: Bash
-    Steps: Build a runtime request in security_subquest stage.
-    Expected: User prompt contains hypothesis fields and valid JSON example.
-    Evidence: .sisyphus/evidence/task-7-prompt-hypothesis.txt
-  Scenario: Strict mode remains stricter
-    Tool: Bash
-    Steps: Compare strict and lenient runtime requests with correction memory text.
-    Expected: strict omits lenient correction hints.
-    Evidence: .sisyphus/evidence/task-7-strict-vs-lenient.txt
-  ```
-  **Commit**: YES | Message: `feat(trainer): prompt structured vulnerability hypotheses` | Files: `unified_incident_env/trainer/prompts.py`, tests
-- [ ] 8. Update inference fallback and schema handling for structured hypotheses
-  **What to do**: Update `inference.py` so structured hypothesis payloads are generated, parsed, and repaired consistently. Keep the already-fixed verification-failure fallback behavior intact.
-  **Must NOT do**: Do not reintroduce heuristic loops that bypass the new structured contract.
-  **Recommended Agent Profile**:
-  - Category: `unspecified-high`
-  - Skills: `[]`
-  - Omitted: `[omarchy]`
-  **Parallelization**: Can Parallel: YES | Wave 3 | Blocks: 12 | Blocked By: 2,7
-  **References**:
-  - Pattern: `inference.py:279-472,475-568,865-905,996-1094,1190-1241`
-  - Pattern: `unified_incident_env/tests/test_submission_inference.py:99-166,205-355`
-  **Acceptance Criteria**:
-  - [ ] Fallback classification outputs valid structured hypotheses.
-  - [ ] Repeated verification failures still return to patching.
-  - [ ] Submission inference tests cover malformed hypothesis payloads.
-  **QA Scenarios**:
-  ```
-  Scenario: Fallback builds structured hypothesis
-    Tool: Bash
-    Steps: Build fallback action in security_subquest before patching.
-    Expected: classify_vulnerability action includes services, confidence, and next action fields.
-    Evidence: .sisyphus/evidence/task-8-fallback-hypothesis.txt
-  Scenario: Verification failure still re-patches
-    Tool: Bash
-    Steps: Reproduce failed verification state.
-    Expected: narrowed actions and fallback choose apply_patch, not re-verify.
-    Evidence: .sisyphus/evidence/task-8-repatch-after-failed-verify.txt
-  ```
-  **Commit**: YES | Message: `feat(inference): emit structured hypotheses and preserve safe fallback` | Files: `inference.py`, tests
-- [ ] 9. Update baselines and walkthroughs for new hypothesis payload
-  **What to do**: Update `scripts/baseline_agent.py`, walkthroughs, and any deterministic sample flows so they emit the structured `classify_vulnerability` action. Keep exact scenario solutions intact.
-  **Must NOT do**: Do not alter scenario truth or recovery order here.
-  **Recommended Agent Profile**:
-  - Category: `quick`
-  - Skills: `[]`
-  - Omitted: `[omarchy]`
-  **Parallelization**: Can Parallel: YES | Wave 3 | Blocks: 10 | Blocked By: 2,7
-  **References**:
-  - Pattern: `unified_incident_env/scripts/baseline_agent.py`
-  - Pattern: `unified_incident_env/scripts/walkthrough.py`
-  - Pattern: `unified_incident_env/tests/test_environment.py` happy-path helpers.
-  **Acceptance Criteria**:
-  - [ ] Baseline agent still solves all three scenarios.
-  - [ ] Structured hypothesis payload appears in the baseline classify step.
-  **QA Scenarios**:
-  ```
-  Scenario: Baseline still solves preset pack
-    Tool: Bash
-    Steps: Run the baseline walkthrough or equivalent deterministic script/tests.
-    Expected: All scenarios resolve successfully.
-    Evidence: .sisyphus/evidence/task-9-baseline-solves.txt
-  Scenario: Baseline classify step is structured
-    Tool: Bash
-    Steps: Print the classify_vulnerability payload from the baseline plan.
-    Expected: Includes new hypothesis fields.
-    Evidence: .sisyphus/evidence/task-9-baseline-structured-hypothesis.txt
-  ```
-  **Commit**: YES | Message: `refactor(baseline): emit structured classification hypotheses` | Files: baseline/walkthrough/tests
-- [ ] 10. Add reward decomposition and anti-breadcrumb regression tests
-  **What to do**: Add deterministic environment tests proving query/evidence actions no longer receive positive breadcrumb rewards and that repeated hypotheses do not farm reward.
-  **Must NOT do**: Do not rely on broad “final score looks okay” assertions alone.
-  **Recommended Agent Profile**:
-  - Category: `unspecified-high`
-  - Skills: `[]`
-  - Omitted: `[omarchy]`
-  **Parallelization**: Can Parallel: YES | Wave 4 | Blocks: Final verification | Blocked By: 4,9
-  **References**:
-  - Pattern: `unified_incident_env/tests/test_environment.py:205-232,235-372`
-  **Acceptance Criteria**:
-  - [ ] Pure evidence gathering has no positive breadcrumb reward.
-  - [ ] Duplicate hypothesis submissions gain at most one bonus.
-  - [ ] Harmful actions are negative.
-  **QA Scenarios**:
-  ```
-  Scenario: Duplicate hypothesis bonus is one-time only
-    Tool: Bash
-    Steps: Submit same classify_vulnerability payload twice in a deterministic scenario.
-    Expected: First bonus sign as designed; second bonus == 0 or negative cost only.
-    Evidence: .sisyphus/evidence/task-10-hypothesis-dedupe.txt
-  Scenario: Evidence-only step is non-positive
-    Tool: Bash
-    Steps: Reset then perform one diagnostic query.
-    Expected: reward <= 0.
-    Evidence: .sisyphus/evidence/task-10-evidence-nonpositive.txt
-  ```
-  **Commit**: YES | Message: `test(rewards): add anti-breadcrumb and hypothesis-dedupe regressions` | Files: environment tests
-- [ ] 11. Add reward/public-score drift regression checks
-  **What to do**: Create fixed-scenario comparisons proving that policies improving training reward also improve or at least align with public deterministic score ordering. Compare bad, partial, and good trajectories.
-  **Must NOT do**: Do not require exact equality between training reward sums and final score.
-  **Recommended Agent Profile**:
-  - Category: `deep`
-  - Skills: `[]`
-  - Omitted: `[omarchy]`
-  **Parallelization**: Can Parallel: YES | Wave 4 | Blocks: Final verification | Blocked By: 4,5
-  **References**:
-  - Pattern: `unified_incident_env/server/grader.py`
-  - Pattern: `unified_incident_env/server/environment.py`
-  - Pattern: existing happy/trap path tests in `tests/test_environment.py`.
-  **Acceptance Criteria**:
-  - [ ] Good trajectory > partial trajectory > harmful trajectory in public score.
-  - [ ] Good trajectory accumulates better training reward than harmful trajectory.
-  - [ ] No scenario shows breadcrumb-only trajectories outranking true containment/recovery.
-  **QA Scenarios**:
-  ```
-  Scenario: Reward/public-score ordering aligns
-    Tool: Bash
-    Steps: Execute scripted bad, partial, and good trajectories for a fixed scenario.
-    Expected: reward/public-score ordering is monotonic in the desired direction.
-    Evidence: .sisyphus/evidence/task-11-ordering.txt
-  Scenario: Breadcrumb trajectory cannot win
-    Tool: Bash
-    Steps: Run a query-heavy but unrecovered trajectory.
-    Expected: Its public score and reward stay below a truly recovered trajectory.
-    Evidence: .sisyphus/evidence/task-11-no-breadcrumb-win.txt
-  ```
-  **Commit**: YES | Message: `test(rewards): add reward-vs-public-score ordering checks` | Files: environment/grader tests
-- [ ] 12. Document Colab/Kaggle GRPO usage with the new reward semantics
-  **What to do**: Update docs/runbooks so training happens on Colab/Kaggle while the environment runs locally or via Docker. Explain the separation between training reward and public deterministic benchmark score, and point to the exact verification commands.
-  **Must NOT do**: Do not leave the old reward explanation in README/execution docs.
-  **Recommended Agent Profile**:
-  - Category: `writing`
-  - Skills: `[]`
-  - Omitted: `[omarchy]`
-  **Parallelization**: Can Parallel: YES | Wave 4 | Blocks: Final verification | Blocked By: 8
-  **References**:
-  - Pattern: `README.md`, `execution.md`, any training docs in repo.
-  - External: `https://huggingface.co/docs/trl/en/openenv` - OpenEnv+TRL integration.
-  **Acceptance Criteria**:
-  - [ ] Docs explain training reward vs public score distinction.
-  - [ ] Docs list the exact local test commands.
-  - [ ] Docs specify Colab/Kaggle training and local/docker env execution.
-  **QA Scenarios**:
-  ```
-  Scenario: Docs mention reward/public-score split
-    Tool: Bash
-    Steps: Grep updated docs for training reward, public score, and verification commands.
-    Expected: All required topics present.
-    Evidence: .sisyphus/evidence/task-12-doc-grep.txt
-  Scenario: Docs commands are runnable
-    Tool: Bash
-    Steps: Execute at least one documented local verification command.
-    Expected: Exit 0.
-    Evidence: .sisyphus/evidence/task-12-doc-command.txt
-  ```
-  **Commit**: YES | Message: `docs(rewards): document shaping semantics and training workflow` | Files: docs/readme/runbooks
-## Final Verification Wave (MANDATORY — after ALL implementation tasks)
-> 4 review agents run in PARALLEL. ALL must APPROVE. Present consolidated results to user and get explicit "okay" before completing.
-> **Do NOT auto-proceed after verification. Wait for user's explicit approval before marking work complete.**
-> **Never mark F1-F4 as checked before getting user's okay.** Rejection or user feedback -> fix -> re-run -> present again -> wait for okay.
-- [ ] F1. Plan Compliance Audit — oracle
-- [ ] F2. Code Quality Review — unspecified-high
-- [ ] F3. Real Manual QA — unspecified-high (+ playwright if UI)
-- [ ] F4. Scope Fidelity Check — deep
-## Commit Strategy
-- Commit 1: reward config + scenario metadata
-- Commit 2: structured hypothesis schema
-- Commit 3: health potential helpers
-- Commit 4: environment reward rewrite
-- Commit 5: grader rewrite
-- Commit 6: prompt/inference/baseline contract updates
-- Commit 7: regression tests + docs
-## Success Criteria
-- Training reward is driven by world-state improvement, not breadcrumb discovery.
-- Public deterministic benchmark score no longer rewards evidence-count collection or raw patch-id guessing.
-- `classify_vulnerability` supports calibrated, non-farmable hypothesis scoring.
-- Query/evidence/unlock actions are not directly profitable.
-- Verified containment + verified recovery dominate both reward and public score ordering.
-- All tests and deterministic regression checks pass.

README.md CHANGED Viewed

@@ -1,74 +1,117 @@
-# SRE Engineer LLM (v2): The Honest SRE Simulator
-`sre-engineer-llm` is a high-fidelity Reinforcement Learning (RL) environment designed to train and evaluate AI agents on **Site Reliability Engineering (SRE)** and **Incident Response**.
-Unlike traditional "scripted" environments, this benchmark uses an honest, world-state-based simulation where agents must diagnose, mitigate, and resolve production outages without "cheating" through prompt oracles or hardcoded rails.
-## 🚀 Key Features
-- **Honest Simulation:** No stage-locks or hidden oracles. All actions are available at all times.
-- **State-Based Transitions:** Remediation actions (like `rollback_deploy` or `restart_service`) directly affect the health metrics of the simulated services.
-- **Verification Driven:** Agents must explicitly run health checks (`run_check`) to verify recovery before declaring an incident resolved.
-- **Realistic SRE Stack:** Includes queries for logs, metrics, dependencies, and deployment history across a microservices topology.
-- **Deterministic Grading:** A transparent scoring system based on final system health, user impact, and operational efficiency.
-## 🛠 Action Space
-The agent has access to 11 discrete SRE tools:
-| Action | Description |
-| :--- | :--- |
-| `query_logs` | Inspect service-level error logs and traces. |
-| `query_metrics` | Retrieve CPU, Memory, or Latency data. |
-| `query_dependencies` | Map upstream and downstream service links. |
-| `query_deploys` | Check the deployment history for recent changes. |
-| `rollback_deploy` | Revert a service to its previous stable version. |
-| `restart_service` | Reboot a crashed or degraded service. |
-| `isolate_service` | Cut traffic to a service to contain blast radius. |
-| `submit_hypothesis` | Record a calibrated guess of the root cause. |
-| `run_check` | Execute a health/verification check on the system. |
-| `declare_resolved` | Finalize the incident after recovery is verified. |
-| `escalate` | Request expert attention (no-op in simulation). |
-## 📁 Project Structure
-- `unified_incident_env/`
-  - `server/`: The FastAPI-based environment server.
-    - `environment.py`: Core simulator logic and world-state transitions.
-    - `challenge.py`: Scenario catalog and baseline definitions.
-    - `grader.py`: Deterministic scoring and reporting logic.
-  - `models.py`: Pydantic schemas for Actions, Observations, and State.
-  - `client.py`: Typed client for interacting with the environment.
-- `inference.py`: Standard entrypoint for LLM-based agent evaluation.
-- `run_demo.py`: End-to-end script to run the server and the baseline agent.
-## 🚦 Quick Start
-### 1. Install Dependencies
 ```bash
-uv venv
-source .venv/bin/activate
-uv pip install -e .
 ```
-### 2. Run the Benchmark Demo
-This script launches the local server and executes the optimal "baseline" trajectory:
 ```bash
-python run_demo.py
 ```
-### 3. Run Tests
 ```bash
-pytest unified_incident_env/tests -q
 ```
-## 📊 Scoring Breakdown
-Success is measured across four primary dimensions:
-1.  **Recovery (45%):** Is the end-to-end system healthy and the cause removed?
-2.  **Security/Mitigation (35%):** Was the correct remediation target identified and fixed?
-3.  **Efficiency (10%):** Did the agent solve the incident within the tick budget without wasteful actions?
-4.  **Verification (10%):** Were all health checks passed before resolution?
-## 📝 License
-This project is licensed under the MIT License.

+---
+title: SRE Gym
+emoji: 🚨
+colorFrom: red
+colorTo: yellow
+sdk: docker
+app_port: 8000
+pinned: false
+license: apache-2.0
+---
+# sre-gym — Fault-injecting SRE training env for OpenEnv
+Most SRE agent skills are runbooks and good intentions. **sre-gym** is the other half: a fault-injecting environment with deterministic grading where an agent diagnoses a real production-style incident, chooses a safe remediation, verifies recovery, and declares resolved. Every run is scored the same way twice.
+- Spec-compliant OpenEnv environment (typed Pydantic action / observation / state, `reset` / `step` / `state`, `openenv validate` green).
+- 3 curriculum scenarios — easy, medium, hard — with decoy services and causal dependencies.
+- 11 bounded actions. Honest state transitions. No hidden oracles.
+- 21 tests passing.
+- Ships a Claude Code skill + verified-runbook loop — successful solves write markdown runbooks that the next run reads back.
+## 30-second demo
+```bash
+./demo/run_demo.sh
+```
+Starts the env, solves each scenario cold, writes a runbook for each, re-solves to prove the loop. Full transcript takes ~10 seconds.
+## Curriculum
+| Difficulty | Scenario | Story | Decoy | Correct path |
+|---|---|---|---|---|
+| easy | `worker_deploy_cascade` | Bad worker deploy → DB crash-loop → login 502s | — | rollback worker → restart db → verify → resolve |
+| medium | `db_config_rollout` | DB config push shrank connection pool from 80→12 | recent worker deploy | rollback **db** → restart db → verify → resolve |
+| hard | `gateway_auth_rollout` | Gateway auth-middleware rollout rejects valid logins | recent worker deploy | rollback **gateway** → verify → resolve (no restart) |
+Rolling back the wrong service returns a negative reward and `failure_type="wrong_remediation_target"`. Restarting before the cause is removed re-inherits the bad state. `declare_resolved` is rejected until the scenario's resolution check passes against the actual world model.
+## Install
 ```bash
+# 1. Create a venv and install
+python3 -m venv .venv && source .venv/bin/activate
+pip install -e '.[dev]'
+# 2. Start the env
+uvicorn server.app:app --host 127.0.0.1 --port 8000
+# 3. Run the baseline inference against it
+export HF_TOKEN="…"; export ENV_BASE_URL=http://127.0.0.1:8000
+python inference.py
 ```
+## Install the Claude Code skill
 ```bash
+ln -s "$PWD/skill" "$HOME/.claude/skills/sre-gym"
+```
+Then, in Claude Code, ask: *"Solve the db_config_rollout scenario in sre-gym."* The skill will drive the env via `skill/tools/sre_gym_client.py`, load any existing runbook from `skill/verified-runbooks/`, and append a fresh runbook on any clean solve (score > 0.85).
+## Architecture
+```
+┌────────────────────┐      HTTP / WS       ┌──────────────────────┐
+│  Claude Code       │ ──────────────────▶ │  OpenEnv server       │
+│  (with sre-gym     │ ◀────────────────── │  (FastAPI, uvicorn)   │
+│   skill loaded)    │    obs, reward      │  unified_incident_env │
+└────────────────────┘                     └──────────────────────┘
+        │                                            ▲
+        ▼ on clean solve (score > 0.85)              │
+┌────────────────────┐                               │
+│ verified-runbooks/ │ ────── loaded at skill load ──┘
+│   *.md             │
+└────────────────────┘
 ```
+## Scoring
+Deterministic, 5 dimensions, sums to a public score in `[0.01, 0.99]`:
+- **Recovery** (0–0.4): critical-path services healthy
+- **Containment** (0–0.3): root cause removed or offending service isolated
+- **Verification** (0–0.35): `database_recovery` + `end_to_end` checks passed
+- **Impact** (0–0.15): user-impact reduced
+- **Efficiency** (0–0.10): budget preserved, no wasteful repeats
+Target **> 0.85** for "clean solve." That's also the runbook-record threshold.
+## Repo layout
+```
+unified_incident_env/    # env core: models, environment, grader, challenge, tests
+server/                  # OpenEnv entrypoint wrapper
+skill/                   # Claude Code skill: SKILL.md, tools/, verified-runbooks/
+demo/                    # run_demo.sh + pitch.md
+inference.py             # OpenAI-client baseline for OpenEnv hackathon submission
+openenv.yaml             # OpenEnv manifest
+Dockerfile               # HF Space deployment
+```
+## Verify
 ```bash
+pytest unified_incident_env/tests -q          # 21 tests
+python -m openenv.cli validate .              # OpenEnv manifest check
+docker build -t sre-engineer-llm:v2 .         # HF Space image
 ```
+## Roadmap — v2
+Distill the accumulated `verified-runbooks/` corpus into a local 3B reviewer via [OpenClaw-RL](https://github.com/Gen-Verse/OpenClaw-RL)'s async GRPO-on-next-state loop. Same reward contract (`run_check` passes / `failure_type` absent), same grader, but a compact policy that runs without a frontier API.
+## License
+Apache 2.0

demo/pitch.md ADDED Viewed

	@@ -0,0 +1,49 @@

+# sre-gym — 60-second pitch
+> You can't train SRE agents on production. We built the gym.
+## The story (00:00–01:00)
+**[0:00–0:10 · Hook]** "Most SRE agent skills are prompts — a runbook and a good intention. We built the other half: a fault-injecting environment with deterministic grading, where every run is scored the same way twice."
+**[0:10–0:25 · What it is]**
+- OpenEnv-compliant. `openenv validate` passes.
+- Three curriculum scenarios, easy → hard:
+  - **easy** `worker_deploy_cascade` — bad worker deploy cascades to a DB crash.
+  - **medium** `db_config_rollout` — DB config shrank the connection pool; a recent worker deploy is a decoy.
+  - **hard** `gateway_auth_rollout` — bad auth-middleware rollout; two plausible suspects, one right answer.
+- 11 bounded actions, honest state transitions (rolling back the wrong thing *fails*), deterministic grader across recovery / containment / verification / impact / efficiency.
+- 21 tests passing. One public Space URL.
+**[0:25–0:55 · Live demo]** `./demo/run_demo.sh`
+- Env starts. Three scenarios visible in `/tasks`.
+- Runbook dir cleared; demo starts cold.
+- Each scenario solves end-to-end (score ≈ 0.99, 8–10 steps).
+- A markdown runbook is written per scenario from the successful trace.
+- Re-solve the easy scenario — this time the skill loads the runbook first. Same score, same path, zero wasted investigation.
+- Point to `skill/verified-runbooks/` — "Every clean solve makes the next one deterministic. No GRPO required for v1."
+**[0:55–1:00 · Close]** "Install the skill by symlinking `skill/` into `~/.claude/skills/sre-gym`. Open source, Apache 2. v2 is the OpenClaw-RL loop — distill this corpus of verified runbooks into a local 3B reviewer."
+## The one technical claim you should be ready to defend
+> "The env is honest."
+- No hidden oracles. Rolling back the wrong service returns a negative reward and `failure_type="wrong_remediation_target"` — same observation contract as any other action.
+- `declare_resolved` is rejected until the scenario's `resolution_check` passes, verified by actual service states in the world model, not a flag the grader peeks at.
+- Rewards reward *effects*, not evidence-gathering — you can't farm the env by spamming `query_logs`.
+- `restart_service` on the database before the root cause is removed returns a negative reward. Always. Because in the real world, it would crash again.
+## Judge Q&A cheat sheet
+**"How is this different from running a real staging env?"**
+Deterministic scoring. Every agent gets graded against the same signatures, same decoys, same tick budget. You can't do that on real infra.
+**"Why only three scenarios?"**
+Three clears the hackathon DQ gate (`easy/medium/hard`). Each has a decoy + causal chain — building another one is a data-entry exercise, not a design one. Adding scenarios #4–#20 is the v2 data scaling lane.
+**"Why runbooks instead of GRPO?"**
+For this submission, GRPO means 48 hours of training convergence risk on top of an env we just shipped. Markdown runbooks demonstrate the same loop (verified signal → persisted artefact → next run improves) in an auditable form. The GRPO wiring slots on top of the same traces when we're ready.
+**"What's the skill actually doing at runtime?"**
+The skill lives in `skill/SKILL.md`. It directs Claude (or any agent) to read `verified-runbooks/{scenario}.md` before the first action, drive the env through `skill/tools/sre_gym_client.py`, and append a fresh runbook on any solve with `final_score > 0.85`.

demo/run_demo.sh ADDED Viewed

	@@ -0,0 +1,70 @@

+#!/usr/bin/env bash
+# sre-gym end-to-end demo.
+# Spins up the env (or reuses a running one), solves each of the 3 scenarios
+# with the baseline policy, records runbooks, shows the artefacts.
+#
+# Requires: python3.10+, docker (for the HF-Space-equivalent image) OR the
+# repo's .venv. Defaults to .venv if present.
+set -euo pipefail
+cd "$(dirname "$0")/.."
+PORT="${PORT:-8013}"
+URL="http://127.0.0.1:${PORT}"
+PY="${PYTHON:-.venv/bin/python}"
+RUNBOOK_DIR="skill/verified-runbooks"
+banner() { printf '\n\033[1;36m== %s ==\033[0m\n' "$*"; }
+ok()     { printf '\033[0;32m  ✓ %s\033[0m\n' "$*"; }
+banner "0 / preflight"
+if [[ ! -x "$PY" ]]; then
+  echo "  note: $PY not found, falling back to system python3" >&2
+  PY="python3"
+fi
+"$PY" -c "import unified_incident_env" 2>/dev/null || {
+  echo "  error: unified_incident_env not importable; run 'pip install -e .' first" >&2
+  exit 1
+}
+ok "python + package ready"
+banner "1 / start env"
+if curl -sf "$URL/health" > /dev/null 2>&1; then
+  ok "env already running on $URL"
+  SERVER_STARTED=0
+else
+  "$PY" -m uvicorn server.app:app --host 127.0.0.1 --port "$PORT" > /tmp/sre_gym_demo.log 2>&1 &
+  SERVER_PID=$!
+  SERVER_STARTED=1
+  for _ in $(seq 1 20); do
+    if curl -sf "$URL/health" > /dev/null 2>&1; then break; fi
+    sleep 0.3
+  done
+  curl -sf "$URL/health" > /dev/null || { echo "  error: env failed to start" >&2; cat /tmp/sre_gym_demo.log >&2; exit 1; }
+  ok "env started on $URL (pid $SERVER_PID)"
+fi
+trap '[[ ${SERVER_STARTED:-0} -eq 1 ]] && kill ${SERVER_PID:-0} 2>/dev/null || true' EXIT
+banner "2 / available scenarios"
+SRE_GYM_URL="$URL" "$PY" skill/tools/sre_gym_client.py list
+banner "3 / clear prior runbooks (demo starts cold)"
+rm -f "$RUNBOOK_DIR"/*.md
+ok "runbook directory cleared"
+for scenario in worker_deploy_cascade db_config_rollout gateway_auth_rollout; do
+  banner "4 / solve: $scenario"
+  SRE_GYM_URL="$URL" "$PY" skill/tools/sre_gym_client.py solve "$scenario"
+  SRE_GYM_URL="$URL" "$PY" skill/tools/sre_gym_client.py record-runbook "$scenario"
+done
+banner "5 / verified runbooks now on disk"
+ls -1 "$RUNBOOK_DIR"/*.md | sed 's|^|  |'
+banner "6 / re-solve easy scenario — runbook is loaded this time"
+SRE_GYM_URL="$URL" "$PY" skill/tools/sre_gym_client.py solve worker_deploy_cascade | tail -4
+banner "done"
+echo "  install the skill globally:   ln -s \"$PWD/skill\" \"\$HOME/.claude/skills/sre-gym\""
+echo "  env log:                      /tmp/sre_gym_demo.log"
+echo "  runbooks:                     $RUNBOOK_DIR/"

deploy/push_to_hf.sh ADDED Viewed

	@@ -0,0 +1,58 @@

+#!/usr/bin/env bash
+# Deploy this repo to a Hugging Face Space (Docker SDK).
+#
+# Required:
+#   HF_TOKEN      write-scoped HF access token
+#   HF_SPACE_ID   e.g. yourname/sre-gym  (create it at huggingface.co/new-space
+#                 first, SDK=Docker, or let this script try to create it)
+#
+# Usage:
+#   HF_TOKEN=hf_xxx HF_SPACE_ID=yourname/sre-gym ./deploy/push_to_hf.sh
+#
+# After a successful push, verify from a different network:
+#   curl https://${space_subdomain}.hf.space/health
+#   curl https://${space_subdomain}.hf.space/tasks | jq '.scenarios[].difficulty'
+set -euo pipefail
+cd "$(dirname "$0")/.."
+: "${HF_TOKEN:?HF_TOKEN is required}"
+: "${HF_SPACE_ID:?HF_SPACE_ID is required, e.g. yourname/sre-gym}"
+if ! command -v huggingface-cli > /dev/null; then
+  echo "error: huggingface-cli not installed. pip install 'huggingface_hub[cli]'" >&2
+  exit 1
+fi
+echo "== syncing openenv.yaml with HF_SPACE_ID =="
+python3 - <<PY
+import pathlib, re
+path = pathlib.Path("openenv.yaml")
+text = path.read_text()
+text = re.sub(r"^  space_id:.*$", f"  space_id: $HF_SPACE_ID", text, flags=re.M)
+path.write_text(text)
+print(f"openenv.yaml space_id -> $HF_SPACE_ID")
+PY
+echo "== ensuring the space exists (idempotent) =="
+huggingface-cli repo create "$HF_SPACE_ID" \
+  --type space \
+  --space_sdk docker \
+  --token "$HF_TOKEN" \
+  --yes 2>&1 | grep -v "already created" || true
+echo "== uploading repo =="
+huggingface-cli upload "$HF_SPACE_ID" . \
+  --repo-type space \
+  --token "$HF_TOKEN" \
+  --commit-message "deploy sre-gym v2 (easy/medium/hard scenarios)"
+subdomain="$(echo "$HF_SPACE_ID" | tr '/' '-')"
+echo
+echo "== deployment kicked off =="
+echo "   Logs:     https://huggingface.co/spaces/$HF_SPACE_ID"
+echo "   Public:   https://$subdomain.hf.space"
+echo
+echo "== verify from a different network (phone hotspot) =="
+echo "   curl https://$subdomain.hf.space/health"
+echo "   curl https://$subdomain.hf.space/tasks | jq '.scenarios[].difficulty'"

openclaw_integration/README.md ADDED Viewed

	@@ -0,0 +1,80 @@

+# OpenClaw-RL integration — sre-gym shim
+Plugs `sre-gym` into OpenClaw-RL's training loop without forking OpenClaw-RL.
+Three artifacts:
+- `pool_server.py` — FastAPI HTTP server speaking OpenClaw's lease-based
+  contract (`/allocate /reset /exec_tool /evaluate /close`). Wraps
+  `UnifiedIncidentEnvironment` behind per-lease `asyncio.Lock`s.
+- `sre_env_client.py` — Drop-in replacement for OpenClaw-RL
+  `terminal-rl/env_client.py`. Same method signatures.
+- `generate_with_sre.py` — Planned import-patch wrapper for
+  `terminal-rl/generate.py` (stub — filled in Friday when the OpenClaw-RL
+  venv is set up).
+## Quick start
+```bash
+# 1. Launch the pool server
+source .venv/bin/activate
+uvicorn openclaw_integration.pool_server:app --host 0.0.0.0 --port 8100
+# 2. Smoke-test the lifecycle from another shell
+curl -sf http://127.0.0.1:8100/healthz | jq
+curl -s -X POST http://127.0.0.1:8100/allocate \
+     -H 'content-type: application/json' \
+     -d '{"task_key": "gateway_auth_rollout"}'
+```
+## Wiring into OpenClaw-RL
+In the OpenClaw-RL repo, after creating a fresh venv per their instructions,
+point the rollout agent at our server:
+```bash
+export ENV_SERVER_URL=http://127.0.0.1:8100
+```
+Then patch one import in `OpenClaw-RL/terminal-rl/generate.py`:
+```diff
+- from env_client import create_env_client
++ import sys; sys.path.insert(0, "/path/to/sre-enginnerllm")
++ from openclaw_integration.sre_env_client import create_env_client
+```
+No other OpenClaw-RL source files need to change. The
+`run_qwen35_4b_openclaw_rl.sh` launch script works as-is after that.
+## Task keys (scenarios)
+- `worker_deploy_cascade` (easy)
+- `db_config_rollout` (medium)
+- `gateway_auth_rollout` (hard)
+## Lifecycle contract
+```
+allocate(task_key)                    -> {ok: true, lease_id}
+reset(lease_id, task_meta, run_ctx)   -> {ok: true, observation: "<json>"}
+exec_tool(lease_id, tool_call)        -> {ok: true, observation: "<json>"}
+evaluate(lease_id)                    -> {ok: true, score: float}
+close(lease_id)                       -> {ok: true}
+```
+- `task_meta.scenario_id` takes precedence over `task_key` at reset time if
+  set (useful for procgen Friday).
+- `tool_call.name` maps directly to `UnifiedIncidentAction.action_type`.
+- `tool_call.arguments` is the kwargs dict (service, metric, check_name,
+  hypothesis).
+- An invalid action is returned as an observation `{"error": "...",
+  "tool_call": {...}}` rather than raising — training gets the negative
+  signal without crashing the rollout.
+## Lease TTL / reaper
+- `POOL_SERVER_LEASE_TTL_S` (default 600s) — lease idle timeout.
+- `POOL_SERVER_REAPER_PERIOD` (default 30s) — reaper tick period.
+Reaper runs in lifespan background task; evicts idle leases so long
+training runs don't leak env instances.

openclaw_integration/__init__.py ADDED Viewed

	@@ -0,0 +1,6 @@

+"""OpenClaw-RL integration shim for sre-gym.
+This package exposes sre-gym through the lease-based HTTP contract used by
+OpenClaw-RL's `terminal-rl/` and `swe-rl/` training loops, so the existing
+OpenClaw-RL rollout+training scripts can target this env without code forks.
+"""

openclaw_integration/generate_with_sre.py ADDED Viewed

	@@ -0,0 +1,27 @@

+"""Import-patch adapter for OpenClaw-RL's `terminal-rl/generate.py`.
+STUB — filled in Friday when the OpenClaw-RL venv is set up.
+The shape is minimal: OpenClaw-RL's `terminal-rl/generate.py` does
+`from env_client import create_env_client`. All we need is to redirect that
+import to our client. Two options, pick one Friday:
+  Option A: monkey-patch via PYTHONPATH + shim module
+    export PYTHONPATH="/path/to/sre-enginnerllm:$PYTHONPATH"
+    mkdir -p /tmp/openclaw_shim && cd /tmp/openclaw_shim
+    cat > env_client.py <<'PY'
+    from openclaw_integration.sre_env_client import create_env_client
+    PY
+    export PYTHONPATH="/tmp/openclaw_shim:$PYTHONPATH"
+  Option B: patch generate.py directly
+    sed -i 's|from env_client import create_env_client|from openclaw_integration.sre_env_client import create_env_client|' \
+        /path/to/OpenClaw-RL/terminal-rl/generate.py
+Option A is reversible and cleaner. Option B is one line and survives a
+pip install -e.
+This file is intentionally empty beyond this docstring to keep the shim
+surface area tiny. When Friday work begins, the actual adapter (if any is
+needed beyond the import swap) lives here.
+"""

openclaw_integration/pool_server.py ADDED Viewed

	@@ -0,0 +1,273 @@

+"""FastAPI pool server exposing sre-gym in OpenClaw-RL's lease-based contract.
+OpenClaw-RL's rollout agent drives an env with this lifecycle per episode:
+    allocate(task_key)  -> {lease_id}
+    reset(lease_id, task_meta, run_ctx)
+    exec_tool(lease_id, tool_call)  -> observation_string   # repeated
+    evaluate(lease_id)              -> score
+    close(lease_id)
+We wrap a `UnifiedIncidentEnvironment` instance per lease. Lease state is
+guarded by per-lease `asyncio.Lock` so 8-way concurrent rollouts on the same
+server stay consistent. Idle leases are reaped after LEASE_TTL_S seconds.
+Run standalone:
+    uvicorn openclaw_integration.pool_server:app --host 0.0.0.0 --port 8100
+Env vars:
+    POOL_SERVER_LEASE_TTL_S   default 600
+    POOL_SERVER_REAPER_PERIOD default 30
+"""
+from __future__ import annotations
+import asyncio
+import json
+import logging
+import os
+import sys
+import time
+import uuid
+from contextlib import asynccontextmanager
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+from fastapi import FastAPI
+from pydantic import BaseModel, Field
+# Make the sibling package importable when launched via uvicorn from anywhere.
+_REPO_ROOT = Path(__file__).resolve().parent.parent
+if str(_REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(_REPO_ROOT))
+from unified_incident_env.models import UnifiedIncidentAction  # noqa: E402
+from unified_incident_env.server.challenge import SCENARIOS  # noqa: E402
+from unified_incident_env.server.environment import UnifiedIncidentEnvironment  # noqa: E402
+logger = logging.getLogger("sre_gym.pool_server")
+logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
+LEASE_TTL_S = float(os.getenv("POOL_SERVER_LEASE_TTL_S", "600"))
+REAPER_PERIOD_S = float(os.getenv("POOL_SERVER_REAPER_PERIOD", "30"))
+@dataclass
+class Lease:
+    lease_id: str
+    task_key: str
+    env: UnifiedIncidentEnvironment
+    lock: asyncio.Lock = field(default_factory=asyncio.Lock)
+    last_touch: float = field(default_factory=time.time)
+    reset_done: bool = False
+    final_score: float | None = None
+    def touch(self) -> None:
+        self.last_touch = time.time()
+class AllocateRequest(BaseModel):
+    task_key: str
+    request_id: str | None = None
+class LeaseRequest(BaseModel):
+    lease_id: str
+class ResetRequest(BaseModel):
+    lease_id: str
+    task_meta: dict[str, Any] = Field(default_factory=dict)
+    run_ctx: dict[str, Any] = Field(default_factory=dict)
+    task_timeouts: dict[str, Any] | None = None
+class ToolCall(BaseModel):
+    name: str
+    arguments: dict[str, Any] = Field(default_factory=dict)
+class ExecToolRequest(BaseModel):
+    lease_id: str
+    tool_call: ToolCall
+class LeasePool:
+    def __init__(self) -> None:
+        self._leases: dict[str, Lease] = {}
+        self._dict_lock = asyncio.Lock()
+    async def allocate(self, task_key: str) -> Lease:
+        if task_key not in SCENARIOS:
+            raise ValueError(f"Unknown task_key {task_key!r}; known: {list(SCENARIOS)}")
+        env = UnifiedIncidentEnvironment()
+        lease = Lease(lease_id=str(uuid.uuid4()), task_key=task_key, env=env)
+        async with self._dict_lock:
+            self._leases[lease.lease_id] = lease
+        logger.info("allocate: lease=%s task=%s", lease.lease_id, task_key)
+        return lease
+    async def get(self, lease_id: str) -> Lease:
+        async with self._dict_lock:
+            lease = self._leases.get(lease_id)
+        if lease is None:
+            raise KeyError(f"Unknown lease {lease_id}")
+        lease.touch()
+        return lease
+    async def close(self, lease_id: str) -> bool:
+        async with self._dict_lock:
+            lease = self._leases.pop(lease_id, None)
+        if lease is None:
+            return False
+        logger.info("close: lease=%s task=%s", lease_id, lease.task_key)
+        return True
+    async def reap(self) -> int:
+        now = time.time()
+        stale: list[str] = []
+        async with self._dict_lock:
+            for lease_id, lease in list(self._leases.items()):
+                if now - lease.last_touch > LEASE_TTL_S:
+                    stale.append(lease_id)
+            for lease_id in stale:
+                self._leases.pop(lease_id, None)
+        if stale:
+            logger.info("reaper: evicted %d stale lease(s)", len(stale))
+        return len(stale)
+    def active_count(self) -> int:
+        return len(self._leases)
+pool = LeasePool()
+async def _reaper_loop() -> None:
+    while True:
+        try:
+            await pool.reap()
+        except Exception:
+            logger.exception("reaper loop tick failed")
+        await asyncio.sleep(REAPER_PERIOD_S)
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    task = asyncio.create_task(_reaper_loop())
+    try:
+        yield
+    finally:
+        task.cancel()
+        try:
+            await task
+        except asyncio.CancelledError:
+            pass
+app = FastAPI(title="sre-gym OpenClaw pool server", lifespan=lifespan)
+def _observation_string(obs: Any, *, reward: float | None = None) -> str:
+    """Render a UnifiedIncidentObservation as the single string OpenClaw
+    rollout agents expect from exec_tool."""
+    payload = {
+        "tick": obs.tick_count,
+        "workflow_stage": obs.workflow_stage,
+        "last_action_result": obs.last_action_result,
+        "tool_output": obs.tool_output,
+        "failure_type": obs.failure_type,
+        "why_failed": obs.why_failed,
+        "loop_warning": obs.loop_warning,
+        "reward": reward,
+        "checks": [{"name": c.name, "passed": c.passed} for c in obs.checks],
+        "active_alerts": [{"service": a.service, "severity": a.severity, "message": a.message} for a in obs.active_alerts],
+        "noise_alerts": [{"service": a.service, "severity": a.severity, "message": a.message} for a in obs.noise_alerts],
+        "service_health": {name: s.status for name, s in obs.service_health.items()},
+        "allowed_actions": obs.allowed_actions,
+        "required_fields_by_action": obs.required_fields_by_action,
+        "blast_radius": obs.blast_radius,
+        "final_score": obs.final_score,
+        "done": obs.done,
+        "prompt_text": obs.prompt_text,
+    }
+    return json.dumps(payload, separators=(",", ":"))
+@app.get("/healthz")
+async def healthz() -> dict[str, Any]:
+    return {"ok": True, "active_leases": pool.active_count(), "scenarios": list(SCENARIOS.keys())}
+@app.post("/allocate")
+async def allocate(request: AllocateRequest) -> dict[str, Any]:
+    try:
+        lease = await pool.allocate(request.task_key)
+    except ValueError as exc:
+        return {"ok": False, "error": str(exc)}
+    return {"ok": True, "lease_id": lease.lease_id, "task_key": lease.task_key, "request_id": request.request_id}
+@app.post("/heartbeat")
+async def heartbeat(request: LeaseRequest) -> dict[str, Any]:
+    try:
+        await pool.get(request.lease_id)
+    except KeyError as exc:
+        return {"ok": False, "error": str(exc)}
+    return {"ok": True}
+@app.post("/reset")
+async def reset(request: ResetRequest) -> dict[str, Any]:
+    try:
+        lease = await pool.get(request.lease_id)
+    except KeyError as exc:
+        return {"ok": False, "error": str(exc)}
+    async with lease.lock:
+        scenario_id = request.task_meta.get("scenario_id") or lease.task_key
+        obs = lease.env.reset(scenario_id=scenario_id)
+        lease.reset_done = True
+        lease.final_score = None
+    return {"ok": True, "observation": _observation_string(obs)}
+@app.post("/exec_tool")
+async def exec_tool(request: ExecToolRequest) -> dict[str, Any]:
+    try:
+        lease = await pool.get(request.lease_id)
+    except KeyError as exc:
+        return {"ok": False, "error": str(exc)}
+    if not lease.reset_done:
+        return {"ok": False, "error": "reset has not been called for this lease"}
+    action_kwargs = {"action_type": request.tool_call.name, **request.tool_call.arguments}
+    try:
+        action = UnifiedIncidentAction(**action_kwargs)
+    except Exception as exc:
+        # Return the validation error to the rollout agent as a no-op
+        # observation so training sees the failure signal without crashing.
+        return {"ok": True, "observation": json.dumps({"error": f"invalid action: {exc}", "tool_call": request.tool_call.model_dump()})}
+    async with lease.lock:
+        obs = lease.env.step(action)
+        lease.final_score = float(obs.final_score)
+    return {"ok": True, "observation": _observation_string(obs, reward=float(obs.reward))}
+@app.post("/evaluate")
+async def evaluate(request: LeaseRequest) -> dict[str, Any]:
+    try:
+        lease = await pool.get(request.lease_id)
+    except KeyError as exc:
+        return {"ok": False, "error": str(exc)}
+    score = lease.final_score if lease.final_score is not None else float(lease.env.state.final_score)
+    return {"ok": True, "score": score}
+@app.post("/close")
+async def close(request: LeaseRequest) -> dict[str, Any]:
+    closed = await pool.close(request.lease_id)
+    if not closed:
+        return {"ok": False, "error": f"Unknown lease {request.lease_id}"}
+    return {"ok": True}

openclaw_integration/sre_env_client.py ADDED Viewed

	@@ -0,0 +1,159 @@

+"""Drop-in replacement for OpenClaw-RL `terminal-rl/env_client.py`.
+Interface matches `TerminalEnvClient` (allocate / heartbeat / reset /
+exec_tool / evaluate / close) so OpenClaw-RL's rollout agent can swap imports
+with one line.
+Standalone (no slime dep) — uses httpx directly. To use slime's retrying
+post() helper instead, replace `_post` with `slime.utils.http_utils.post`.
+Env vars:
+    ENV_SERVER_URL                   required, e.g. http://127.0.0.1:8100
+    ENV_HTTP_MAX_RETRIES             default 10
+    ENV_ALLOCATE_MAX_RETRIES         default 10
+    ENV_EVALUATE_MAX_RETRIES         default 1
+    ENV_CLOSE_MAX_RETRIES            default 3
+    ENV_EXEC_TOOL_MAX_RETRIES        default 3
+    ENV_HTTP_TIMEOUT_S               default 30
+"""
+from __future__ import annotations
+import asyncio
+import logging
+import os
+from typing import Any
+import httpx
+logger = logging.getLogger(__name__)
+def create_env_client() -> "SreEnvClient":
+    env_server_url = os.getenv("ENV_SERVER_URL", "")
+    if not env_server_url:
+        raise RuntimeError("ENV_SERVER_URL is empty.")
+    return SreEnvClient(env_server_url)
+async def _post(
+    url: str,
+    payload: dict[str, Any],
+    *,
+    max_retries: int,
+    timeout_s: float,
+) -> dict[str, Any]:
+    last_exc: Exception | None = None
+    for attempt in range(max_retries):
+        try:
+            async with httpx.AsyncClient(timeout=timeout_s) as client:
+                response = await client.post(url, json=payload)
+                response.raise_for_status()
+                return response.json()
+        except Exception as exc:  # retry all transport errors
+            last_exc = exc
+            wait = min(2 ** attempt * 0.25, 5.0)
+            logger.debug("POST %s failed (attempt %d/%d): %s", url, attempt + 1, max_retries, exc)
+            await asyncio.sleep(wait)
+    raise RuntimeError(f"POST {url} failed after {max_retries} retries: {last_exc}")
+class SreEnvClient:
+    """OpenClaw-RL-shaped client for the sre-gym pool server."""
+    def __init__(self, base_url: str) -> None:
+        self.base_url = base_url.rstrip("/")
+        self.default_max_retries = int(os.getenv("ENV_HTTP_MAX_RETRIES", "10"))
+        self.allocate_max_retries = int(os.getenv("ENV_ALLOCATE_MAX_RETRIES", "10"))
+        self.evaluate_max_retries = int(os.getenv("ENV_EVALUATE_MAX_RETRIES", "1"))
+        self.close_max_retries = int(os.getenv("ENV_CLOSE_MAX_RETRIES", "3"))
+        self.exec_tool_max_retries = int(os.getenv("ENV_EXEC_TOOL_MAX_RETRIES", "3"))
+        self.timeout_s = float(os.getenv("ENV_HTTP_TIMEOUT_S", "30"))
+    async def allocate(self, task_key: str, request_id: str | None = None) -> dict[str, Any]:
+        out = await _post(
+            f"{self.base_url}/allocate",
+            {"task_key": task_key, "request_id": request_id},
+            max_retries=self.allocate_max_retries,
+            timeout_s=self.timeout_s,
+        )
+        if not out.get("ok", False):
+            raise RuntimeError(f"allocate failed: {out}")
+        return out
+    async def heartbeat(self, lease_id: str) -> None:
+        out = await _post(
+            f"{self.base_url}/heartbeat",
+            {"lease_id": lease_id},
+            max_retries=self.default_max_retries,
+            timeout_s=self.timeout_s,
+        )
+        if not out.get("ok", False):
+            raise RuntimeError(f"heartbeat failed: {out}")
+    async def reset(
+        self,
+        lease_id: str,
+        task_meta: dict[str, Any],
+        run_ctx: dict[str, Any],
+        task_timeouts: dict[str, Any] | None = None,
+    ) -> dict[str, Any]:
+        out = await _post(
+            f"{self.base_url}/reset",
+            {
+                "lease_id": lease_id,
+                "task_meta": task_meta,
+                "run_ctx": run_ctx,
+                "task_timeouts": task_timeouts,
+            },
+            max_retries=self.default_max_retries,
+            timeout_s=self.timeout_s,
+        )
+        if not out.get("ok", False):
+            raise RuntimeError(f"reset failed: {out}")
+        return out
+    async def exec_tool(self, lease_id: str, tool_name: str, arguments: dict[str, Any]) -> str:
+        out = await _post(
+            f"{self.base_url}/exec_tool",
+            {
+                "lease_id": lease_id,
+                "tool_call": {"name": tool_name, "arguments": arguments},
+            },
+            max_retries=self.exec_tool_max_retries,
+            timeout_s=self.timeout_s,
+        )
+        if not out.get("ok", False):
+            raise RuntimeError(f"exec_tool failed: {out}")
+        return str(out.get("observation", ""))
+    async def evaluate(self, lease_id: str) -> float:
+        out = await _post(
+            f"{self.base_url}/evaluate",
+            {"lease_id": lease_id},
+            max_retries=self.evaluate_max_retries,
+            timeout_s=self.timeout_s,
+        )
+        if not out.get("ok", False):
+            raise RuntimeError(f"evaluate failed: {out}")
+        return float(out.get("score", 0.0))
+    async def close(self, lease_id: str) -> None:
+        try:
+            out = await _post(
+                f"{self.base_url}/close",
+                {"lease_id": lease_id},
+                max_retries=self.close_max_retries,
+                timeout_s=self.timeout_s,
+            )
+        except Exception as exc:
+            if "Unknown lease" in str(exc):
+                logger.debug("close(%s): lease already gone", lease_id)
+                return
+            raise
+        if not out.get("ok", False):
+            error_msg = str(out.get("error", ""))
+            if "Unknown" in error_msg and "lease" in error_msg.lower():
+                logger.debug("close(%s): lease already gone", lease_id)
+                return
+            raise RuntimeError(f"close failed: {out}")

openenv.yaml CHANGED Viewed

@@ -12,10 +12,10 @@ environment:
   observation_type: UnifiedIncidentObservation
   state_type: UnifiedIncidentState
   max_steps: 12
-  difficulties: [easy]
   reward_type: dense
 huggingface:
-  space_id: gylder/my-env
   sdk: docker
   hardware: cpu-basic

   observation_type: UnifiedIncidentObservation
   state_type: UnifiedIncidentState
   max_steps: 12
+  difficulties: [easy, medium, hard]
   reward_type: dense
 huggingface:
+  space_id: dakshdoesdev/sre-gym
   sdk: docker
   hardware: cpu-basic

skill/SKILL.md ADDED Viewed

	@@ -0,0 +1,100 @@

+---
+name: sre-gym
+description: SRE incident-response training environment with fault injection and deterministic grading. Use when the user wants to practice SRE skills, solve an injected production incident, or run one of three scenarios (worker_deploy_cascade / db_config_rollout / gateway_auth_rollout) against the sre-gym HTTP server. Invokes scripts in skill/tools/ to query the env and records verified runbooks after clean solves.
+---
+# SRE Gym — Incident Response Skill
+You are an SRE agent connected to a running sre-gym environment (HTTP, default `http://127.0.0.1:8000`). The env simulates production incidents with decoy services, deterministic grading, and explicit resolution checks. Your job is to diagnose from evidence, pick the correct remediation, verify recovery, then declare resolved.
+## When to use this skill
+- The user names a scenario (`worker_deploy_cascade`, `db_config_rollout`, `gateway_auth_rollout`) or says "solve an incident / run SRE scenario"
+- The user asks you to practice, benchmark, or demo incident response
+- The user points you at an sre-gym URL
+## Core rules (never break these)
+1. **Never guess at remediation.** Query evidence (`query_logs`, `query_deploys`, `query_metrics`) before `rollback_deploy` / `restart_service`.
+2. **Root cause before restart.** Restarting a service before rolling back the triggering change re-inherits the bad state.
+3. **Never call `declare_resolved` before the scenario's resolution check passes.** Each scenario specifies which check is required; read it from `observation.checks` and from any loaded runbook.
+4. **Watch for decoys.** Each scenario has a plausible-looking wrong answer. Example: `db_config_rollout` has a recent worker deploy that is *not* the cause. Read logs before committing to a target.
+5. **Repeating the same no-progress action wastes ticks.** The env emits `loop_warning` when you do this — treat it as a hard signal to try a different evidence source.
+## Workflow
+### 1. Load prior knowledge
+Before your first action, check `skill/verified-runbooks/{scenario_id}.md`. If it exists, read it — it's a log of previously-successful solves for this exact scenario, written by earlier runs of this skill. Use the winning path and the decoy list.
+### 2. Drive the env
+Use `skill/tools/sre_gym_client.py` to call the env:
+```bash
+python skill/tools/sre_gym_client.py list           # show available scenarios
+python skill/tools/sre_gym_client.py reset <id>     # start an episode
+python skill/tools/sre_gym_client.py step '<json>'  # take one action
+python skill/tools/sre_gym_client.py status         # current obs + grader
+```
+Action JSON matches the env's `UnifiedIncidentAction` model. Examples:
+```json
+{"action_type": "query_logs", "service": "database"}
+{"action_type": "query_deploys", "service": "worker"}
+{"action_type": "rollback_deploy", "service": "database"}
+{"action_type": "run_check", "check_name": "end_to_end"}
+{"action_type": "declare_resolved"}
+```
+### 3. Investigation loop (per tick)
+1. Read `observation.prompt_text` — services, alerts, last result, failure_type, why_failed.
+2. If `observation.failure_type` is set, your previous action was rejected — **do not repeat it**, read `why_failed` and pick a different evidence source or remediation.
+3. Form a hypothesis with `submit_hypothesis` once you have enough evidence (usually 2–4 queries). Calibrate `confidence`: ≥0.7 only if you're sure.
+4. Remediate (`rollback_deploy` → `restart_service` if scenario requires → `run_check`).
+5. `declare_resolved` only after the required check passes.
+### 4. Record the runbook
+If the episode finishes with `incident_resolved=true` and `final_score > 0.85`, run:
+```bash
+python skill/tools/sre_gym_client.py record-runbook <scenario_id>
+```
+This appends a new entry to `skill/verified-runbooks/{scenario_id}.md`. Future runs of this skill (yours or another Claude's) load it automatically.
+## Action reference (11 actions)
+| Action | Required fields | Purpose |
+|---|---|---|
+| `query_logs` | `service` | Read service-level error logs |
+| `query_metrics` | `service`, `metric` (cpu/error_rate/latency) | Read quantitative signals |
+| `query_dependencies` | `service` | Map upstream/downstream |
+| `query_deploys` | `service` | Recent deploy history |
+| `rollback_deploy` | `service` | Revert last deploy — SCENARIO-SPECIFIC TARGET |
+| `restart_service` | `service` | Reboot a service (usually after rollback) |
+| `run_check` | `check_name` (`database_recovery` / `end_to_end`) | Objective recovery check |
+| `isolate_service` | `service` | Containment only, does not resolve |
+| `escalate` | — | Record escalation note |
+| `submit_hypothesis` | `hypothesis` object | Commit RCA with confidence calibration |
+| `declare_resolved` | — | Finalize; rejected if required check has not passed |
+## Scoring rubric (deterministic from the env)
+- **Recovery (0–0.4):** services healthy on the critical path
+- **Containment (0–0.3):** root cause removed OR offending service isolated
+- **Verification (0–0.35):** both checks passed
+- **Impact (0–0.15):** user_impact reduced
+- **Efficiency (0–0.10):** budget preserved, no wasteful repeats
+Clean solve target: **> 0.85**. That's the runbook-record threshold.
+## Decoy knowledge (read before hypothesizing)
+- `worker_deploy_cascade`: the only true cause; no decoys.
+- `db_config_rollout`: the recent worker deploy is a **decoy**. Rolling back worker yields `wrong_remediation_target`.
+- `gateway_auth_rollout`: the recent worker deploy (`worker@...-hotfix` — log-format tweak) is a **decoy**. The gateway auth rollout is the cause.
+If you take a wrong remediation, the env returns `failure_type="wrong_remediation_target"` and a negative reward — **do not retry the same wrong target**, re-read the logs.

skill/tools/sre_gym_client.py ADDED Viewed

	@@ -0,0 +1,238 @@

+#!/usr/bin/env python3
+"""CLI client for the sre-gym skill.
+Usage:
+    sre_gym_client.py list
+    sre_gym_client.py solve <scenario_id> [--policy baseline]
+    sre_gym_client.py interactive <scenario_id>   # stdin: one JSON action per line
+    sre_gym_client.py record-runbook <scenario_id> <session.json>
+Because OpenEnv's HTTP /reset and /step handlers create a fresh environment per
+call, episode state only persists within a single client session. This CLI wraps
+one episode inside one Python process so the session is preserved.
+SRE_GYM_URL env var overrides the base URL (default http://127.0.0.1:8000).
+"""
+from __future__ import annotations
+import datetime as _dt
+import json
+import os
+import sys
+from pathlib import Path
+from typing import Any
+# Make the sibling package importable whether the script is invoked from the
+# repo root or from the skill/ directory directly.
+_REPO_ROOT = Path(__file__).resolve().parent.parent.parent
+if str(_REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(_REPO_ROOT))
+from unified_incident_env.client import UnifiedIncidentEnv  # noqa: E402
+from unified_incident_env.models import UnifiedIncidentAction, UnifiedIncidentObservation  # noqa: E402
+from unified_incident_env.server.challenge import SCENARIOS, list_baselines  # noqa: E402
+BASE_URL = os.environ.get("SRE_GYM_URL", "http://127.0.0.1:8000").rstrip("/")
+RUNBOOK_DIR = Path(__file__).resolve().parent.parent / "verified-runbooks"
+SCORE_THRESHOLD = 0.85
+def _clean_action(action: UnifiedIncidentAction) -> dict[str, Any]:
+    data = action.model_dump(exclude_none=True)
+    if data.get("metadata") == {}:
+        data.pop("metadata")
+    hypothesis = data.get("hypothesis")
+    if isinstance(hypothesis, dict) and hypothesis.get("metadata") == {}:
+        hypothesis.pop("metadata", None)
+    return data
+def _summarize_obs(obs: UnifiedIncidentObservation) -> dict[str, Any]:
+    return {
+        "tick": obs.tick_count,
+        "workflow_stage": obs.workflow_stage,
+        "last_action_result": obs.last_action_result,
+        "tool_output": obs.tool_output,
+        "failure_type": obs.failure_type,
+        "why_failed": obs.why_failed,
+        "loop_warning": obs.loop_warning,
+        "checks": [{"name": c.name, "passed": c.passed} for c in obs.checks],
+        "final_score": obs.final_score,
+        "incident_resolved": obs.incident_resolved,
+    }
+def _session_path(scenario_id: str) -> Path:
+    return Path(f"/tmp/sre_gym_session.{scenario_id}.json")
+def cmd_list() -> None:
+    for scenario in SCENARIOS.values():
+        print(f"  {scenario['difficulty']:<6} {scenario['id']:<25} {scenario['name']}")
+def cmd_solve(scenario_id: str, policy: str = "baseline") -> None:
+    """Run an entire episode end-to-end inside one process."""
+    if scenario_id not in SCENARIOS:
+        print(f"error: unknown scenario {scenario_id!r}", file=sys.stderr)
+        sys.exit(2)
+    if policy != "baseline":
+        print(f"error: unknown policy {policy!r} (only 'baseline' available)", file=sys.stderr)
+        sys.exit(2)
+    trace: list[dict[str, Any]] = []
+    with UnifiedIncidentEnv(base_url=BASE_URL).sync() as env:
+        obs = env.reset(scenario_id=scenario_id).observation
+        print(f"[reset] scenario={scenario_id} difficulty={obs.difficulty}")
+        for step in list_baselines(scenario_id).baselines[0].actions:
+            result = env.step(step.action)
+            obs = result.observation
+            record = {
+                "step": obs.tick_count,
+                "action": _clean_action(step.action),
+                "rationale": step.rationale,
+                "reward": result.reward,
+                **_summarize_obs(obs),
+            }
+            trace.append(record)
+            action_repr = json.dumps(record["action"], separators=(",", ":"))
+            print(f"[step {obs.tick_count}] action={action_repr} reward={result.reward:+.2f} score={obs.final_score:.2f}")
+            if result.done:
+                break
+        final = _summarize_obs(obs)
+    _session_path(scenario_id).write_text(
+        json.dumps({"scenario_id": scenario_id, "trace": trace, "final": final}, indent=2),
+        encoding="utf-8",
+    )
+    print(
+        f"[done] resolved={final['incident_resolved']} score={final['final_score']:.2f} "
+        f"steps={final['tick']} session={_session_path(scenario_id)}"
+    )
+def cmd_interactive(scenario_id: str) -> None:
+    """One JSON action per stdin line. Preserves session for the whole process lifetime."""
+    if scenario_id not in SCENARIOS:
+        print(f"error: unknown scenario {scenario_id!r}", file=sys.stderr)
+        sys.exit(2)
+    trace: list[dict[str, Any]] = []
+    with UnifiedIncidentEnv(base_url=BASE_URL).sync() as env:
+        obs = env.reset(scenario_id=scenario_id).observation
+        print(json.dumps({"event": "reset", "scenario_id": scenario_id, "obs": _summarize_obs(obs)}), flush=True)
+        for line in sys.stdin:
+            line = line.strip()
+            if not line:
+                continue
+            try:
+                data = json.loads(line)
+                action = UnifiedIncidentAction(**data)
+            except Exception as exc:
+                print(json.dumps({"event": "error", "detail": str(exc)}), flush=True)
+                continue
+            result = env.step(action)
+            obs = result.observation
+            record = {"step": obs.tick_count, "action": _clean_action(action), "reward": result.reward, **_summarize_obs(obs)}
+            trace.append(record)
+            print(json.dumps({"event": "step", **record}), flush=True)
+            if result.done:
+                print(json.dumps({"event": "done", "final": _summarize_obs(obs)}), flush=True)
+                break
+    _session_path(scenario_id).write_text(
+        json.dumps({"scenario_id": scenario_id, "trace": trace, "final": _summarize_obs(obs)}, indent=2),
+        encoding="utf-8",
+    )
+def cmd_record_runbook(scenario_id: str, session_file: str | None = None) -> None:
+    """Append a new runbook entry if the referenced session cleared the threshold."""
+    path = Path(session_file) if session_file else _session_path(scenario_id)
+    if not path.exists():
+        print(f"error: no session file at {path}", file=sys.stderr)
+        sys.exit(2)
+    session = json.loads(path.read_text(encoding="utf-8"))
+    final = session.get("final", {})
+    score = float(final.get("final_score", 0.0))
+    if not final.get("incident_resolved"):
+        print(f"skip: session not resolved (resolved={final.get('incident_resolved')})", file=sys.stderr)
+        sys.exit(1)
+    if score < SCORE_THRESHOLD:
+        print(f"skip: score {score:.2f} below runbook threshold {SCORE_THRESHOLD:.2f}", file=sys.stderr)
+        sys.exit(1)
+    RUNBOOK_DIR.mkdir(parents=True, exist_ok=True)
+    runbook_path = RUNBOOK_DIR / f"{scenario_id}.md"
+    timestamp = _dt.datetime.now(_dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+    steps = int(final.get("tick", 0))
+    checks_passed = [c["name"] for c in final.get("checks", []) if c.get("passed")]
+    trace = session.get("trace", [])
+    header = (
+        f"# verified-runbooks/{scenario_id}.md\n\n"
+        "Runbook entries are written by the sre-gym skill after a successful solve "
+        f"(incident_resolved=true and final_score > {SCORE_THRESHOLD:.2f}).\n"
+        "Each entry is immutable evidence — treat it as ground truth for the winning path.\n\n---\n"
+    )
+    lines = [f"\n## Run {timestamp} — Score {score:.2f}\n"]
+    lines.append(f"- Steps: **{steps}**")
+    lines.append(f"- Checks passed: {', '.join(checks_passed) or 'none'}")
+    lines.append("")
+    lines.append("**Winning path:**")
+    for entry in trace:
+        act = entry["action"]
+        action_type = act.get("action_type")
+        extras = ", ".join(
+            f"{k}={v if not isinstance(v, dict) else v.get('root_cause', v)}"
+            for k, v in act.items()
+            if k != "action_type" and v not in (None, {})
+        )
+        extra_str = f" ({extras})" if extras else ""
+        rationale = entry.get("rationale", "").rstrip(".")
+        lines.append(f"{entry['step']}. `{action_type}{extra_str}` — {rationale}")
+    lines.append("")
+    entry_text = "\n".join(lines)
+    if not runbook_path.exists():
+        runbook_path.write_text(header + entry_text, encoding="utf-8")
+    else:
+        with runbook_path.open("a", encoding="utf-8") as f:
+            f.write(entry_text)
+    print(f"recorded runbook entry → {runbook_path} (score {score:.2f}, {steps} steps)")
+def main() -> None:
+    argv = sys.argv[1:]
+    if not argv:
+        print(__doc__, file=sys.stderr)
+        sys.exit(2)
+    cmd, *rest = argv
+    if cmd == "list":
+        cmd_list()
+    elif cmd == "solve":
+        if not rest:
+            print("error: solve requires <scenario_id>", file=sys.stderr)
+            sys.exit(2)
+        cmd_solve(rest[0], rest[1] if len(rest) > 1 else "baseline")
+    elif cmd == "interactive":
+        if not rest:
+            print("error: interactive requires <scenario_id>", file=sys.stderr)
+            sys.exit(2)
+        cmd_interactive(rest[0])
+    elif cmd == "record-runbook":
+        if not rest:
+            print("error: record-runbook requires <scenario_id>", file=sys.stderr)
+            sys.exit(2)
+        cmd_record_runbook(rest[0], rest[1] if len(rest) > 1 else None)
+    else:
+        print(f"error: unknown command {cmd!r}", file=sys.stderr)
+        print(__doc__, file=sys.stderr)
+        sys.exit(2)
+if __name__ == "__main__":
+    main()

skill/verified-runbooks/.gitkeep ADDED Viewed

File without changes

skill/verified-runbooks/db_config_rollout.md ADDED Viewed

	@@ -0,0 +1,23 @@

+# verified-runbooks/db_config_rollout.md
+Runbook entries are written by the sre-gym skill after a successful solve (incident_resolved=true and final_score > 0.85).
+Each entry is immutable evidence — treat it as ground truth for the winning path.
+---
+## Run 2026-04-23T22:01:33Z — Score 0.99
+- Steps: **10**
+- Checks passed: database_recovery, end_to_end
+**Winning path:**
+1. `query_logs (service=database)` — Database is the loudest alert; inspect logs for the actual error signature
+2. `query_deploys (service=database)` — Pool-acquire errors suggest a config change; check recent database rollouts
+3. `query_metrics (service=database, metric=error_rate)` — Confirm the error pattern is pool exhaustion rather than compute overload
+4. `query_logs (service=worker)` — Rule out the decoy worker deploy by reading worker logs directly
+5. `submit_hypothesis (hypothesis=database_only_failure)` — Localize the fault to the database config before remediating
+6. `rollback_deploy (service=database)` — Roll back the offending database config rollout
+7. `restart_service (service=database)` — Restart the database cleanly against the restored pool config
+8. `run_check (check_name=database_recovery)` — Verify database pool health and write latency are back within SLO
+9. `run_check (check_name=end_to_end)` — Verify gateway write-path traffic succeeds end-to-end
+10. `declare_resolved` — Declare resolved only after objective checks pass

skill/verified-runbooks/gateway_auth_rollout.md ADDED Viewed

	@@ -0,0 +1,21 @@

+# verified-runbooks/gateway_auth_rollout.md
+Runbook entries are written by the sre-gym skill after a successful solve (incident_resolved=true and final_score > 0.85).
+Each entry is immutable evidence — treat it as ground truth for the winning path.
+---
+## Run 2026-04-23T22:01:37Z — Score 0.99
+- Steps: **8**
+- Checks passed: database_recovery, end_to_end
+**Winning path:**
+1. `query_logs (service=api-gateway)` — Gateway is rejecting logins; read gateway logs to localize the rejection class
+2. `query_deploys (service=api-gateway)` — Login rejection aligns with a recent auth middleware rollout; confirm deploy timing
+3. `query_deploys (service=worker)` — Rule out the worker deploy explicitly rather than assuming
+4. `submit_hypothesis (hypothesis=api_gateway_fault)` — Commit a calibrated hypothesis localizing to the gateway auth rollout
+5. `rollback_deploy (service=api-gateway)` — Roll back the bad auth middleware rollout; no restart needed
+6. `run_check (check_name=end_to_end)` — Verify that gateway login traffic now succeeds end-to-end
+7. `run_check (check_name=database_recovery)` — Confirm the database is (and stayed) healthy throughout
+8. `declare_resolved` — Declare resolved only after objective checks pass

skill/verified-runbooks/worker_deploy_cascade.md ADDED Viewed

	@@ -0,0 +1,23 @@

+# verified-runbooks/worker_deploy_cascade.md
+Runbook entries are written by the sre-gym skill after a successful solve (incident_resolved=true and final_score > 0.85).
+Each entry is immutable evidence — treat it as ground truth for the winning path.
+---
+## Run 2026-04-23T22:01:29Z — Score 0.99
+- Steps: **10**
+- Checks passed: database_recovery, end_to_end
+**Winning path:**
+1. `query_deploys (service=worker)` — Check whether any recent deploy aligns with the incident start
+2. `query_logs (service=worker)` — Inspect worker logs because deploy timing and queue pressure suggest worker-originated harm
+3. `query_metrics (service=database, metric=cpu)` — Confirm that the database is overloaded as a downstream effect
+4. `query_dependencies (service=api-gateway)` — Verify the gateway depends on the worker and database path
+5. `submit_hypothesis (hypothesis=bad_worker_deploy)` — Commit a calibrated hypothesis before taking an invasive mitigation step
+6. `rollback_deploy (service=worker)` — Remove the triggering change before restarting downstream services
+7. `restart_service (service=database)` — Bring the database back cleanly after the root cause is removed
+8. `run_check (check_name=database_recovery)` — Verify the database is no longer crashing
+9. `run_check (check_name=end_to_end)` — Verify gateway traffic succeeds end-to-end
+10. `declare_resolved` — Declare resolved only after objective checks pass

train/collect_trajectories.py ADDED Viewed

	@@ -0,0 +1,471 @@

+"""Parallel async harness for collecting Claude-driven sre-gym trajectories.
+Example:
+    python train/collect_trajectories.py \
+        --env-url https://dakshdoesdev-sre-gym.hf.space \
+        --scenarios worker_deploy_cascade,db_config_rollout,gateway_auth_rollout \
+        --models claude-sonnet-4-6,claude-haiku-4-5-20251001 \
+        --episodes-per-model 1000 \
+        --parallelism 20 \
+        --output data/trajectories.jsonl
+`--episodes-per-model` is total episodes per model across the resolved scenario
+set. Scenario assignment is round-robin so every requested scenario receives
+coverage over a long run.
+"""
+from __future__ import annotations
+import argparse
+import asyncio
+import json
+import os
+import sys
+import time
+import uuid
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+import httpx
+REPO_ROOT = Path(__file__).resolve().parents[1]
+if str(REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(REPO_ROOT))
+try:
+    from anthropic import AsyncAnthropic
+except ImportError:  # pragma: no cover - handled at runtime in anthropic mode
+    AsyncAnthropic = None  # type: ignore[assignment]
+from unified_incident_env.client import UnifiedIncidentEnv
+from unified_incident_env.models import UnifiedIncidentAction, UnifiedIncidentObservation
+from unified_incident_env.server.challenge import SCENARIOS, SUPPORTED_DIFFICULTIES
+SYSTEM_PROMPT = (
+    "You are collecting trajectories for a deterministic SRE incident benchmark.\n"
+    "Return exactly one JSON object and nothing else.\n"
+    "Choose only from the allowed action types shown in the prompt.\n"
+    "Use only the required fields for the chosen action.\n"
+    "Do not include markdown, prose, or code fences."
+)
+METRIC_OPTIONS = ("cpu", "error_rate", "latency")
+CHECK_OPTIONS = ("database_recovery", "end_to_end")
+ROOT_CAUSE_OPTIONS = (
+    "bad_worker_deploy",
+    "database_only_failure",
+    "api_gateway_fault",
+)
+@dataclass(frozen=True)
+class EpisodeJob:
+    model: str
+    scenario_id: str
+    ordinal: int
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description=__doc__.splitlines()[0])
+    parser.add_argument("--env-url", required=True, help="sre-gym server base URL")
+    parser.add_argument("--scenarios", required=True, help="comma-separated scenario ids, difficulties, or all")
+    parser.add_argument("--models", required=True, help="comma-separated Anthropic model ids")
+    parser.add_argument("--episodes-per-model", type=int, default=1000)
+    parser.add_argument("--parallelism", type=int, default=20)
+    parser.add_argument("--output", required=True, help="output JSONL path")
+    parser.add_argument("--driver", choices=("anthropic", "heuristic"), default="anthropic")
+    parser.add_argument("--anthropic-api-key", default=os.getenv("ANTHROPIC_API_KEY"))
+    parser.add_argument("--anthropic-base-url", default=os.getenv("ANTHROPIC_BASE_URL"))
+    parser.add_argument("--max-tokens", type=int, default=320)
+    parser.add_argument("--env-timeout-s", type=float, default=45.0)
+    parser.add_argument("--anthropic-timeout-s", type=float, default=90.0)
+    parser.add_argument("--max-retries", type=int, default=3)
+    return parser.parse_args()
+def _split_csv(raw: str) -> list[str]:
+    return [token.strip() for token in raw.split(",") if token.strip()]
+def _resolve_scenarios(raw: str) -> list[str]:
+    scenario_ids: list[str] = []
+    for token in _split_csv(raw):
+        if token == "all":
+            scenario_ids.extend(SCENARIOS.keys())
+            continue
+        if token in SUPPORTED_DIFFICULTIES:
+            scenario_ids.extend(
+                scenario_id
+                for scenario_id, scenario in SCENARIOS.items()
+                if scenario["difficulty"] == token
+            )
+            continue
+        if token not in SCENARIOS:
+            raise SystemExit(f"Unknown scenario selector: {token}")
+        scenario_ids.append(token)
+    deduped: list[str] = []
+    seen: set[str] = set()
+    for scenario_id in scenario_ids:
+        if scenario_id not in seen:
+            deduped.append(scenario_id)
+            seen.add(scenario_id)
+    if not deduped:
+        raise SystemExit("No scenarios resolved from --scenarios")
+    return deduped
+def _resolve_models(raw: str) -> list[str]:
+    models = _split_csv(raw)
+    if not models:
+        raise SystemExit("No models resolved from --models")
+    return models
+def _service_order(observation: UnifiedIncidentObservation) -> list[str]:
+    services = list(observation.service_health.items())
+    services.sort(
+        key=lambda item: (
+            item[1].status == "healthy",
+            item[1].status == "isolated",
+            item[1].error_rate_pct,
+            item[1].latency_ms,
+        ),
+        reverse=True,
+    )
+    return [name for name, _payload in services]
+def _default_action_for_type(action_type: str, observation: UnifiedIncidentObservation) -> dict[str, Any]:
+    services = _service_order(observation)
+    service = services[0] if services else "database"
+    if action_type in {"query_logs", "query_dependencies", "query_deploys", "rollback_deploy", "restart_service", "isolate_service"}:
+        return {"action_type": action_type, "service": service}
+    if action_type == "query_metrics":
+        return {"action_type": action_type, "service": service, "metric": "cpu"}
+    if action_type == "run_check":
+        pending_checks = [check.name for check in observation.checks if not check.passed]
+        check_name = pending_checks[0] if pending_checks else "end_to_end"
+        return {"action_type": action_type, "check_name": check_name}
+    if action_type == "submit_hypothesis":
+        return {
+            "action_type": "submit_hypothesis",
+            "hypothesis": {
+                "root_cause": ROOT_CAUSE_OPTIONS[0],
+                "affected_services": services[:2] or ["database"],
+                "confidence": 0.5,
+                "recommended_next_action": "query_logs",
+            },
+        }
+    return {"action_type": action_type}
+def _build_fallback_action(observation: UnifiedIncidentObservation) -> UnifiedIncidentAction:
+    pending_checks = [check.name for check in observation.checks if not check.passed]
+    if observation.workflow_stage == "validation" and pending_checks:
+        return UnifiedIncidentAction(action_type="run_check", check_name=pending_checks[0])
+    if observation.workflow_stage == "validation" and not pending_checks:
+        return UnifiedIncidentAction(action_type="declare_resolved")
+    if observation.workflow_stage == "mitigation":
+        services = _service_order(observation)
+        service = services[0] if services else "database"
+        if "rollback_deploy" in observation.allowed_actions:
+            return UnifiedIncidentAction(action_type="rollback_deploy", service=service)
+        if "restart_service" in observation.allowed_actions:
+            return UnifiedIncidentAction(action_type="restart_service", service=service)
+    if "query_logs" in observation.allowed_actions:
+        services = _service_order(observation)
+        service = services[0] if services else "database"
+        return UnifiedIncidentAction(action_type="query_logs", service=service)
+    if "query_deploys" in observation.allowed_actions:
+        services = _service_order(observation)
+        service = services[0] if services else "database"
+        return UnifiedIncidentAction(action_type="query_deploys", service=service)
+    action_type = observation.allowed_actions[0]
+    return UnifiedIncidentAction(**_default_action_for_type(action_type, observation))
+def _extract_json_object(raw_text: str) -> str:
+    text = raw_text.strip()
+    if "```" in text:
+        parts = text.split("```")
+        if len(parts) >= 2:
+            text = parts[1]
+            if text.startswith("json"):
+                text = text[4:]
+    start = text.find("{")
+    end = text.rfind("}")
+    if start != -1 and end != -1 and start < end:
+        return text[start : end + 1].strip()
+    return text
+def _parse_action(raw_text: str, observation: UnifiedIncidentObservation) -> UnifiedIncidentAction | None:
+    candidate = _extract_json_object(raw_text)
+    if not candidate:
+        return None
+    try:
+        payload = json.loads(candidate)
+    except Exception:
+        return None
+    if not isinstance(payload, dict):
+        return None
+    if "action" in payload and "action_type" not in payload and isinstance(payload["action"], str):
+        payload["action_type"] = payload.pop("action")
+    if payload.get("action_type") not in observation.allowed_actions:
+        return None
+    try:
+        return UnifiedIncidentAction(**payload)
+    except Exception:
+        return None
+def _build_user_prompt(observation: UnifiedIncidentObservation) -> str:
+    required_lines = []
+    for action_name, fields in observation.required_fields_by_action.items():
+        required_lines.append(
+            f"- {action_name}: {', '.join(fields) if fields else '(no extra fields)'}"
+        )
+    service_names = ", ".join(sorted(observation.service_health))
+    return (
+        f"{observation.prompt_text}\n\n"
+        "JSON_RESPONSE_RULES:\n"
+        "- Return exactly one JSON object.\n"
+        "- Use only an allowed action_type.\n"
+        "- Include only the fields required for that action.\n"
+        f"- service must be one of: {service_names}\n"
+        f"- metric must be one of: {', '.join(METRIC_OPTIONS)}\n"
+        f"- check_name must be one of: {', '.join(CHECK_OPTIONS)}\n"
+        f"- hypothesis.root_cause must be one of: {', '.join(ROOT_CAUSE_OPTIONS)}\n"
+        "- hypothesis must include root_cause, affected_services, confidence, and recommended_next_action.\n"
+        "- Noise alerts are decoys; querying them hurts score.\n\n"
+        "REQUIRED_FIELDS_BY_ACTION:\n"
+        + "\n".join(required_lines)
+    )
+def _extract_text_response(message: Any) -> str:
+    parts = []
+    for block in getattr(message, "content", []):
+        if getattr(block, "type", "") == "text":
+            parts.append(getattr(block, "text", ""))
+    return "".join(parts).strip()
+async def _request_model_output(
+    *,
+    driver: str,
+    anthropic_client: Any,
+    model: str,
+    prompt: str,
+    fallback_action: UnifiedIncidentAction,
+    max_tokens: int,
+    max_retries: int,
+) -> tuple[str, str | None]:
+    if driver == "heuristic":
+        return json.dumps(fallback_action.model_dump(exclude_none=True), separators=(",", ":")), "heuristic_driver"
+    last_error: str | None = None
+    for attempt in range(1, max_retries + 1):
+        try:
+            message = await anthropic_client.messages.create(
+                model=model,
+                max_tokens=max_tokens,
+                temperature=0.0,
+                system=SYSTEM_PROMPT,
+                messages=[{"role": "user", "content": prompt}],
+            )
+            text = _extract_text_response(message)
+            if text:
+                return text, None
+            last_error = "empty_text_response"
+        except Exception as exc:  # pragma: no cover - exercised in real collection runs
+            last_error = f"{type(exc).__name__}: {exc}"
+        if attempt < max_retries:
+            await asyncio.sleep(min(2.0 * attempt, 5.0))
+    return json.dumps(fallback_action.model_dump(exclude_none=True), separators=(",", ":")), last_error or "model_request_failed"
+async def _collect_episode(
+    job: EpisodeJob,
+    *,
+    anthropic_client: Any,
+    args: argparse.Namespace,
+) -> dict[str, Any]:
+    trajectory: list[dict[str, Any]] = []
+    started = time.perf_counter()
+    steps = 0
+    async with UnifiedIncidentEnv(base_url=args.env_url) as env:
+        observation = (await env.reset(scenario_id=job.scenario_id, episode_id=str(uuid.uuid4()))).observation
+        while not observation.done:
+            prompt = _build_user_prompt(observation)
+            fallback_action = _build_fallback_action(observation)
+            response_text, driver_note = await _request_model_output(
+                driver=args.driver,
+                anthropic_client=anthropic_client,
+                model=job.model,
+                prompt=prompt,
+                fallback_action=fallback_action,
+                max_tokens=args.max_tokens,
+                max_retries=args.max_retries,
+            )
+            parsed_action = _parse_action(response_text, observation)
+            action = parsed_action or fallback_action
+            next_step = await env.step(action)
+            next_observation = next_step.observation
+            step_failure = next_observation.failure_type
+            if parsed_action is None and driver_note is None:
+                driver_note = "invalid_model_output"
+            if driver_note is not None and action == fallback_action:
+                step_failure = step_failure or driver_note
+            trajectory.append(
+                {
+                    "tick": observation.tick_count,
+                    "prompt": prompt,
+                    "response_text": response_text,
+                    "action": action.model_dump(exclude_none=True),
+                    "reward": float(next_observation.reward),
+                    "tool_output": next_observation.tool_output,
+                    "failure_type": step_failure,
+                    "workflow_stage": next_observation.workflow_stage,
+                }
+            )
+            observation = next_observation
+            steps += 1
+    return {
+        "episode_id": str(uuid.uuid4()),
+        "scenario_id": job.scenario_id,
+        "model": job.model,
+        "final_score": float(observation.final_score),
+        "incident_resolved": bool(observation.incident_resolved),
+        "steps": steps,
+        "elapsed_s": round(time.perf_counter() - started, 4),
+        "trajectory": trajectory,
+    }
+async def _worker(
+    *,
+    name: str,
+    jobs: asyncio.Queue[EpisodeJob],
+    anthropic_client: Any,
+    args: argparse.Namespace,
+    write_lock: asyncio.Lock,
+    output_path: Path,
+    counters: dict[str, int],
+) -> None:
+    while True:
+        job = await jobs.get()
+        try:
+            record = await _collect_episode(
+                job,
+                anthropic_client=anthropic_client,
+                args=args,
+            )
+            async with write_lock:
+                with output_path.open("a", encoding="utf-8") as handle:
+                    handle.write(json.dumps(record))
+                    handle.write("\n")
+            counters["completed"] += 1
+            if record["incident_resolved"]:
+                counters["resolved"] += 1
+            print(
+                f"[{counters['completed']}/{counters['total']}] worker={name} model={job.model} "
+                f"scenario={job.scenario_id} score={record['final_score']:.3f} "
+                f"resolved={str(record['incident_resolved']).lower()} steps={record['steps']}",
+                file=sys.stderr,
+                flush=True,
+            )
+        finally:
+            jobs.task_done()
+async def _run_collection(args: argparse.Namespace) -> None:
+    scenario_ids = _resolve_scenarios(args.scenarios)
+    models = _resolve_models(args.models)
+    if args.driver == "anthropic":
+        if AsyncAnthropic is None:
+            raise SystemExit("anthropic is not installed. Add it via train/requirements-train.txt before running.")
+        if not args.anthropic_api_key:
+            raise SystemExit("ANTHROPIC_API_KEY is required when --driver=anthropic")
+    output_path = Path(args.output)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    if output_path.exists():
+        output_path.unlink()
+    jobs: asyncio.Queue[EpisodeJob] = asyncio.Queue()
+    for model in models:
+        for ordinal in range(args.episodes_per_model):
+            scenario_id = scenario_ids[ordinal % len(scenario_ids)]
+            jobs.put_nowait(EpisodeJob(model=model, scenario_id=scenario_id, ordinal=ordinal))
+    probe_client = httpx.AsyncClient(
+        base_url=args.env_url.rstrip("/"),
+        timeout=httpx.Timeout(args.env_timeout_s),
+        follow_redirects=True,
+    )
+    health = await probe_client.get("/health")
+    health.raise_for_status()
+    await probe_client.aclose()
+    anthropic_http_client = httpx.AsyncClient(
+        timeout=httpx.Timeout(args.anthropic_timeout_s),
+        limits=httpx.Limits(
+            max_connections=max(args.parallelism * 2, 20),
+            max_keepalive_connections=max(args.parallelism, 10),
+        ),
+        follow_redirects=True,
+    )
+    anthropic_client = None
+    if args.driver == "anthropic":
+        anthropic_client = AsyncAnthropic(
+            api_key=args.anthropic_api_key,
+            base_url=args.anthropic_base_url or None,
+            http_client=anthropic_http_client,
+        )
+    write_lock = asyncio.Lock()
+    counters = {
+        "completed": 0,
+        "resolved": 0,
+        "total": jobs.qsize(),
+    }
+    workers = [
+        asyncio.create_task(
+            _worker(
+                name=f"w{index + 1}",
+                jobs=jobs,
+                anthropic_client=anthropic_client,
+                args=args,
+                write_lock=write_lock,
+                output_path=output_path,
+                counters=counters,
+            )
+        )
+        for index in range(min(args.parallelism, counters["total"]))
+    ]
+    try:
+        await jobs.join()
+    finally:
+        for worker in workers:
+            worker.cancel()
+        await asyncio.gather(*workers, return_exceptions=True)
+        await anthropic_http_client.aclose()
+    success_rate = counters["resolved"] / counters["total"] if counters["total"] else 0.0
+    print(
+        f"completed={counters['completed']} resolved={counters['resolved']} "
+        f"success_rate={success_rate:.3f} output={output_path}",
+        file=sys.stderr,
+        flush=True,
+    )
+def main() -> None:
+    args = parse_args()
+    asyncio.run(_run_collection(args))
+if __name__ == "__main__":
+    main()

train/requirements-train.txt ADDED Viewed

	@@ -0,0 +1,18 @@

+# Pinned training-stack deps for the sanity_run.ipynb Colab notebook.
+#
+# Qwen3.5 4B support is still maturing in Unsloth; the version range below
+# reflects what landed in their main branch as of 2026-04. If Qwen3.5 4B
+# fails to load tonight, fall back to Qwen3 4B by changing MODEL_NAME in the
+# notebook — no other change needed.
+unsloth>=2025.12,<2026.06
+unsloth_zoo>=2025.12,<2026.06
+trl>=0.12.0,<0.16.0
+transformers>=4.48.0,<4.60.0
+accelerate>=1.2.0,<2.0.0
+peft>=0.14.0,<0.20.0
+datasets>=3.0.0,<4.0.0
+wandb>=0.18.0,<1.0.0
+bitsandbytes>=0.45.0
+httpx>=0.27.0
+anthropic>=0.97.0,<1.0.0

train/sanity_run.ipynb ADDED Viewed

	@@ -0,0 +1,326 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# sre-gym — Training pipeline sanity run\n",
+    "\n",
+    "Purpose: verify the Colab+Unsloth+TRL+wandb pipeline compiles and runs end-to-end on an A100 *before* the hackathon. This notebook is not meant to train anything useful. It runs 200 SFT steps on a tiny hand-made dataset and saves a checkpoint.\n",
+    "\n",
+    "What a successful run looks like:\n",
+    "1. All deps install without version conflicts\n",
+    "2. `Qwen3.5-4B-Instruct` (or `Qwen3-4B-Instruct` fallback) loads in 4-bit via Unsloth\n",
+    "3. 200 steps of LoRA SFT run without OOM on A100 40GB\n",
+    "4. `wandb` logs show loss decreasing\n",
+    "5. Checkpoint is saved to `/content/sanity_ckpt/`\n",
+    "\n",
+    "Friday work: real dataset (2000+ Claude-driven trajectories), 2000+ SFT steps, then GRPO."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 0. Colab runtime sanity"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!nvidia-smi\n",
+    "!python -c 'import torch; print(\"torch\", torch.__version__, \"cuda\", torch.cuda.is_available())'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Install deps"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "# Unsloth's Colab install idiom (handles torch/xformers version pinning):\n",
+    "pip install -q --upgrade pip\n",
+    "pip install -q \"unsloth[colab-new]>=2025.12,<2026.06\"\n",
+    "pip install -q \"unsloth_zoo>=2025.12,<2026.06\"\n",
+    "pip install -q \"trl>=0.12,<0.16\" \"transformers>=4.48,<4.60\" \"peft>=0.14,<0.20\" \"accelerate>=1.2,<2.0\"\n",
+    "pip install -q \"datasets>=3.0,<4.0\" \"wandb>=0.18,<1.0\" \"bitsandbytes>=0.45\" httpx"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Config\n",
+    "\n",
+    "If Qwen3.5 4B fails to load, swap `MODEL_NAME` to the Qwen3 4B fallback — no other change needed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "# Primary target (user-selected).\n",
+    "MODEL_NAME = \"unsloth/Qwen3.5-4B-Instruct-bnb-4bit\"\n",
+    "# Fallback if Unsloth can't load Qwen3.5 on Colab tonight.\n",
+    "FALLBACK_MODEL_NAME = \"unsloth/Qwen3-4B-Instruct-bnb-4bit\"\n",
+    "\n",
+    "MAX_SEQ_LENGTH = 4096\n",
+    "LORA_R = 32\n",
+    "LORA_ALPHA = 32\n",
+    "LEARNING_RATE = 2e-4\n",
+    "NUM_STEPS = 200\n",
+    "BATCH_SIZE = 2\n",
+    "GRAD_ACCUM = 4\n",
+    "OUT_DIR = \"/content/sanity_ckpt\"\n",
+    "\n",
+    "WANDB_PROJECT = os.environ.get(\"WANDB_PROJECT\", \"sre-gym-sanity\")\n",
+    "WANDB_RUN_NAME = os.environ.get(\"WANDB_RUN_NAME\", \"qwen35-4b-sft-toy-200\")\n",
+    "\n",
+    "os.environ.setdefault(\"WANDB_MODE\", \"online\")  # flip to \"offline\" if no wandb login\n",
+    "print(f\"Primary model: {MODEL_NAME}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Load model via Unsloth (with fallback)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from unsloth import FastLanguageModel\n",
+    "import torch\n",
+    "\n",
+    "model = None\n",
+    "tokenizer = None\n",
+    "errors = []\n",
+    "\n",
+    "for candidate in (MODEL_NAME, FALLBACK_MODEL_NAME):\n",
+    "    try:\n",
+    "        print(f\"Attempting to load: {candidate}\")\n",
+    "        model, tokenizer = FastLanguageModel.from_pretrained(\n",
+    "            model_name=candidate,\n",
+    "            max_seq_length=MAX_SEQ_LENGTH,\n",
+    "            dtype=None,  # let Unsloth pick\n",
+    "            load_in_4bit=True,\n",
+    "        )\n",
+    "        MODEL_NAME = candidate\n",
+    "        print(f\"Loaded {candidate} ok\")\n",
+    "        break\n",
+    "    except Exception as exc:\n",
+    "        errors.append((candidate, repr(exc)))\n",
+    "        print(f\"Load failed for {candidate}: {exc}\")\n",
+    "\n",
+    "if model is None:\n",
+    "    raise RuntimeError(\n",
+    "        \"Both Qwen3.5 4B and Qwen3 4B failed to load via Unsloth. \"\n",
+    "        \"Investigate Unsloth version mismatch before Friday. Errors: \" + str(errors)\n",
+    "    )\n",
+    "\n",
+    "model = FastLanguageModel.get_peft_model(\n",
+    "    model,\n",
+    "    r=LORA_R,\n",
+    "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
+    "    lora_alpha=LORA_ALPHA,\n",
+    "    lora_dropout=0.0,\n",
+    "    bias=\"none\",\n",
+    "    use_gradient_checkpointing=\"unsloth\",\n",
+    "    random_state=42,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Toy training dataset (hand-made, 10 examples)\n",
+    "\n",
+    "These are derived from the 3 deterministic baseline trajectories. Purpose: exercise the tokenize+forward+backward+optimizer path. Not intended to generalize."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "from datasets import Dataset\n",
+    "\n",
+    "SYSTEM = 'You are an SRE agent. Respond with one UnifiedIncidentAction JSON object on each turn.'\n",
+    "\n",
+    "TOY_EXAMPLES = [\n",
+    "    (\"worker_deploy_cascade tick 1 — DB crashed, worker degraded, recent worker deploy\",\n",
+    "     '{\"action_type\":\"query_deploys\",\"service\":\"worker\"}'),\n",
+    "    (\"worker_deploy_cascade tick 2 — saw worker@2026.04.23-bad 12m ago\",\n",
+    "     '{\"action_type\":\"query_logs\",\"service\":\"worker\"}'),\n",
+    "    (\"worker_deploy_cascade tick 3 — confirmed worker-originated harm\",\n",
+    "     '{\"action_type\":\"rollback_deploy\",\"service\":\"worker\"}'),\n",
+    "    (\"worker_deploy_cascade tick 4 — worker healthy, DB still crashed\",\n",
+    "     '{\"action_type\":\"restart_service\",\"service\":\"database\"}'),\n",
+    "    (\"worker_deploy_cascade tick 5 — all services healthy, checks pending\",\n",
+    "     '{\"action_type\":\"run_check\",\"check_name\":\"end_to_end\"}'),\n",
+    "    (\"db_config_rollout tick 1 — db degraded, worker decoy, pool-acquire errors\",\n",
+    "     '{\"action_type\":\"query_deploys\",\"service\":\"database\"}'),\n",
+    "    (\"db_config_rollout tick 2 — saw db@2026.04.24-cfg lowering pool to 12\",\n",
+    "     '{\"action_type\":\"rollback_deploy\",\"service\":\"database\"}'),\n",
+    "    (\"gateway_auth_rollout tick 1 — gateway 40% 401s, auth rollout 9m ago\",\n",
+    "     '{\"action_type\":\"query_deploys\",\"service\":\"api-gateway\"}'),\n",
+    "    (\"gateway_auth_rollout tick 2 — confirmed gateway@2026.04.24-auth is cause\",\n",
+    "     '{\"action_type\":\"rollback_deploy\",\"service\":\"api-gateway\"}'),\n",
+    "    (\"gateway_auth_rollout tick 3 — gateway healthy, verify end-to-end\",\n",
+    "     '{\"action_type\":\"run_check\",\"check_name\":\"end_to_end\"}'),\n",
+    "]\n",
+    "\n",
+    "def _format(example):\n",
+    "    prompt, action = example\n",
+    "    messages = [\n",
+    "        {\"role\": \"system\", \"content\": SYSTEM},\n",
+    "        {\"role\": \"user\", \"content\": prompt},\n",
+    "        {\"role\": \"assistant\", \"content\": action},\n",
+    "    ]\n",
+    "    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)\n",
+    "    return {\"text\": text}\n",
+    "\n",
+    "raw = [_format(ex) for ex in TOY_EXAMPLES]\n",
+    "dataset = Dataset.from_list(raw)\n",
+    "print(f\"toy dataset: {len(dataset)} rows\")\n",
+    "print(\"sample text (first 400 chars):\")\n",
+    "print(dataset[0]['text'][:400])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. SFT training — 200 steps"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from trl import SFTTrainer, SFTConfig\n",
+    "\n",
+    "cfg = SFTConfig(\n",
+    "    output_dir=OUT_DIR,\n",
+    "    per_device_train_batch_size=BATCH_SIZE,\n",
+    "    gradient_accumulation_steps=GRAD_ACCUM,\n",
+    "    warmup_steps=10,\n",
+    "    max_steps=NUM_STEPS,\n",
+    "    learning_rate=LEARNING_RATE,\n",
+    "    fp16=not torch.cuda.is_bf16_supported(),\n",
+    "    bf16=torch.cuda.is_bf16_supported(),\n",
+    "    logging_steps=10,\n",
+    "    save_steps=100,\n",
+    "    save_total_limit=2,\n",
+    "    optim=\"adamw_8bit\",\n",
+    "    weight_decay=0.01,\n",
+    "    lr_scheduler_type=\"linear\",\n",
+    "    seed=42,\n",
+    "    report_to=\"wandb\",\n",
+    "    run_name=WANDB_RUN_NAME,\n",
+    "    max_seq_length=MAX_SEQ_LENGTH,\n",
+    "    dataset_text_field=\"text\",\n",
+    "    packing=False,\n",
+    ")\n",
+    "\n",
+    "os.environ.setdefault(\"WANDB_PROJECT\", WANDB_PROJECT)\n",
+    "\n",
+    "trainer = SFTTrainer(\n",
+    "    model=model,\n",
+    "    tokenizer=tokenizer,\n",
+    "    train_dataset=dataset,\n",
+    "    args=cfg,\n",
+    ")\n",
+    "\n",
+    "trainer_stats = trainer.train()\n",
+    "print(trainer_stats)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Save LoRA adapter + sanity-check inference"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model.save_pretrained(OUT_DIR)\n",
+    "tokenizer.save_pretrained(OUT_DIR)\n",
+    "\n",
+    "from unsloth import FastLanguageModel\n",
+    "FastLanguageModel.for_inference(model)\n",
+    "\n",
+    "test_prompt = 'worker_deploy_cascade tick 1 — DB crashed, worker degraded, recent worker deploy'\n",
+    "messages = [\n",
+    "    {\"role\": \"system\", \"content\": SYSTEM},\n",
+    "    {\"role\": \"user\", \"content\": test_prompt},\n",
+    "]\n",
+    "inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors=\"pt\").to(\"cuda\")\n",
+    "out = model.generate(input_ids=inputs, max_new_tokens=64, temperature=0.0, do_sample=False)\n",
+    "print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Verification checklist\n",
+    "\n",
+    "- [ ] Cell 3 loaded a model without OOM or import errors\n",
+    "- [ ] Cell 4 produced a chat-formatted dataset (no tokenizer errors)\n",
+    "- [ ] Cell 5 ran 200 steps, wandb logged a decreasing loss curve\n",
+    "- [ ] Cell 6 generated a JSON-ish action for the test prompt\n",
+    "- [ ] `/content/sanity_ckpt/adapter_model.safetensors` exists\n",
+    "\n",
+    "If any box is unchecked, debug tonight — do not enter Friday with an unknown failure mode."
+   ]
+  }
+ ],
+ "metadata": {
+  "accelerator": "GPU",
+  "colab": {
+   "gpuType": "A100",
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}

unified_incident_env/models.py CHANGED Viewed

@@ -21,9 +21,23 @@ ActionType = Literal[
     "submit_hypothesis",
     "declare_resolved",
 ]
-Difficulty = Literal["easy"]
 MetricName = Literal["cpu", "error_rate", "latency"]
-ServiceName = Literal["api-gateway", "cache", "database", "worker"]
 ServiceStatus = Literal["healthy", "degraded", "crashed", "isolated"]
 WorkflowStage = Literal["triage", "mitigation", "validation", "resolved"]
 CheckName = Literal["database_recovery", "end_to_end"]
@@ -180,10 +194,13 @@ class UnifiedIncidentObservation(Observation):
     difficulty: Difficulty
     workflow_stage: WorkflowStage
     active_alerts: list[Alert] = Field(default_factory=list)
     service_health: dict[str, ServiceHealth] = Field(default_factory=dict)
     discovered_evidence: list[str] = Field(default_factory=list)
     recent_deploys: list[str] = Field(default_factory=list)
     checks: list[CheckResult] = Field(default_factory=list)
     user_impact: float = Field(ge=0.0, le=1.0)
     slo_burn_rate: float = Field(ge=0.0, le=1.0)
     incident_resolved: bool = False
@@ -222,10 +239,13 @@ class UnifiedIncidentState(State):
     max_ticks: int
     workflow_stage: WorkflowStage
     active_alerts: list[Alert] = Field(default_factory=list)
     service_health: dict[str, ServiceHealth] = Field(default_factory=dict)
     discovered_evidence: list[str] = Field(default_factory=list)
     recent_deploys: list[str] = Field(default_factory=list)
     checks: list[CheckResult] = Field(default_factory=list)
     user_impact: float = Field(ge=0.0, le=1.0)
     slo_burn_rate: float = Field(ge=0.0, le=1.0)
     incident_resolved: bool = False

     "submit_hypothesis",
     "declare_resolved",
 ]
+Difficulty = Literal["easy", "medium", "hard"]
 MetricName = Literal["cpu", "error_rate", "latency"]
+ServiceName = Literal[
+    "api-gateway",
+    "cache",
+    "database",
+    "worker",
+    # Noise-service pool surfaced by scenario.difficulty_knobs. These never
+    # appear in service_health (so agents can't query them through the
+    # action schema), but they do appear in alerts as distractors.
+    "stripe-webhook",
+    "email-queue",
+    "sessions-redis",
+    "image-cdn",
+    "feature-flags",
+    "analytics",
+]
 ServiceStatus = Literal["healthy", "degraded", "crashed", "isolated"]
 WorkflowStage = Literal["triage", "mitigation", "validation", "resolved"]
 CheckName = Literal["database_recovery", "end_to_end"]
     difficulty: Difficulty
     workflow_stage: WorkflowStage
     active_alerts: list[Alert] = Field(default_factory=list)
+    noise_alerts: list[Alert] = Field(default_factory=list)
     service_health: dict[str, ServiceHealth] = Field(default_factory=dict)
     discovered_evidence: list[str] = Field(default_factory=list)
     recent_deploys: list[str] = Field(default_factory=list)
     checks: list[CheckResult] = Field(default_factory=list)
+    blast_radius: int = 0
+    noise_queries: int = 0
     user_impact: float = Field(ge=0.0, le=1.0)
     slo_burn_rate: float = Field(ge=0.0, le=1.0)
     incident_resolved: bool = False
     max_ticks: int
     workflow_stage: WorkflowStage
     active_alerts: list[Alert] = Field(default_factory=list)
+    noise_alerts: list[Alert] = Field(default_factory=list)
     service_health: dict[str, ServiceHealth] = Field(default_factory=dict)
     discovered_evidence: list[str] = Field(default_factory=list)
     recent_deploys: list[str] = Field(default_factory=list)
     checks: list[CheckResult] = Field(default_factory=list)
+    blast_radius: int = 0
+    noise_queries: int = 0
     user_impact: float = Field(ge=0.0, le=1.0)
     slo_burn_rate: float = Field(ge=0.0, le=1.0)
     incident_resolved: bool = False

unified_incident_env/server/app.py CHANGED Viewed

@@ -68,7 +68,7 @@ def create_compatible_app():
         env_factory,
         UnifiedIncidentAction,
         UnifiedIncidentObservation,
-        max_concurrent_envs=1,
     )
     @app.get("/", include_in_schema=False)

         env_factory,
         UnifiedIncidentAction,
         UnifiedIncidentObservation,
+        max_concurrent_envs=int(os.environ.get("MAX_CONCURRENT_ENVS", "32")),
     )
     @app.get("/", include_in_schema=False)

unified_incident_env/server/challenge.py CHANGED Viewed

@@ -3,6 +3,9 @@
 from __future__ import annotations
 from copy import deepcopy
 from typing import Any
 from ..models import (
@@ -15,8 +18,11 @@ from ..models import (
 )
 DEFAULT_SCENARIO_ID = "worker_deploy_cascade"
-SCENARIOS: dict[str, dict[str, Any]] = {
     "worker_deploy_cascade": {
         "id": "worker_deploy_cascade",
         "difficulty": "easy",
@@ -143,9 +149,525 @@ SCENARIOS: dict[str, dict[str, Any]] = {
             "affected_services": ["worker", "database", "api-gateway"],
             "best_next_action": "rollback_deploy",
         },
-    }
 }
 _RUNTIME_PROGRESS: dict[str, Any] | None = None
@@ -155,15 +677,26 @@ def get_scenario(scenario_id: str) -> dict[str, Any]:
     return deepcopy(SCENARIOS[scenario_id])
-def scenario_for_difficulty(difficulty: str) -> dict[str, Any]:
-    for scenario in SCENARIOS.values():
-        if scenario["difficulty"] == difficulty:
-            return deepcopy(scenario)
     raise ValueError(f"Unknown difficulty {difficulty!r}")
-def list_scenarios(difficulty: str | None = None) -> ScenarioCatalog:
-    if difficulty is not None and difficulty != "easy":
         raise ValueError(f"Unknown difficulty {difficulty!r}")
     scenarios = [
         ScenarioSummary(
@@ -175,19 +708,18 @@ def list_scenarios(difficulty: str | None = None) -> ScenarioCatalog:
             optimal_ticks=scenario["optimal_ticks"],
         )
         for scenario in SCENARIOS.values()
-        if difficulty is None or scenario["difficulty"] == difficulty
     ]
     return ScenarioCatalog(
         default_scenario_id=DEFAULT_SCENARIO_ID,
-        available_difficulties=["easy"],
         filtered_difficulty=difficulty,
         scenarios=scenarios,
     )
-def _baseline_actions(scenario_id: str) -> list[BaselineStep]:
-    if scenario_id != DEFAULT_SCENARIO_ID:
-        raise ValueError(f"No baseline for scenario_id {scenario_id!r}")
     return [
         BaselineStep(
             action=UnifiedIncidentAction(action_type="query_deploys", service="worker"),
@@ -240,13 +772,135 @@ def _baseline_actions(scenario_id: str) -> list[BaselineStep]:
     ]
-def list_baselines(scenario_id: str | None = None) -> BaselineCatalog:
-    scenario_ids = [scenario_id] if scenario_id is not None else [DEFAULT_SCENARIO_ID]
     baselines = [
         BaselineDefinition(
             scenario_id=current_id,
             name="deterministic-remediation-baseline",
-            description="Minimal honest baseline that diagnoses from evidence, rolls back the worker, restarts the database, verifies recovery, and then declares resolved.",
             optimal_ticks=SCENARIOS[current_id]["optimal_ticks"],
             actions=_baseline_actions(current_id),
         )

 from __future__ import annotations
 from copy import deepcopy
+import hashlib
+import random
+import re
 from typing import Any
 from ..models import (
 )
 DEFAULT_SCENARIO_ID = "worker_deploy_cascade"
+PROCGEN_VARIANTS_PER_TEMPLATE = 4
+_MINUTES_AGO_RE = re.compile(r"(\d+)\s+minutes ago")
+_ROLLOUT_VERSION_RE = re.compile(r"(@\d{4}\.\d{2}\.\d{2}-)([a-z0-9-]+)")
+_BASE_SCENARIOS: dict[str, dict[str, Any]] = {
     "worker_deploy_cascade": {
         "id": "worker_deploy_cascade",
         "difficulty": "easy",
             "affected_services": ["worker", "database", "api-gateway"],
             "best_next_action": "rollback_deploy",
         },
+        "remediation_recipe": {
+            "rollback_target": "worker",
+            "restart_target": "database",
+            "isolate_target": "worker",
+            "restart_requires_cause_removed": True,
+            "incident_driver": "worker",
+            "resolution_check": "end_to_end",
+        },
+        "post_rollback_services": {
+            "worker": {"status": "healthy", "cpu_pct": 32.0, "memory_pct": 37.0, "error_rate_pct": 2.0, "latency_ms": 40.0},
+        },
+        "post_rollback_user_impact": 0.55,
+        "post_rollback_slo_burn": 0.58,
+        "post_restart_services": {
+            "database": {"status": "healthy", "cpu_pct": 34.0, "memory_pct": 39.0, "error_rate_pct": 0.0, "latency_ms": 22.0},
+            "api-gateway": {"status": "healthy", "cpu_pct": 28.0, "memory_pct": 31.0, "error_rate_pct": 0.0, "latency_ms": 38.0},
+        },
+        "post_restart_user_impact": 0.14,
+        "post_restart_slo_burn": 0.18,
+        "post_isolate_services": {
+            "worker": {"status": "isolated", "cpu_pct": 8.0, "memory_pct": 18.0, "error_rate_pct": 0.0, "latency_ms": 0.0},
+            "database": {"status": "healthy", "cpu_pct": 41.0, "memory_pct": 46.0, "error_rate_pct": 0.0, "latency_ms": 26.0},
+            "api-gateway": {"status": "degraded", "cpu_pct": 34.0, "memory_pct": 33.0, "error_rate_pct": 7.0, "latency_ms": 91.0},
+        },
+        "post_isolate_user_impact": 0.45,
+        "post_isolate_slo_burn": 0.47,
+        "degraded_services": {
+            "worker": {"status": "degraded", "cpu_pct": 88.0, "memory_pct": 71.0, "error_rate_pct": 19.0, "latency_ms": 420.0},
+            "database": {"status": "crashed", "cpu_pct": 99.0, "memory_pct": 97.0, "error_rate_pct": 100.0, "latency_ms": 0.0},
+            "api-gateway": {"status": "degraded", "cpu_pct": 61.0, "memory_pct": 38.0, "error_rate_pct": 24.0, "latency_ms": 640.0},
+        },
+        "degraded_user_impact": 0.82,
+        "degraded_slo_burn": 0.91,
+        "failure_messages": {
+            "wrong_rollback_target": "Rolling back a service without a causal link wastes time and risk.",
+            "low_value_restart": "Restarting that service is not the safe next remediation step for this incident.",
+            "premature_restart": "Restarting before removing the trigger only causes another crash loop.",
+            "wrong_isolation_target": "Isolating that service does not contain the dominant failure path.",
+        },
+        "difficulty_knobs": {
+            "noise_services": ["stripe-webhook", "email-queue"],
+            "noise_alerts": [
+                {"service": "stripe-webhook", "severity": "warning", "message": "Stripe webhook retry volume slightly elevated (unrelated noise)."},
+                {"service": "email-queue", "severity": "warning", "message": "Email queue depth up 15% on a recurring 6h cycle (unrelated noise)."},
+            ],
+            "noise_logs": {
+                "stripe-webhook": "Webhook retries are within normal diurnal bounds; no payment-path regression.",
+                "email-queue": "Queue depth tracks the usual Monday-evening marketing batch; no regression.",
+            },
+            "blast_radius_budget": 2,
+        },
+    },
+    "db_config_rollout": {
+        "id": "db_config_rollout",
+        "difficulty": "medium",
+        "name": "Database Config Rollout Regression",
+        "description": (
+            "A database config push cut connection pool size and write requests now time out. "
+            "A separate worker deploy landed around the same time and looks suspicious but is not the cause. "
+            "The agent must avoid the decoy, roll back the database config, restart it, and verify recovery."
+        ),
+        "root_cause": "A bad database config rollout shrank the connection pool and is dropping writes.",
+        "optimal_ticks": 10,
+        "max_ticks": 12,
+        "critical_service_weights": {
+            "worker": 0.2,
+            "database": 0.5,
+            "api-gateway": 0.3,
+            "cache": 0.0,
+        },
+        "reward_config": {
+            "step_cost": 0.01,
+            "redundant_action_penalty": 0.02,
+            "unsafe_action_penalty": 0.08,
+            "premature_resolution_penalty": 0.2,
+            "successful_resolution_bonus": 0.25,
+            "hypothesis_bonus_scale": 0.12,
+            "forbidden_reward_sources": [
+                "evidence_discovery",
+                "query_success",
+                "unlock_events",
+                "stage_advancement",
+                "patch_id_selection",
+            ],
+        },
+        "initial_services": {
+            "api-gateway": {
+                "status": "degraded",
+                "cpu_pct": 44.0,
+                "memory_pct": 36.0,
+                "error_rate_pct": 17.0,
+                "latency_ms": 520.0,
+            },
+            "cache": {
+                "status": "healthy",
+                "cpu_pct": 20.0,
+                "memory_pct": 26.0,
+                "error_rate_pct": 0.0,
+                "latency_ms": 15.0,
+            },
+            "database": {
+                "status": "degraded",
+                "cpu_pct": 62.0,
+                "memory_pct": 54.0,
+                "error_rate_pct": 48.0,
+                "latency_ms": 880.0,
+            },
+            "worker": {
+                "status": "degraded",
+                "cpu_pct": 51.0,
+                "memory_pct": 44.0,
+                "error_rate_pct": 12.0,
+                "latency_ms": 310.0,
+            },
+        },
+        "initial_alerts": [
+            {
+                "service": "database",
+                "severity": "critical",
+                "message": "Database connection acquire timeouts at 48% and climbing.",
+            },
+            {
+                "service": "api-gateway",
+                "severity": "warning",
+                "message": "Write-path requests are returning sustained 5xx.",
+            },
+            {
+                "service": "worker",
+                "severity": "warning",
+                "message": "Worker write latency is elevated; retries are climbing.",
+            },
+        ],
+        "logs": {
+            "api-gateway": (
+                "Gateway upstream errors are downstream-driven: writes to the worker path return pool-exhaustion "
+                "errors originating from the database. No gateway deploys recorded in the last 24h."
+            ),
+            "cache": "Cache reads are healthy and unrelated to the current write-path failures.",
+            "database": (
+                "Database logs show 'could not acquire connection' errors immediately after config rollout "
+                "db@2026.04.24-cfg lowered max_connections from 80 to 12."
+            ),
+            "worker": (
+                "Worker logs show retries driven by downstream database pool exhaustion, not local faults. "
+                "Worker code deploy worker@2026.04.24-refactor is unrelated to the pool error signature."
+            ),
+        },
+        "metrics": {
+            "api-gateway": {
+                "error_rate": "Gateway 5xx rate is 17% and matches the database pool-exhaustion windows one-for-one.",
+                "latency": "Gateway p95 climbed to 520ms waiting on database connection acquire.",
+            },
+            "database": {
+                "cpu": "Database CPU is moderate (~62%), so this is not a compute overload pattern.",
+                "error_rate": "Database error rate is 48% and dominated by 'connection acquire timeout'.",
+                "latency": "Database write latency jumped to 880ms after the config rollout.",
+            },
+            "worker": {
+                "cpu": "Worker CPU is 51% — no local overload; retries are reactive.",
+                "error_rate": "Worker errors are retries against the saturated database pool.",
+            },
+        },
+        "dependencies": {
+            "api-gateway": "api-gateway -> worker -> database",
+            "worker": "worker -> database",
+            "database": "database is the terminal dependency; pool exhaustion here starves all upstream writers",
+        },
+        "deploy_history": {
+            "api-gateway": "No gateway deploys in the last 24h.",
+            "cache": "No cache deploys in the last 24h.",
+            "database": "Applied config db@2026.04.24-cfg 15 minutes ago (max_connections 80 -> 12).",
+            "worker": "Rolled out worker@2026.04.24-refactor 22 minutes ago (unrelated code cleanup).",
+        },
+        "checks": {
+            "database_recovery": "Confirms database write latency and pool health are back within SLO.",
+            "end_to_end": "Confirms gateway write-path traffic succeeds end-to-end.",
+        },
+        "truth": {
+            "root_cause": "database_only_failure",
+            "affected_services": ["database", "api-gateway", "worker"],
+            "best_next_action": "rollback_deploy",
+        },
+        "remediation_recipe": {
+            "rollback_target": "database",
+            "restart_target": "database",
+            "isolate_target": None,
+            "restart_requires_cause_removed": True,
+            "incident_driver": "database",
+            "resolution_check": "end_to_end",
+        },
+        "post_rollback_services": {
+            "database": {"status": "degraded", "cpu_pct": 48.0, "memory_pct": 42.0, "error_rate_pct": 6.0, "latency_ms": 120.0},
+        },
+        "post_rollback_user_impact": 0.40,
+        "post_rollback_slo_burn": 0.45,
+        "post_restart_services": {
+            "database": {"status": "healthy", "cpu_pct": 36.0, "memory_pct": 40.0, "error_rate_pct": 0.0, "latency_ms": 26.0},
+            "api-gateway": {"status": "healthy", "cpu_pct": 29.0, "memory_pct": 30.0, "error_rate_pct": 0.0, "latency_ms": 44.0},
+            "worker": {"status": "healthy", "cpu_pct": 33.0, "memory_pct": 36.0, "error_rate_pct": 1.0, "latency_ms": 48.0},
+        },
+        "post_restart_user_impact": 0.10,
+        "post_restart_slo_burn": 0.14,
+        "post_isolate_services": {},
+        "post_isolate_user_impact": 0.70,
+        "post_isolate_slo_burn": 0.75,
+        "degraded_services": {
+            "database": {"status": "degraded", "cpu_pct": 62.0, "memory_pct": 54.0, "error_rate_pct": 48.0, "latency_ms": 880.0},
+            "api-gateway": {"status": "degraded", "cpu_pct": 44.0, "memory_pct": 36.0, "error_rate_pct": 17.0, "latency_ms": 520.0},
+            "worker": {"status": "degraded", "cpu_pct": 51.0, "memory_pct": 44.0, "error_rate_pct": 12.0, "latency_ms": 310.0},
+        },
+        "degraded_user_impact": 0.70,
+        "degraded_slo_burn": 0.78,
+        "failure_messages": {
+            "wrong_rollback_target": "The worker deploy is a decoy; worker errors are reactive to database pool exhaustion.",
+            "low_value_restart": "Restarting that service does not address a database-config regression.",
+            "premature_restart": "Restarting the database before rolling back the config will re-inherit the 12-connection pool and fail again.",
+            "wrong_isolation_target": "Isolation is not useful here: the cause is a config regression, not a runaway service.",
+        },
+        "difficulty_knobs": {
+            "noise_services": ["sessions-redis", "analytics"],
+            "noise_alerts": [
+                {"service": "sessions-redis", "severity": "warning", "message": "Sessions-redis p99 latency nudged up 8ms (unrelated noise)."},
+                {"service": "analytics", "severity": "warning", "message": "Analytics consumer lag up to 45s from baseline 30s (unrelated noise)."},
+            ],
+            "noise_logs": {
+                "sessions-redis": "No errors on sessions-redis; hit ratio stable.",
+                "analytics": "Analytics consumer lag fluctuation consistent with upstream Kafka producer batching, unrelated to current incident.",
+            },
+            "blast_radius_budget": 2,
+        },
+    },
+    "gateway_auth_rollout": {
+        "id": "gateway_auth_rollout",
+        "difficulty": "hard",
+        "name": "Gateway Auth Rollout Regression",
+        "description": (
+            "A new api-gateway auth-middleware rollout is rejecting ~40% of valid logins. "
+            "A recent worker deploy and elevated worker queue depth make the worker look like a plausible suspect. "
+            "The agent must localize to the gateway, roll back its deploy, and verify recovery without unnecessary restarts."
+        ),
+        "root_cause": "A bad api-gateway auth-middleware rollout is rejecting valid logins.",
+        "optimal_ticks": 8,
+        "max_ticks": 10,
+        "critical_service_weights": {
+            "worker": 0.15,
+            "database": 0.15,
+            "api-gateway": 0.70,
+            "cache": 0.0,
+        },
+        "reward_config": {
+            "step_cost": 0.01,
+            "redundant_action_penalty": 0.02,
+            "unsafe_action_penalty": 0.12,
+            "premature_resolution_penalty": 0.3,
+            "successful_resolution_bonus": 0.3,
+            "hypothesis_bonus_scale": 0.12,
+            "forbidden_reward_sources": [
+                "evidence_discovery",
+                "query_success",
+                "unlock_events",
+                "stage_advancement",
+                "patch_id_selection",
+            ],
+        },
+        "initial_services": {
+            "api-gateway": {
+                "status": "degraded",
+                "cpu_pct": 38.0,
+                "memory_pct": 42.0,
+                "error_rate_pct": 41.0,
+                "latency_ms": 180.0,
+            },
+            "cache": {
+                "status": "healthy",
+                "cpu_pct": 17.0,
+                "memory_pct": 23.0,
+                "error_rate_pct": 0.0,
+                "latency_ms": 12.0,
+            },
+            "database": {
+                "status": "healthy",
+                "cpu_pct": 38.0,
+                "memory_pct": 41.0,
+                "error_rate_pct": 1.0,
+                "latency_ms": 28.0,
+            },
+            "worker": {
+                "status": "degraded",
+                "cpu_pct": 63.0,
+                "memory_pct": 48.0,
+                "error_rate_pct": 4.0,
+                "latency_ms": 220.0,
+            },
+        },
+        "initial_alerts": [
+            {
+                "service": "api-gateway",
+                "severity": "critical",
+                "message": "Gateway is returning 401 on ~40% of valid login attempts.",
+            },
+            {
+                "service": "worker",
+                "severity": "warning",
+                "message": "Worker queue depth is elevated from the retry storm upstream.",
+            },
+        ],
+        "logs": {
+            "api-gateway": (
+                "Gateway logs show auth-middleware rejecting tokens with valid signatures. "
+                "Rejection rate started exactly at the gateway@2026.04.24-auth rollout boundary."
+            ),
+            "cache": "Cache hit ratio stable and unrelated.",
+            "database": "Database logs are clean; no increase in errors or latency.",
+            "worker": (
+                "Worker logs show client-side retry storms triggered by upstream 401s, not local faults. "
+                "Worker deploy worker@2026.04.24-hotfix is a log-format tweak and does not touch auth."
+            ),
+        },
+        "metrics": {
+            "api-gateway": {
+                "error_rate": "Gateway error rate is 41%, dominated by 401 responses (auth failures).",
+                "latency": "Gateway latency is normal — errors are fast rejections, not timeouts.",
+            },
+            "database": {
+                "cpu": "Database CPU is 38% (normal).",
+                "error_rate": "Database error rate is ~1% and flat.",
+            },
+            "worker": {
+                "cpu": "Worker CPU is 63% from retry volume, not workload.",
+                "error_rate": "Worker errors are reactive retries, not primary failures.",
+            },
+        },
+        "dependencies": {
+            "api-gateway": "api-gateway -> (auth) -> worker -> database",
+            "worker": "worker -> database",
+            "database": "database is healthy; it is not on the fault path",
+        },
+        "deploy_history": {
+            "api-gateway": "Rolled out gateway@2026.04.24-auth 9 minutes ago (auth middleware rewrite).",
+            "cache": "No cache deploys in the last 24h.",
+            "database": "No database deploys in the last 24h.",
+            "worker": "Rolled out worker@2026.04.24-hotfix 18 minutes ago (log-format tweak, no auth changes).",
+        },
+        "checks": {
+            "database_recovery": "Confirms the database is healthy (always healthy in this scenario).",
+            "end_to_end": "Confirms gateway login traffic succeeds end-to-end.",
+        },
+        "truth": {
+            "root_cause": "api_gateway_fault",
+            "affected_services": ["api-gateway", "worker"],
+            "best_next_action": "rollback_deploy",
+        },
+        "remediation_recipe": {
+            "rollback_target": "api-gateway",
+            "restart_target": None,
+            "isolate_target": "api-gateway",
+            "restart_requires_cause_removed": True,
+            "incident_driver": "api-gateway",
+            "resolution_check": "end_to_end",
+        },
+        "post_rollback_services": {
+            "api-gateway": {"status": "healthy", "cpu_pct": 30.0, "memory_pct": 34.0, "error_rate_pct": 1.0, "latency_ms": 38.0},
+            "worker": {"status": "healthy", "cpu_pct": 34.0, "memory_pct": 36.0, "error_rate_pct": 1.0, "latency_ms": 52.0},
+        },
+        "post_rollback_user_impact": 0.12,
+        "post_rollback_slo_burn": 0.18,
+        "post_restart_services": {},
+        "post_restart_user_impact": 0.12,
+        "post_restart_slo_burn": 0.18,
+        "post_isolate_services": {
+            "api-gateway": {"status": "isolated", "cpu_pct": 6.0, "memory_pct": 14.0, "error_rate_pct": 0.0, "latency_ms": 0.0},
+        },
+        "post_isolate_user_impact": 0.55,
+        "post_isolate_slo_burn": 0.60,
+        "degraded_services": {
+            "api-gateway": {"status": "degraded", "cpu_pct": 38.0, "memory_pct": 42.0, "error_rate_pct": 41.0, "latency_ms": 180.0},
+            "worker": {"status": "degraded", "cpu_pct": 63.0, "memory_pct": 48.0, "error_rate_pct": 4.0, "latency_ms": 220.0},
+        },
+        "degraded_user_impact": 0.65,
+        "degraded_slo_burn": 0.72,
+        "failure_messages": {
+            "wrong_rollback_target": "The worker deploy is a log-format tweak and is not on the auth fault path.",
+            "low_value_restart": "Restarting a service does not fix a config/middleware regression rolled out as a deploy.",
+            "premature_restart": "Restarting before rolling back the gateway auth change just restarts the same bad middleware.",
+            "wrong_isolation_target": "Isolating workers or database cuts healthy traffic without fixing the gateway auth fault.",
+        },
+        "difficulty_knobs": {
+            "noise_services": ["stripe-webhook", "image-cdn", "feature-flags"],
+            "noise_alerts": [
+                {"service": "stripe-webhook", "severity": "warning", "message": "Stripe webhook signing drift warning — known benign noise from clock skew."},
+                {"service": "image-cdn", "severity": "warning", "message": "Image CDN purge lag on asia-east1 edge (unrelated noise)."},
+                {"service": "feature-flags", "severity": "warning", "message": "Feature-flags subscriber reconnected after routine rotation (unrelated noise)."},
+            ],
+            "noise_logs": {
+                "stripe-webhook": "Webhook signature log shows no delivery failures; flagged warnings are clock-skew benign.",
+                "image-cdn": "CDN purge lag is within published SLA; no customer-visible impact.",
+                "feature-flags": "Feature-flags consumer reconnect logs are routine rotation; no delivery loss.",
+            },
+            "blast_radius_budget": 1,
+        },
+    },
 }
+def _stable_rng(*parts: object) -> random.Random:
+    seed_material = "::".join(str(part) for part in parts)
+    digest = hashlib.sha256(seed_material.encode("utf-8")).hexdigest()
+    return random.Random(int(digest[:16], 16))
+def _clamp(value: float, lower: float, upper: float) -> float:
+    return max(lower, min(upper, value))
+def _jitter_metric(value: float, *, rng: random.Random, spread: float, floor: float = 0.0, ceil: float = 100.0) -> float:
+    if value == 0.0:
+        return 0.0
+    delta = value * rng.uniform(-spread, spread)
+    return round(_clamp(value + delta, floor, ceil), 1)
+def _jitter_latency(value: float, *, rng: random.Random, spread: float) -> float:
+    if value == 0.0:
+        return 0.0
+    delta = value * rng.uniform(-spread, spread)
+    return round(max(0.0, value + delta), 1)
+def _mutate_service_table(table: dict[str, dict[str, Any]], *, rng: random.Random, spread: float) -> dict[str, dict[str, Any]]:
+    mutated: dict[str, dict[str, Any]] = {}
+    for service_name, payload in table.items():
+        item = dict(payload)
+        item["cpu_pct"] = _jitter_metric(float(item["cpu_pct"]), rng=rng, spread=spread)
+        item["memory_pct"] = _jitter_metric(float(item["memory_pct"]), rng=rng, spread=spread)
+        item["error_rate_pct"] = _jitter_metric(float(item["error_rate_pct"]), rng=rng, spread=spread)
+        item["latency_ms"] = _jitter_latency(float(item["latency_ms"]), rng=rng, spread=spread)
+        mutated[service_name] = item
+    return mutated
+def _mutate_deploy_text(text: str, *, rng: random.Random, service: str) -> str:
+    age_minutes = rng.randint(6, 28)
+    rollout_suffix = f"{service[:3]}{rng.randint(11, 98)}"
+    updated = _MINUTES_AGO_RE.sub(f"{age_minutes} minutes ago", text, count=1)
+    return _ROLLOUT_VERSION_RE.sub(rf"\1{rollout_suffix}", updated, count=1)
+def _mutate_noise_knobs(knobs: dict[str, Any], *, rng: random.Random, variant_index: int) -> dict[str, Any]:
+    mutated = deepcopy(knobs)
+    noise_services = list(mutated.get("noise_services", []))
+    if not noise_services:
+        return mutated
+    rotation = variant_index % len(noise_services)
+    rotated_services = noise_services[rotation:] + noise_services[:rotation]
+    alert_pool = {item["service"]: dict(item) for item in mutated.get("noise_alerts", [])}
+    log_pool = dict(mutated.get("noise_logs", {}))
+    selected_count = min(len(rotated_services), max(1, 1 + (variant_index % len(rotated_services))))
+    selected_services = rotated_services[:selected_count]
+    mutated["noise_services"] = selected_services
+    mutated["noise_alerts"] = [alert_pool[service] for service in selected_services if service in alert_pool]
+    mutated["noise_logs"] = {service: log_pool[service] for service in selected_services if service in log_pool}
+    return mutated
+def _procgen_variant_id(template_id: str, variant_index: int) -> str:
+    return f"{template_id}__p{variant_index + 1:02d}"
+def _materialize_procgen_variant(template_id: str, template: dict[str, Any], *, variant_index: int) -> dict[str, Any]:
+    rng = _stable_rng(template_id, variant_index)
+    spread_by_difficulty = {
+        "easy": 0.05,
+        "medium": 0.08,
+        "hard": 0.10,
+    }
+    spread = spread_by_difficulty.get(template["difficulty"], 0.06)
+    scenario = deepcopy(template)
+    scenario["id"] = _procgen_variant_id(template_id, variant_index)
+    scenario["template_id"] = template_id
+    scenario["is_procgen"] = True
+    scenario["name"] = f"{template['name']} [procgen {variant_index + 1}]"
+    scenario["description"] = (
+        f"{template['description']} "
+        f"Variant {variant_index + 1} reshuffles timing and distractor noise."
+    )
+    for key in (
+        "initial_services",
+        "degraded_services",
+        "post_rollback_services",
+        "post_restart_services",
+        "post_isolate_services",
+    ):
+        scenario[key] = _mutate_service_table(template.get(key, {}), rng=rng, spread=spread)
+    scenario["deploy_history"] = {
+        service: _mutate_deploy_text(text, rng=rng, service=service)
+        for service, text in template.get("deploy_history", {}).items()
+    }
+    scenario["difficulty_knobs"] = _mutate_noise_knobs(template.get("difficulty_knobs", {}), rng=rng, variant_index=variant_index)
+    return scenario
+def _build_scenarios() -> dict[str, dict[str, Any]]:
+    catalog: dict[str, dict[str, Any]] = {}
+    for template_id, scenario in _BASE_SCENARIOS.items():
+        catalog[template_id] = deepcopy(scenario)
+        catalog[template_id]["template_id"] = template_id
+        catalog[template_id]["is_procgen"] = False
+        for variant_index in range(PROCGEN_VARIANTS_PER_TEMPLATE):
+            variant = _materialize_procgen_variant(
+                template_id,
+                catalog[template_id],
+                variant_index=variant_index,
+            )
+            catalog[variant["id"]] = variant
+    return catalog
+SCENARIOS: dict[str, dict[str, Any]] = _build_scenarios()
 _RUNTIME_PROGRESS: dict[str, Any] | None = None
     return deepcopy(SCENARIOS[scenario_id])
+SUPPORTED_DIFFICULTIES: tuple[str, ...] = ("easy", "medium", "hard")
+def scenario_for_difficulty(difficulty: str, seed: int | None = None) -> dict[str, Any]:
+    matches = [
+        scenario
+        for scenario in SCENARIOS.values()
+        if scenario["difficulty"] == difficulty
+    ]
+    if seed is None:
+        for scenario in matches:
+            if not scenario.get("is_procgen", False):
+                return deepcopy(scenario)
+    if matches:
+        return deepcopy(matches[(seed or 0) % len(matches)])
     raise ValueError(f"Unknown difficulty {difficulty!r}")
+def list_scenarios(difficulty: str | None = None, include_procgen: bool = True) -> ScenarioCatalog:
+    if difficulty is not None and difficulty not in SUPPORTED_DIFFICULTIES:
         raise ValueError(f"Unknown difficulty {difficulty!r}")
     scenarios = [
         ScenarioSummary(
             optimal_ticks=scenario["optimal_ticks"],
         )
         for scenario in SCENARIOS.values()
+        if (difficulty is None or scenario["difficulty"] == difficulty)
+        and (include_procgen or not scenario.get("is_procgen", False))
     ]
     return ScenarioCatalog(
         default_scenario_id=DEFAULT_SCENARIO_ID,
+        available_difficulties=list(SUPPORTED_DIFFICULTIES),
         filtered_difficulty=difficulty,
         scenarios=scenarios,
     )
+def _worker_cascade_baseline() -> list[BaselineStep]:
     return [
         BaselineStep(
             action=UnifiedIncidentAction(action_type="query_deploys", service="worker"),
     ]
+def _db_config_rollout_baseline() -> list[BaselineStep]:
+    return [
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_logs", service="database"),
+            rationale="Database is the loudest alert; inspect logs for the actual error signature.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_deploys", service="database"),
+            rationale="Pool-acquire errors suggest a config change; check recent database rollouts.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_metrics", service="database", metric="error_rate"),
+            rationale="Confirm the error pattern is pool exhaustion rather than compute overload.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_logs", service="worker"),
+            rationale="Rule out the decoy worker deploy by reading worker logs directly.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(
+                action_type="submit_hypothesis",
+                hypothesis={
+                    "root_cause": "database_only_failure",
+                    "affected_services": ["database", "api-gateway", "worker"],
+                    "confidence": 0.8,
+                    "recommended_next_action": "rollback_deploy",
+                },
+            ),
+            rationale="Localize the fault to the database config before remediating.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="rollback_deploy", service="database"),
+            rationale="Roll back the offending database config rollout.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="restart_service", service="database"),
+            rationale="Restart the database cleanly against the restored pool config.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="run_check", check_name="database_recovery"),
+            rationale="Verify database pool health and write latency are back within SLO.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="run_check", check_name="end_to_end"),
+            rationale="Verify gateway write-path traffic succeeds end-to-end.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="declare_resolved"),
+            rationale="Declare resolved only after objective checks pass.",
+        ),
+    ]
+def _gateway_auth_rollout_baseline() -> list[BaselineStep]:
+    return [
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_logs", service="api-gateway"),
+            rationale="Gateway is rejecting logins; read gateway logs to localize the rejection class.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_deploys", service="api-gateway"),
+            rationale="Login rejection aligns with a recent auth middleware rollout; confirm deploy timing.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="query_deploys", service="worker"),
+            rationale="Rule out the worker deploy explicitly rather than assuming.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(
+                action_type="submit_hypothesis",
+                hypothesis={
+                    "root_cause": "api_gateway_fault",
+                    "affected_services": ["api-gateway", "worker"],
+                    "confidence": 0.85,
+                    "recommended_next_action": "rollback_deploy",
+                },
+            ),
+            rationale="Commit a calibrated hypothesis localizing to the gateway auth rollout.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="rollback_deploy", service="api-gateway"),
+            rationale="Roll back the bad auth middleware rollout; no restart needed.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="run_check", check_name="end_to_end"),
+            rationale="Verify that gateway login traffic now succeeds end-to-end.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="run_check", check_name="database_recovery"),
+            rationale="Confirm the database is (and stayed) healthy throughout.",
+        ),
+        BaselineStep(
+            action=UnifiedIncidentAction(action_type="declare_resolved"),
+            rationale="Declare resolved only after objective checks pass.",
+        ),
+    ]
+_BASELINE_BUILDERS = {
+    "worker_deploy_cascade": _worker_cascade_baseline,
+    "db_config_rollout": _db_config_rollout_baseline,
+    "gateway_auth_rollout": _gateway_auth_rollout_baseline,
+}
+def _baseline_actions(scenario_id: str) -> list[BaselineStep]:
+    template_id = SCENARIOS[scenario_id].get("template_id", scenario_id)
+    builder = _BASELINE_BUILDERS.get(template_id)
+    if builder is None:
+        raise ValueError(f"No baseline for scenario_id {scenario_id!r}")
+    return builder()
+def list_baselines(scenario_id: str | None = None, include_procgen: bool = True) -> BaselineCatalog:
+    if scenario_id is not None:
+        if scenario_id not in SCENARIOS:
+            raise ValueError(f"Unknown scenario_id {scenario_id!r}")
+        scenario_ids = [scenario_id]
+    else:
+        scenario_ids = [
+            current_id
+            for current_id, scenario in SCENARIOS.items()
+            if include_procgen or not scenario.get("is_procgen", False)
+        ]
     baselines = [
         BaselineDefinition(
             scenario_id=current_id,
             name="deterministic-remediation-baseline",
+            description=SCENARIOS[current_id]["description"],
             optimal_ticks=SCENARIOS[current_id]["optimal_ticks"],
             actions=_baseline_actions(current_id),
         )

unified_incident_env/server/environment.py CHANGED Viewed

@@ -58,7 +58,7 @@ STATUS_VALUES = {
 class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncidentObservation, UnifiedIncidentState]):
     """A bounded-action incident diagnosis and safe remediation environment."""
-    SUPPORTS_CONCURRENT_SESSIONS = False
     def __init__(self) -> None:
         super().__init__()
@@ -78,13 +78,12 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
         )
     def reset(self, seed: int | None = None, episode_id: str | None = None, **kwargs: Any) -> UnifiedIncidentObservation:
-        del seed
         scenario_id = kwargs.get("scenario_id")
         difficulty = kwargs.get("difficulty")
         if scenario_id:
             scenario = get_scenario(scenario_id)
         elif difficulty:
-            scenario = scenario_for_difficulty(difficulty)
         else:
             scenario = get_scenario(DEFAULT_SCENARIO_ID)
         self._episode = self._make_episode(scenario, episode_id=episode_id)
@@ -204,6 +203,11 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
             "database_recovery": CheckResult(name="database_recovery", passed=False, detail="Database recovery has not been verified yet."),
             "end_to_end": CheckResult(name="end_to_end", passed=False, detail="End-to-end health has not been verified yet."),
         }
         return {
             "episode_id": episode_id or str(uuid.uuid4()),
             "scenario": scenario,
@@ -213,16 +217,16 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
             "difficulty": scenario["difficulty"],
             "services": services,
             "alerts": [Alert(**payload) for payload in scenario["initial_alerts"]],
             "discovered_evidence": [],
             "evidence_seen": set(),
-            "recent_deploys": [scenario["deploy_history"]["worker"]],
             "checks": checks,
-            "user_impact": 0.82,
-            "slo_burn_rate": 0.91,
             "containment_applied": False,
             "cause_removed": False,
-            "worker_isolated": False,
-            "worker_version": "worker@2026.04.23-bad",
             "hypothesis_seen": set(),
             "failure_type": None,
             "why_failed": None,
@@ -233,12 +237,16 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
             "workflow_stage": "triage",
             "cumulative_reward": 0.0,
             "wasteful_ticks": 0,
             "score_breakdown": {
                 "recovery_score": 0.0,
                 "containment_score": 0.0,
                 "verification_score": 0.0,
                 "impact_score": 0.0,
-                "efficiency_score": 0.10,
                 "final_score": 0.10,
             },
             "final_score": 0.10,
@@ -246,20 +254,44 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
             "done": False,
         }
     def _query_logs(self, service: str | None) -> str:
         assert service is not None
         return self._episode["scenario"]["logs"][service]
     def _query_metrics(self, service: str | None, metric: str | None) -> str:
         assert service is not None and metric is not None
         return self._episode["scenario"]["metrics"][service][metric]
     def _query_dependencies(self, service: str | None) -> str:
         assert service is not None
         return self._episode["scenario"]["dependencies"][service]
     def _query_deploys(self, service: str | None) -> str:
         assert service is not None
         return self._episode["scenario"]["deploy_history"][service]
     def _submit_hypothesis(self, action: UnifiedIncidentAction) -> tuple[float, bool, str]:
@@ -280,108 +312,112 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
         reward = (0.04 * cause_match) + (0.03 * service_match) + (0.03 * action_quality) + (0.02 * calibration)
         return round(reward, 4), True, "Hypothesis recorded. Reward reflects root-cause accuracy, service localization, confidence calibration, and next-action quality."
     def _rollback_deploy(self, service: str | None) -> tuple[bool, float, str]:
         assert service is not None
-        if service != "worker":
-            self._set_failure("wrong_remediation_target", "Rolling back a service without a causal link wastes time and risk.")
             return False, self._unsafe_penalty(), f"Rollback on {service} did not address the incident."
         if self._episode["cause_removed"]:
-            return False, 0.0, "Worker deploy is already rolled back."
         self._episode["cause_removed"] = True
         self._episode["containment_applied"] = True
-        self._episode["worker_version"] = "worker@2026.04.23-good"
-        self._episode["services"]["worker"] = ServiceHealth(
-            name="worker",
-            status="healthy",
-            cpu_pct=32.0,
-            memory_pct=37.0,
-            error_rate_pct=2.0,
-            latency_ms=40.0,
-        )
-        self._episode["user_impact"] = min(self._episode["user_impact"], 0.55)
-        self._episode["slo_burn_rate"] = min(self._episode["slo_burn_rate"], 0.58)
-        return True, 0.0, "Rolled back the worker deploy; downstream load should now stabilize after dependent services recover."
     def _restart_service(self, service: str | None) -> tuple[bool, float, str]:
         assert service is not None
-        if service == "database":
-            if not self._episode["cause_removed"]:
-                self._set_failure("premature_restart", "Restarting the database before removing the trigger only causes another crash loop.")
-                return False, self._unsafe_penalty(), "Database restart failed because the worker is still driving overload."
-            self._episode["services"]["database"] = ServiceHealth(
-                name="database",
-                status="healthy",
-                cpu_pct=34.0,
-                memory_pct=39.0,
-                error_rate_pct=0.0,
-                latency_ms=22.0,
             )
-            self._episode["services"]["api-gateway"] = ServiceHealth(
-                name="api-gateway",
-                status="healthy",
-                cpu_pct=28.0,
-                memory_pct=31.0,
-                error_rate_pct=0.0,
-                latency_ms=38.0,
             )
-            self._episode["user_impact"] = 0.14
-            self._episode["slo_burn_rate"] = 0.18
-            return True, 0.0, "Database restarted cleanly after the bad deploy was rolled back."
-        self._set_failure("low_value_restart", f"Restarting {service} is not the safe next remediation step for this incident.")
-        return False, self._unsafe_penalty() / 2, f"Restarting {service} had little or no positive effect."
     def _isolate_service(self, service: str | None) -> tuple[bool, float, str]:
         assert service is not None
-        if service != "worker":
-            self._set_failure("wrong_isolation_target", f"Isolating {service} does not contain the dominant failure path.")
             return False, self._unsafe_penalty() / 2, f"Isolation of {service} did not materially reduce blast radius."
-        if self._episode["worker_isolated"]:
-            return False, 0.0, "Worker is already isolated."
-        self._episode["worker_isolated"] = True
         self._episode["containment_applied"] = True
-        self._episode["services"]["worker"] = ServiceHealth(
-            name="worker",
-            status="isolated",
-            cpu_pct=8.0,
-            memory_pct=18.0,
-            error_rate_pct=0.0,
-            latency_ms=0.0,
-        )
-        self._episode["services"]["database"] = ServiceHealth(
-            name="database",
-            status="healthy",
-            cpu_pct=41.0,
-            memory_pct=46.0,
-            error_rate_pct=0.0,
-            latency_ms=26.0,
-        )
-        self._episode["services"]["api-gateway"] = ServiceHealth(
-            name="api-gateway",
-            status="degraded",
-            cpu_pct=34.0,
-            memory_pct=33.0,
-            error_rate_pct=7.0,
-            latency_ms=91.0,
-        )
-        self._episode["user_impact"] = 0.45
-        self._episode["slo_burn_rate"] = 0.47
-        return True, 0.0, "Worker isolated. Blast radius shrank, but end-to-end service remains degraded until the worker path is restored safely."
     def _run_check(self, check_name: str | None) -> tuple[str, bool, str]:
         assert check_name is not None
         if check_name == "database_recovery":
-            passed = self._episode["services"]["database"].status == "healthy" and self._episode["cause_removed"]
             detail = (
-                "Database is healthy and no longer crashing."
                 if passed
                 else "Database is still unstable or the triggering cause is still present."
             )
         else:
             passed = (
-                self._episode["services"]["database"].status == "healthy"
-                and self._episode["services"]["api-gateway"].status == "healthy"
-                and self._episode["cause_removed"]
-                and not self._episode["worker_isolated"]
             )
             detail = (
                 "End-to-end login traffic is healthy."
@@ -394,7 +430,8 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
     def _declare_resolved(self) -> tuple[bool, float, float, str]:
         checks = self._episode["checks"]
-        safe_to_resolve = checks["database_recovery"].passed and checks["end_to_end"].passed
         if not safe_to_resolve:
             self._set_failure("premature_resolution", "The incident is not verified as resolved yet.")
             return False, self._episode["scenario"]["reward_config"]["premature_resolution_penalty"], 0.0, "Resolution declaration rejected: required checks have not passed."
@@ -417,34 +454,14 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
         self._episode["why_failed"] = why_failed
     def _advance_world(self) -> None:
-        if not self._episode["cause_removed"] and not self._episode["worker_isolated"]:
-            self._episode["services"]["worker"] = ServiceHealth(
-                name="worker",
-                status="degraded",
-                cpu_pct=88.0,
-                memory_pct=71.0,
-                error_rate_pct=19.0,
-                latency_ms=420.0,
-            )
-            self._episode["services"]["database"] = ServiceHealth(
-                name="database",
-                status="crashed",
-                cpu_pct=99.0,
-                memory_pct=97.0,
-                error_rate_pct=100.0,
-                latency_ms=0.0,
-            )
-            self._episode["services"]["api-gateway"] = ServiceHealth(
-                name="api-gateway",
-                status="degraded",
-                cpu_pct=61.0,
-                memory_pct=38.0,
-                error_rate_pct=24.0,
-                latency_ms=640.0,
-            )
-            self._episode["user_impact"] = max(self._episode["user_impact"], 0.82)
-            self._episode["slo_burn_rate"] = max(self._episode["slo_burn_rate"], 0.91)
-        if self._episode["worker_isolated"] and not self._episode["cause_removed"]:
             self._episode["containment_applied"] = True
         self._episode["workflow_stage"] = self._workflow_stage()
@@ -480,7 +497,7 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
         checks = self._episode["checks"]
         if checks["database_recovery"].passed or checks["end_to_end"].passed:
             return "validation"
-        if self._episode["containment_applied"] or self._episode["cause_removed"] or self._episode["worker_isolated"]:
             return "mitigation"
         return "triage"
@@ -498,12 +515,16 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
             "database_recovery": checks["database_recovery"].passed,
             "end_to_end": checks["end_to_end"].passed,
             "incident_resolved": self._episode["incident_resolved"],
         }
     def _incident_summary(self) -> str:
         return (
-            "Gateway login traffic is failing because the worker is overloading the database after a recent worker deploy. "
-            "Use evidence-gathering actions to diagnose, then choose a safe remediation and verify with explicit checks."
         )
     def _prompt_text(self, tool_output: str | None) -> str:
@@ -520,6 +541,10 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
             lines.extend(f"- [{alert.severity.upper()}] {alert.service}: {alert.message}" for alert in self._episode["alerts"])
         else:
             lines.append("- none")
         lines.extend([
             "",
             "SERVICES:",
@@ -568,6 +593,7 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
             "max_ticks": self._episode["max_ticks"],
             "workflow_stage": self._episode["workflow_stage"],
             "active_alerts": [alert.model_dump() for alert in self._episode["alerts"]],
             "service_health": {name: service.model_dump() for name, service in self._episode["services"].items()},
             "discovered_evidence": list(self._episode["discovered_evidence"]),
             "recent_deploys": list(self._episode["recent_deploys"]),
@@ -584,6 +610,8 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
             "score_breakdown": dict(self._episode["score_breakdown"]),
             "cumulative_reward": self._episode["cumulative_reward"],
             "wasteful_ticks": self._episode["wasteful_ticks"],
             "last_action_result": self._episode["last_action_result"],
             "failure_type": self._episode["failure_type"],
             "why_failed": self._episode["why_failed"],
@@ -598,6 +626,7 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
             difficulty=self._episode["difficulty"],
             workflow_stage=self._episode["workflow_stage"],
             active_alerts=list(self._episode["alerts"]),
             service_health=dict(self._episode["services"]),
             discovered_evidence=list(self._episode["discovered_evidence"]),
             recent_deploys=list(self._episode["recent_deploys"]),
@@ -625,4 +654,6 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
             score_breakdown=dict(self._episode["score_breakdown"]),
             reward=round(reward, 4),
             done=done,
         )

 class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncidentObservation, UnifiedIncidentState]):
     """A bounded-action incident diagnosis and safe remediation environment."""
+    SUPPORTS_CONCURRENT_SESSIONS = True
     def __init__(self) -> None:
         super().__init__()
         )
     def reset(self, seed: int | None = None, episode_id: str | None = None, **kwargs: Any) -> UnifiedIncidentObservation:
         scenario_id = kwargs.get("scenario_id")
         difficulty = kwargs.get("difficulty")
         if scenario_id:
             scenario = get_scenario(scenario_id)
         elif difficulty:
+            scenario = scenario_for_difficulty(difficulty, seed=seed)
         else:
             scenario = get_scenario(DEFAULT_SCENARIO_ID)
         self._episode = self._make_episode(scenario, episode_id=episode_id)
             "database_recovery": CheckResult(name="database_recovery", passed=False, detail="Database recovery has not been verified yet."),
             "end_to_end": CheckResult(name="end_to_end", passed=False, detail="End-to-end health has not been verified yet."),
         }
+        recipe = scenario.get("remediation_recipe", {})
+        rollback_target = recipe.get("rollback_target", "worker")
+        recent_deploy_service = rollback_target if rollback_target in scenario["deploy_history"] else "worker"
+        knobs = scenario.get("difficulty_knobs", {})
+        noise_alerts = [Alert(**payload) for payload in knobs.get("noise_alerts", [])]
         return {
             "episode_id": episode_id or str(uuid.uuid4()),
             "scenario": scenario,
             "difficulty": scenario["difficulty"],
             "services": services,
             "alerts": [Alert(**payload) for payload in scenario["initial_alerts"]],
+            "noise_alerts": noise_alerts,
             "discovered_evidence": [],
             "evidence_seen": set(),
+            "recent_deploys": [scenario["deploy_history"].get(recent_deploy_service, "")],
             "checks": checks,
+            "user_impact": scenario.get("degraded_user_impact", 0.82),
+            "slo_burn_rate": scenario.get("degraded_slo_burn", 0.91),
             "containment_applied": False,
             "cause_removed": False,
+            "isolated_service": None,
             "hypothesis_seen": set(),
             "failure_type": None,
             "why_failed": None,
             "workflow_stage": "triage",
             "cumulative_reward": 0.0,
             "wasteful_ticks": 0,
+            "blast_radius": 0,
+            "noise_queries": 0,
             "score_breakdown": {
                 "recovery_score": 0.0,
                 "containment_score": 0.0,
                 "verification_score": 0.0,
                 "impact_score": 0.0,
+                "efficiency_score": 0.05,
+                "speed_bonus": 0.0,
+                "noise_handling_score": 0.05 if knobs.get("noise_services") else 0.0,
                 "final_score": 0.10,
             },
             "final_score": 0.10,
             "done": False,
         }
+    def _noise_knobs(self) -> dict[str, Any]:
+        return self._episode["scenario"].get("difficulty_knobs", {})
+    def _is_noise_service(self, service: str) -> bool:
+        return service in set(self._noise_knobs().get("noise_services", []))
+    def _record_noise_query(self, service: str) -> None:
+        if self._is_noise_service(service):
+            self._episode["noise_queries"] = self._episode.get("noise_queries", 0) + 1
     def _query_logs(self, service: str | None) -> str:
         assert service is not None
+        if self._is_noise_service(service):
+            self._record_noise_query(service)
+            noise_logs = self._noise_knobs().get("noise_logs", {})
+            detail = noise_logs.get(service, f"{service} logs show no incident-correlated regression.")
+            return f"{service}: {detail}"
         return self._episode["scenario"]["logs"][service]
     def _query_metrics(self, service: str | None, metric: str | None) -> str:
         assert service is not None and metric is not None
+        if self._is_noise_service(service):
+            self._record_noise_query(service)
+            return f"{service} {metric} metrics are within ordinary background variance and unrelated to the active incident."
         return self._episode["scenario"]["metrics"][service][metric]
     def _query_dependencies(self, service: str | None) -> str:
         assert service is not None
+        if self._is_noise_service(service):
+            self._record_noise_query(service)
+            return f"{service} is off the primary user-impact path and is not driving the incident."
         return self._episode["scenario"]["dependencies"][service]
     def _query_deploys(self, service: str | None) -> str:
         assert service is not None
+        if self._is_noise_service(service):
+            self._record_noise_query(service)
+            return f"No recent {service} deploy correlates with the active incident timeline."
         return self._episode["scenario"]["deploy_history"][service]
     def _submit_hypothesis(self, action: UnifiedIncidentAction) -> tuple[float, bool, str]:
         reward = (0.04 * cause_match) + (0.03 * service_match) + (0.03 * action_quality) + (0.02 * calibration)
         return round(reward, 4), True, "Hypothesis recorded. Reward reflects root-cause accuracy, service localization, confidence calibration, and next-action quality."
+    def _recipe(self) -> dict[str, Any]:
+        return self._episode["scenario"].get("remediation_recipe", {})
+    def _failure_message(self, key: str, default: str) -> str:
+        return self._episode["scenario"].get("failure_messages", {}).get(key, default)
+    def _apply_service_updates(self, updates: dict[str, dict[str, Any]]) -> None:
+        for name, payload in updates.items():
+            self._episode["services"][name] = ServiceHealth(name=name, **payload)
+    def _bump_blast_radius(self) -> None:
+        self._episode["blast_radius"] = self._episode.get("blast_radius", 0) + 1
     def _rollback_deploy(self, service: str | None) -> tuple[bool, float, str]:
         assert service is not None
+        recipe = self._recipe()
+        rollback_target = recipe.get("rollback_target")
+        if rollback_target is None or service != rollback_target:
+            self._set_failure(
+                "wrong_remediation_target",
+                self._failure_message("wrong_rollback_target", "Rolling back a service without a causal link wastes time and risk."),
+            )
             return False, self._unsafe_penalty(), f"Rollback on {service} did not address the incident."
         if self._episode["cause_removed"]:
+            return False, 0.0, f"{rollback_target} deploy is already rolled back."
         self._episode["cause_removed"] = True
         self._episode["containment_applied"] = True
+        self._bump_blast_radius()
+        self._apply_service_updates(self._episode["scenario"].get("post_rollback_services", {}))
+        scenario = self._episode["scenario"]
+        self._episode["user_impact"] = min(self._episode["user_impact"], scenario.get("post_rollback_user_impact", self._episode["user_impact"]))
+        self._episode["slo_burn_rate"] = min(self._episode["slo_burn_rate"], scenario.get("post_rollback_slo_burn", self._episode["slo_burn_rate"]))
+        return True, 0.0, f"Rolled back the {rollback_target} deploy; the underlying cause is removed."
     def _restart_service(self, service: str | None) -> tuple[bool, float, str]:
         assert service is not None
+        recipe = self._recipe()
+        restart_target = recipe.get("restart_target")
+        if restart_target is None or service != restart_target:
+            self._set_failure(
+                "low_value_restart",
+                self._failure_message("low_value_restart", f"Restarting {service} is not the safe next remediation step for this incident."),
             )
+            return False, self._unsafe_penalty() / 2, f"Restarting {service} had little or no positive effect."
+        if recipe.get("restart_requires_cause_removed", True) and not self._episode["cause_removed"]:
+            self._set_failure(
+                "premature_restart",
+                self._failure_message("premature_restart", f"Restarting {service} before removing the trigger only causes another failure."),
             )
+            return False, self._unsafe_penalty(), f"Restart of {service} failed because the triggering cause is still present."
+        self._bump_blast_radius()
+        self._apply_service_updates(self._episode["scenario"].get("post_restart_services", {}))
+        scenario = self._episode["scenario"]
+        self._episode["user_impact"] = scenario.get("post_restart_user_impact", self._episode["user_impact"])
+        self._episode["slo_burn_rate"] = scenario.get("post_restart_slo_burn", self._episode["slo_burn_rate"])
+        return True, 0.0, f"{service} restarted cleanly after the triggering cause was removed."
     def _isolate_service(self, service: str | None) -> tuple[bool, float, str]:
         assert service is not None
+        recipe = self._recipe()
+        isolate_target = recipe.get("isolate_target")
+        if isolate_target is None or service != isolate_target:
+            self._set_failure(
+                "wrong_isolation_target",
+                self._failure_message("wrong_isolation_target", f"Isolating {service} does not contain the dominant failure path."),
+            )
             return False, self._unsafe_penalty() / 2, f"Isolation of {service} did not materially reduce blast radius."
+        if self._episode["isolated_service"] == isolate_target:
+            return False, 0.0, f"{isolate_target} is already isolated."
+        self._episode["isolated_service"] = isolate_target
         self._episode["containment_applied"] = True
+        self._bump_blast_radius()
+        self._apply_service_updates(self._episode["scenario"].get("post_isolate_services", {}))
+        scenario = self._episode["scenario"]
+        self._episode["user_impact"] = scenario.get("post_isolate_user_impact", self._episode["user_impact"])
+        self._episode["slo_burn_rate"] = scenario.get("post_isolate_slo_burn", self._episode["slo_burn_rate"])
+        return True, 0.0, f"{isolate_target} isolated. Blast radius shrank, but full resolution still requires addressing the root cause."
     def _run_check(self, check_name: str | None) -> tuple[str, bool, str]:
         assert check_name is not None
+        recipe = self._recipe()
+        isolated = self._episode["isolated_service"]
+        cause_removed = self._episode["cause_removed"]
+        services = self._episode["services"]
         if check_name == "database_recovery":
+            db_healthy = services["database"].status == "healthy"
+            incident_driver = recipe.get("incident_driver")
+            if incident_driver in {"worker", "database"}:
+                passed = db_healthy and cause_removed
+            else:
+                passed = db_healthy
             detail = (
+                "Database is healthy and no longer failing."
                 if passed
                 else "Database is still unstable or the triggering cause is still present."
             )
         else:
+            gateway_healthy = services["api-gateway"].status == "healthy"
+            db_healthy = services["database"].status == "healthy"
+            worker_healthy = services["worker"].status == "healthy"
             passed = (
+                gateway_healthy
+                and db_healthy
+                and worker_healthy
+                and cause_removed
+                and isolated is None
             )
             detail = (
                 "End-to-end login traffic is healthy."
     def _declare_resolved(self) -> tuple[bool, float, float, str]:
         checks = self._episode["checks"]
+        resolution_check = self._recipe().get("resolution_check", "end_to_end")
+        safe_to_resolve = bool(checks.get(resolution_check) and checks[resolution_check].passed)
         if not safe_to_resolve:
             self._set_failure("premature_resolution", "The incident is not verified as resolved yet.")
             return False, self._episode["scenario"]["reward_config"]["premature_resolution_penalty"], 0.0, "Resolution declaration rejected: required checks have not passed."
         self._episode["why_failed"] = why_failed
     def _advance_world(self) -> None:
+        cause_removed = self._episode["cause_removed"]
+        isolated = self._episode["isolated_service"]
+        if not cause_removed and isolated is None:
+            self._apply_service_updates(self._episode["scenario"].get("degraded_services", {}))
+            scenario = self._episode["scenario"]
+            self._episode["user_impact"] = max(self._episode["user_impact"], scenario.get("degraded_user_impact", self._episode["user_impact"]))
+            self._episode["slo_burn_rate"] = max(self._episode["slo_burn_rate"], scenario.get("degraded_slo_burn", self._episode["slo_burn_rate"]))
+        if isolated is not None and not cause_removed:
             self._episode["containment_applied"] = True
         self._episode["workflow_stage"] = self._workflow_stage()
         checks = self._episode["checks"]
         if checks["database_recovery"].passed or checks["end_to_end"].passed:
             return "validation"
+        if self._episode["containment_applied"] or self._episode["cause_removed"] or self._episode["isolated_service"] is not None:
             return "mitigation"
         return "triage"
             "database_recovery": checks["database_recovery"].passed,
             "end_to_end": checks["end_to_end"].passed,
             "incident_resolved": self._episode["incident_resolved"],
+            "isolation_applied": self._episode["isolated_service"] is not None,
         }
     def _incident_summary(self) -> str:
+        description = self._episode["scenario"].get("description")
+        if description:
+            return description
         return (
+            "An incident is degrading user traffic. Use evidence-gathering actions to diagnose, "
+            "then choose a safe remediation and verify with explicit checks."
         )
     def _prompt_text(self, tool_output: str | None) -> str:
             lines.extend(f"- [{alert.severity.upper()}] {alert.service}: {alert.message}" for alert in self._episode["alerts"])
         else:
             lines.append("- none")
+        noise = self._episode.get("noise_alerts", [])
+        if noise:
+            lines.extend(["", "NOISE_ALERTS (historically unrelated — resist querying these):"])
+            lines.extend(f"- [{alert.severity.upper()}] {alert.service}: {alert.message}" for alert in noise)
         lines.extend([
             "",
             "SERVICES:",
             "max_ticks": self._episode["max_ticks"],
             "workflow_stage": self._episode["workflow_stage"],
             "active_alerts": [alert.model_dump() for alert in self._episode["alerts"]],
+            "noise_alerts": [alert.model_dump() for alert in self._episode.get("noise_alerts", [])],
             "service_health": {name: service.model_dump() for name, service in self._episode["services"].items()},
             "discovered_evidence": list(self._episode["discovered_evidence"]),
             "recent_deploys": list(self._episode["recent_deploys"]),
             "score_breakdown": dict(self._episode["score_breakdown"]),
             "cumulative_reward": self._episode["cumulative_reward"],
             "wasteful_ticks": self._episode["wasteful_ticks"],
+            "blast_radius": self._episode.get("blast_radius", 0),
+            "noise_queries": self._episode.get("noise_queries", 0),
             "last_action_result": self._episode["last_action_result"],
             "failure_type": self._episode["failure_type"],
             "why_failed": self._episode["why_failed"],
             difficulty=self._episode["difficulty"],
             workflow_stage=self._episode["workflow_stage"],
             active_alerts=list(self._episode["alerts"]),
+            noise_alerts=list(self._episode.get("noise_alerts", [])),
             service_health=dict(self._episode["services"]),
             discovered_evidence=list(self._episode["discovered_evidence"]),
             recent_deploys=list(self._episode["recent_deploys"]),
             score_breakdown=dict(self._episode["score_breakdown"]),
             reward=round(reward, 4),
             done=done,
+            blast_radius=int(self._episode.get("blast_radius", 0)),
+            noise_queries=int(self._episode.get("noise_queries", 0)),
         )

unified_incident_env/server/grader.py CHANGED Viewed

@@ -24,7 +24,23 @@ def _service_score(status: str) -> float:
 class UnifiedIncidentGrader:
-    """Deterministic scorer focused on executed effects, not scripted clues."""
     def compute_breakdown(
         self,
@@ -33,33 +49,64 @@ class UnifiedIncidentGrader:
     ) -> dict[str, float]:
         services = state.get("service_health", {})
         weights = scenario["critical_service_weights"]
-        recovery_score = round(
-            sum(
-                weights.get(service, 0.0) * _service_score((services.get(service) or {}).get("status", "crashed"))
-                for service in weights
-            ),
-            4,
         )
-        containment_score = 0.2 if state.get("containment_applied") else 0.0
-        if state.get("containment_applied") and (services.get("worker") or {}).get("status") == "healthy":
-            containment_score = 0.3
         checks = {item.get("name"): bool(item.get("passed")) for item in state.get("checks", [])}
         verification_score = 0.0
         if checks.get("database_recovery"):
-            verification_score += 0.15
         if checks.get("end_to_end"):
-            verification_score += 0.2
         user_impact = float(state.get("user_impact", 1.0))
-        impact_score = round(max(0.0, 0.15 * (1.0 - user_impact)), 4)
         wasteful_ticks = int(state.get("wasteful_ticks", 0))
-        efficiency_score = round(max(0.0, 0.10 - (0.01 * wasteful_ticks)), 4)
         final_score = _strict_public_score(
-            recovery_score + containment_score + verification_score + impact_score + efficiency_score
         )
         return {
@@ -68,6 +115,8 @@ class UnifiedIncidentGrader:
             "verification_score": round(verification_score, 4),
             "impact_score": impact_score,
             "efficiency_score": efficiency_score,
             "final_score": final_score,
         }
@@ -88,7 +137,7 @@ class UnifiedIncidentGrader:
                     if state.get("containment_applied")
                     else "The root cause is still active or only partially contained."
                 ),
-                weight=0.30,
             ),
             GraderCheck(
                 name="database_recovery",
@@ -98,7 +147,7 @@ class UnifiedIncidentGrader:
                     if checks.get("database_recovery")
                     else "The database recovery check has not passed yet."
                 ),
-                weight=0.20,
             ),
             GraderCheck(
                 name="end_to_end_check",
@@ -112,10 +161,10 @@ class UnifiedIncidentGrader:
             ),
             GraderCheck(
                 name="critical_services_recovered",
-                passed=breakdown["recovery_score"] >= 0.8,
                 detail=(
                     "Critical-path services are recovered."
-                    if breakdown["recovery_score"] >= 0.8
                     else "Critical-path services are still degraded or crashed."
                 ),
                 weight=0.20,
@@ -130,6 +179,26 @@ class UnifiedIncidentGrader:
                 ),
                 weight=0.10,
             ),
         ]
         return GraderReport(
             scenario_id=scenario["id"],

 class UnifiedIncidentGrader:
+    """Deterministic scorer focused on executed effects, not scripted clues.
+    Hardened schedule (post Track-A headroom patch):
+    - recovery       0.00 – 0.25
+    - containment    0.00 – 0.15
+    - verification   0.00 – 0.20
+    - impact         0.00 – 0.05
+    - efficiency     0.00 – 0.05
+    - speed_bonus    0.00 – 0.10    (positive only when faster than optimal)
+    - noise_handling 0.00 – 0.05    (penalizes querying noise services)
+    Scripted deterministic baseline (which matches optimal_ticks exactly and
+    avoids noise queries) caps at ~0.70. Headroom 0.70 → 0.85 is reachable only
+    by an agent that (a) is strictly faster than optimal and (b) touches zero
+    noise services. That's the training target.
+    """
     def compute_breakdown(
         self,
     ) -> dict[str, float]:
         services = state.get("service_health", {})
         weights = scenario["critical_service_weights"]
+        recovery_raw = sum(
+            weights.get(service, 0.0) * _service_score((services.get(service) or {}).get("status", "crashed"))
+            for service in weights
         )
+        recovery_score = round(0.25 * recovery_raw, 4)
+        contained = bool(state.get("containment_applied"))
+        rollback_target = scenario.get("remediation_recipe", {}).get("rollback_target")
+        rollback_service_healthy = bool(
+            rollback_target and (services.get(rollback_target) or {}).get("status") == "healthy"
+        )
+        if contained and rollback_service_healthy:
+            containment_score = 0.15
+        elif contained:
+            containment_score = 0.10
+        else:
+            containment_score = 0.0
         checks = {item.get("name"): bool(item.get("passed")) for item in state.get("checks", [])}
         verification_score = 0.0
         if checks.get("database_recovery"):
+            verification_score += 0.08
         if checks.get("end_to_end"):
+            verification_score += 0.12
         user_impact = float(state.get("user_impact", 1.0))
+        impact_score = round(max(0.0, 0.05 * (1.0 - user_impact)), 4)
         wasteful_ticks = int(state.get("wasteful_ticks", 0))
+        efficiency_score = round(max(0.0, 0.05 - (0.005 * wasteful_ticks)), 4)
+        # speed_bonus: fully earned only if the agent finishes well under optimal_ticks.
+        optimal_ticks = int(scenario.get("optimal_ticks", 10))
+        current_tick = int(state.get("current_tick", 0))
+        incident_resolved = bool(state.get("incident_resolved"))
+        if incident_resolved and current_tick > 0 and current_tick < optimal_ticks:
+            speed_bonus = round(0.10 * (optimal_ticks - current_tick) / optimal_ticks, 4)
+        elif incident_resolved and current_tick == optimal_ticks:
+            speed_bonus = 0.0
+        else:
+            speed_bonus = 0.0
+        # noise_handling: deduct per query against a noise service, up to the cap of 0.05.
+        noise_services = set(scenario.get("difficulty_knobs", {}).get("noise_services", []))
+        noise_queries = int(state.get("noise_queries", 0))
+        if noise_services:
+            noise_handling_score = round(max(0.0, 0.05 - 0.015 * noise_queries), 4)
+        else:
+            noise_handling_score = 0.0
         final_score = _strict_public_score(
+            recovery_score
+            + containment_score
+            + verification_score
+            + impact_score
+            + efficiency_score
+            + speed_bonus
+            + noise_handling_score
         )
         return {
             "verification_score": round(verification_score, 4),
             "impact_score": impact_score,
             "efficiency_score": efficiency_score,
+            "speed_bonus": speed_bonus,
+            "noise_handling_score": noise_handling_score,
             "final_score": final_score,
         }
                     if state.get("containment_applied")
                     else "The root cause is still active or only partially contained."
                 ),
+                weight=0.20,
             ),
             GraderCheck(
                 name="database_recovery",
                     if checks.get("database_recovery")
                     else "The database recovery check has not passed yet."
                 ),
+                weight=0.15,
             ),
             GraderCheck(
                 name="end_to_end_check",
             ),
             GraderCheck(
                 name="critical_services_recovered",
+                passed=breakdown["recovery_score"] >= 0.20,
                 detail=(
                     "Critical-path services are recovered."
+                    if breakdown["recovery_score"] >= 0.20
                     else "Critical-path services are still degraded or crashed."
                 ),
                 weight=0.20,
                 ),
                 weight=0.10,
             ),
+            GraderCheck(
+                name="speed_bonus_earned",
+                passed=breakdown.get("speed_bonus", 0.0) > 0.0,
+                detail=(
+                    "Resolved faster than optimal_ticks."
+                    if breakdown.get("speed_bonus", 0.0) > 0.0
+                    else "Did not beat optimal tick budget."
+                ),
+                weight=0.10,
+            ),
+            GraderCheck(
+                name="noise_handling",
+                passed=breakdown.get("noise_handling_score", 0.0) >= 0.035,
+                detail=(
+                    "Minimal or no queries against noise services."
+                    if breakdown.get("noise_handling_score", 0.0) >= 0.035
+                    else "Wasted queries on noise services."
+                ),
+                weight=0.05,
+            ),
         ]
         return GraderReport(
             scenario_id=scenario["id"],

unified_incident_env/tests/test_environment.py CHANGED Viewed

@@ -6,7 +6,7 @@ from fastapi.testclient import TestClient
 from unified_incident_env.models import HypothesisPayload, UnifiedIncidentAction
 from unified_incident_env.server import app as app_module
-from unified_incident_env.server.challenge import DEFAULT_SCENARIO_ID, list_baselines
 from unified_incident_env.server.environment import UnifiedIncidentEnvironment
@@ -27,7 +27,7 @@ def test_baseline_resolves_honestly() -> None:
     checks = {check.name: check.passed for check in obs.checks}
     assert checks["database_recovery"] is True
     assert checks["end_to_end"] is True
-    assert obs.final_score > 0.7
 def test_query_deploys_reveals_evidence_but_not_positive_reward() -> None:
@@ -114,12 +114,15 @@ def test_routes_expose_new_catalog_and_status(monkeypatch) -> None:
     assert tasks.status_code == 200
     payload = tasks.json()
     assert payload["default_scenario_id"] == DEFAULT_SCENARIO_ID
-    assert len(payload["scenarios"]) == 1
     baseline = client.get("/baseline")
     assert baseline.status_code == 200
     baseline_payload = baseline.json()
-    assert baseline_payload["baselines"][0]["scenario_id"] == DEFAULT_SCENARIO_ID
     health = client.get("/health")
     assert health.status_code == 200
@@ -130,3 +133,143 @@ def test_routes_expose_new_catalog_and_status(monkeypatch) -> None:
     status_payload = status.json()
     assert status_payload["progress"]["scenario_id"] == DEFAULT_SCENARIO_ID
     assert status_payload["grader"]["score"] > 0.0

 from unified_incident_env.models import HypothesisPayload, UnifiedIncidentAction
 from unified_incident_env.server import app as app_module
+from unified_incident_env.server.challenge import DEFAULT_SCENARIO_ID, SCENARIOS, list_baselines, scenario_for_difficulty
 from unified_incident_env.server.environment import UnifiedIncidentEnvironment
     checks = {check.name: check.passed for check in obs.checks}
     assert checks["database_recovery"] is True
     assert checks["end_to_end"] is True
+    assert obs.final_score > 0.55
 def test_query_deploys_reveals_evidence_but_not_positive_reward() -> None:
     assert tasks.status_code == 200
     payload = tasks.json()
     assert payload["default_scenario_id"] == DEFAULT_SCENARIO_ID
+    scenarios_by_difficulty = {scenario["difficulty"] for scenario in payload["scenarios"]}
+    assert {"easy", "medium", "hard"}.issubset(scenarios_by_difficulty)
+    assert {"easy", "medium", "hard"}.issubset(set(payload["available_difficulties"]))
     baseline = client.get("/baseline")
     assert baseline.status_code == 200
     baseline_payload = baseline.json()
+    baseline_ids = {item["scenario_id"] for item in baseline_payload["baselines"]}
+    assert {"worker_deploy_cascade", "db_config_rollout", "gateway_auth_rollout"}.issubset(baseline_ids)
     health = client.get("/health")
     assert health.status_code == 200
     status_payload = status.json()
     assert status_payload["progress"]["scenario_id"] == DEFAULT_SCENARIO_ID
     assert status_payload["grader"]["score"] > 0.0
+def _run_baseline_for_scenario(scenario_id: str):
+    env = UnifiedIncidentEnvironment()
+    env.reset(scenario_id=scenario_id)
+    last = None
+    for step in list_baselines(scenario_id).baselines[0].actions:
+        last = env.step(step.action)
+    return last
+def test_medium_baseline_resolves_honestly() -> None:
+    obs = _run_baseline_for_scenario("db_config_rollout")
+    assert obs is not None
+    assert obs.done is True
+    assert obs.incident_resolved is True
+    checks = {check.name: check.passed for check in obs.checks}
+    assert checks["database_recovery"] is True
+    assert checks["end_to_end"] is True
+    assert obs.final_score > 0.55
+def test_hard_baseline_resolves_honestly() -> None:
+    obs = _run_baseline_for_scenario("gateway_auth_rollout")
+    assert obs is not None
+    assert obs.done is True
+    assert obs.incident_resolved is True
+    checks = {check.name: check.passed for check in obs.checks}
+    assert checks["end_to_end"] is True
+    assert obs.final_score > 0.55
+def test_medium_wrong_rollback_target_is_penalized() -> None:
+    env = UnifiedIncidentEnvironment()
+    env.reset(scenario_id="db_config_rollout")
+    obs = env.step(UnifiedIncidentAction(action_type="rollback_deploy", service="worker"))
+    assert obs.reward < 0.0
+    assert obs.failure_type == "wrong_remediation_target"
+    assert obs.incident_resolved is False
+def test_hard_wrong_rollback_target_is_penalized() -> None:
+    env = UnifiedIncidentEnvironment()
+    env.reset(scenario_id="gateway_auth_rollout")
+    obs = env.step(UnifiedIncidentAction(action_type="rollback_deploy", service="worker"))
+    assert obs.reward < 0.0
+    assert obs.failure_type == "wrong_remediation_target"
+def test_all_scenarios_expose_noise_alerts() -> None:
+    env = UnifiedIncidentEnvironment()
+    for scenario_id in ("worker_deploy_cascade", "db_config_rollout", "gateway_auth_rollout"):
+        obs = env.reset(scenario_id=scenario_id)
+        assert len(obs.noise_alerts) > 0, f"{scenario_id} should expose noise_alerts"
+        assert all(alert.message for alert in obs.noise_alerts)
+def test_blast_radius_increments_on_mitigations() -> None:
+    env = UnifiedIncidentEnvironment()
+    env.reset(scenario_id="worker_deploy_cascade")
+    obs0 = env.step(UnifiedIncidentAction(action_type="query_logs", service="worker"))
+    assert obs0.blast_radius == 0
+    env.step(UnifiedIncidentAction(action_type="rollback_deploy", service="worker"))
+    obs2 = env.step(UnifiedIncidentAction(action_type="restart_service", service="database"))
+    assert obs2.blast_radius == 2
+def test_baseline_ceiling_is_hardened_below_080() -> None:
+    """Scripted-optimal baseline must not score above ~0.80. Headroom left
+    for a trained agent that earns speed_bonus by finishing faster than
+    optimal_ticks."""
+    for scenario_id in ("worker_deploy_cascade", "db_config_rollout", "gateway_auth_rollout"):
+        obs = _run_baseline_for_scenario(scenario_id)
+        assert obs is not None
+        assert obs.final_score <= 0.80, f"{scenario_id} ceiling {obs.final_score} exceeds headroom budget"
+        assert obs.final_score >= 0.55, f"{scenario_id} ceiling {obs.final_score} is too low; env is unsolvable"
+def test_speed_bonus_rewards_finishing_under_optimal_ticks() -> None:
+    """A faster solve that keeps both verification checks should beat the
+    baseline ceiling by the speed_bonus margin. This is the training target
+    — trained agents that skip verification to chase speed should score
+    *lower*, not higher."""
+    env = UnifiedIncidentEnvironment()
+    env.reset(scenario_id="gateway_auth_rollout")
+    # 5-step path: 1 query + 1 rollback + 2 checks + 1 declare. Baseline does 8.
+    env.step(UnifiedIncidentAction(action_type="query_deploys", service="api-gateway"))
+    env.step(UnifiedIncidentAction(action_type="rollback_deploy", service="api-gateway"))
+    env.step(UnifiedIncidentAction(action_type="run_check", check_name="end_to_end"))
+    env.step(UnifiedIncidentAction(action_type="run_check", check_name="database_recovery"))
+    obs = env.step(UnifiedIncidentAction(action_type="declare_resolved"))
+    assert obs.incident_resolved is True
+    assert obs.score_breakdown.get("speed_bonus", 0) > 0.0
+    assert obs.final_score > 0.74, f"Faster solve with full verification should beat baseline, got {obs.final_score}"
+def test_hard_does_not_require_database_recovery_check() -> None:
+    env = UnifiedIncidentEnvironment()
+    env.reset(scenario_id="gateway_auth_rollout")
+    env.step(UnifiedIncidentAction(action_type="rollback_deploy", service="api-gateway"))
+    end_to_end = env.step(UnifiedIncidentAction(action_type="run_check", check_name="end_to_end"))
+    assert any(check.name == "end_to_end" and check.passed for check in end_to_end.checks)
+    resolved = env.step(UnifiedIncidentAction(action_type="declare_resolved"))
+    assert resolved.incident_resolved is True
+def test_procgen_catalog_registers_variants_for_each_template() -> None:
+    procgen_ids = {scenario_id for scenario_id, scenario in SCENARIOS.items() if scenario.get("is_procgen")}
+    assert any(scenario_id.startswith("worker_deploy_cascade__p") for scenario_id in procgen_ids)
+    assert any(scenario_id.startswith("db_config_rollout__p") for scenario_id in procgen_ids)
+    assert any(scenario_id.startswith("gateway_auth_rollout__p") for scenario_id in procgen_ids)
+def test_scenario_for_difficulty_seed_is_deterministic() -> None:
+    first = scenario_for_difficulty("medium", seed=7)
+    second = scenario_for_difficulty("medium", seed=7)
+    assert first["id"] == second["id"]
+    assert first["difficulty"] == "medium"
+def test_procgen_variant_baseline_routes_through_template_builder() -> None:
+    scenario_id = next(
+        current_id
+        for current_id, scenario in SCENARIOS.items()
+        if scenario.get("is_procgen") and scenario.get("template_id") == "db_config_rollout"
+    )
+    obs = _run_baseline_for_scenario(scenario_id)
+    assert obs is not None
+    assert obs.incident_resolved is True
+    assert obs.final_score >= 0.55
+def test_noise_service_queries_are_scored_as_noise() -> None:
+    env = UnifiedIncidentEnvironment()
+    obs = env.reset(scenario_id="gateway_auth_rollout__p01")
+    noise_service = obs.noise_alerts[0].service
+    noise_obs = env.step(UnifiedIncidentAction(action_type="query_logs", service=noise_service))
+    assert noise_obs.noise_queries == 1
+    assert noise_service in (noise_obs.tool_output or "")
+    assert noise_obs.score_breakdown["noise_handling_score"] < 0.05

uv.lock DELETED Viewed

The diff for this file is too large to render. See raw diff