dakshdoesdev Claude Opus 4.7 (1M context) committed on
Commit 0bf41ea · 1 Parent(s): c8bef53

Harden env + ship Claude skill, OpenClaw-RL shim, training pipeline


Env hardening (Stream 1):
- Rebalance grader into 7 dimensions with speed_bonus + noise_handling_score
so scripted baseline ceiling drops 0.99 -> 0.74, creating training headroom
- Surface noise_alerts, blast_radius, noise_queries on observation + state
- Extend ServiceName literal with the noise-service pool

Procedural scenarios:
- Each of the 3 hand-crafted templates spawns 4 jittered variants via seeded
RNG: metric noise, deploy timing, rotated noise-service selection
- 15 scenarios total (5 per difficulty), baseline-resolvable via template_id
dispatch in _baseline_actions
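
A minimal sketch of how seeded variant spawning like this could work. The `Scenario` fields, the noise-service names, the jitter ranges, and `spawn_variants` are illustrative assumptions, not the repository's actual code; only the counts (3 templates, 4 variants each, 15 total) and the `template_id` idea come from the message above.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, replace

# Placeholder decoy-service names (assumption, not the real noise-service pool).
NOISE_POOL = ["noise-a", "noise-b", "noise-c", "noise-d"]


@dataclass(frozen=True)
class Scenario:
    template_id: str       # lets a baseline dispatch on the originating template
    scenario_id: str
    difficulty: str
    metric_noise: float    # hypothetical metric-noise amplitude
    deploy_offset_s: int   # hypothetical deploy-timing jitter in seconds
    noise_service: str     # decoy service selected for this variant


def spawn_variants(template: Scenario, n: int = 4, seed: int = 0) -> list[Scenario]:
    """Derive n jittered variants; a string seed keeps results reproducible across runs."""
    rng = random.Random(f"{seed}:{template.template_id}")
    variants = []
    for i in range(n):
        variants.append(
            replace(
                template,
                scenario_id=f"{template.template_id}__v{i + 1}",
                metric_noise=round(rng.uniform(0.0, 0.1), 4),
                deploy_offset_s=rng.randint(-120, 120),
                noise_service=NOISE_POOL[i % len(NOISE_POOL)],  # rotate decoys
            )
        )
    return variants


TEMPLATES = [
    Scenario(f"{d}_template", f"{d}_template", d, 0.0, 0, NOISE_POOL[0])
    for d in ("easy", "medium", "hard")
]

# Each template plus its 4 variants: 3 * (1 + 4) = 15 scenarios.
ALL_SCENARIOS = [s for t in TEMPLATES for s in [t, *spawn_variants(t)]]
```

Because every variant keeps its `template_id`, a scripted baseline can reuse the template's action list for all five scenarios of a difficulty.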

OpenClaw-RL integration shim:
- openclaw_integration/pool_server.py: FastAPI lease-based session server,
asyncio-locked per-lease, TTL reaper for idle cleanup
- openclaw_integration/sre_env_client.py: drop-in shape match with
terminal-rl/env_client.py
- README documents the one-line import patch for terminal-rl/generate.py

Claude Code skill (v0 pitch):
- skill/SKILL.md with investigation methodology + decoy ground truth
- skill/tools/sre_gym_client.py CLI: list / solve / interactive /
record-runbook
- skill/verified-runbooks/ seeded with clean traces of all 3 templates

Training pipeline:
- train/sanity_run.ipynb: Colab-ready Qwen3.5-4B (Qwen3-4B fallback) Unsloth
LoRA SFT dry-run, 200 toy steps, wandb
- train/collect_trajectories.py: parallel async harness with anthropic +
heuristic drivers, uses UnifiedIncidentEnv WebSocket client for state
persistence
- train/requirements-train.txt: pinned Unsloth + TRL + wandb + anthropic

Demo + deploy:
- demo/run_demo.sh + pitch.md: 60-second demo script, 3 solves + runbook
accumulation
- deploy/push_to_hf.sh: HF Space deploy helper (env vars: HF_TOKEN, HF_SPACE_ID)
- README rewritten to lead with the 30-second install + architecture diagram
- openenv.yaml: difficulties [easy, medium, hard]; space_id dakshdoesdev/sre-gym

Test suite: 21 -> 29 passing. openenv validate green. Live Space:
https://dakshdoesdev-sre-gym.hf.space

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

.gitignore CHANGED
@@ -7,3 +7,6 @@ learning_curve.png
7
  .codex/
8
  outputs/
9
  AGENTS.md
10
+ .sisyphus/
11
+ *.egg-info/
12
+ uv.lock
.sisyphus/plans/reward-redesign.md DELETED
@@ -1,609 +0,0 @@
1
- # Reward Redesign for Unified Incident Env
2
-
3
- ## TL;DR
4
- > **Summary**: Replace breadcrumb-based rewards with a world-state-based reward system: normalized step cost, incident-health delta shaping, a tiny non-farmable hypothesis-quality bonus, and terminal bonuses/penalties tied to verified containment and recovery. Keep the public deterministic benchmark score separate from training reward, but remove breadcrumb terms from both.
5
- > **Deliverables**:
6
- > - Reworked training-time step reward in `unified_incident_env/server/environment.py`
7
- > - Reworked public deterministic score in `unified_incident_env/server/grader.py`
8
- > - Structured hypothesis payload on `classify_vulnerability`
9
- > - Scenario-authored critical-path service weights and reward config in `server/challenge.py`
10
- > - Updated prompts/inference/tests for the structured hypothesis contract
11
- > - Regression tests proving breadcrumb rewards are gone and world-improving actions dominate
12
- > **Effort**: Large
13
- > **Parallel**: YES - 4 waves
14
- > **Critical Path**: Task 1 → Task 2 → Task 3 → Task 6
15
-
16
- ## Context
17
- ### Original Request
18
- Redesign the reward system so points come from world improvement, step cost, and small calibrated hypothesis quality rather than from the environment revealing the “correct branch.” Keep the env compatible with Gymnasium/OpenEnv-style RL where every `step(action)` returns reward, but tie reward to state-transition quality rather than clue clicks.
19
-
20
- ### Interview Summary
21
- - Reward must still be emitted on every step.
22
- - Investigation actions should mostly cost time and should not directly reward clue discovery.
23
- - Hypothesis actions can receive a small score for decision quality: root-cause accuracy, service localization, confidence calibration, and recommended next action quality.
24
- - Big rewards should remain tied to actual containment, verified recovery, and correct final resolution.
25
- - Reward shaping should follow the spirit of potential-based shaping: dense guidance via better state, not better clue collection.
26
- - Training can run on Colab/Kaggle; environment logic remains local.
27
-
28
- ### Metis Review (gaps addressed)
29
- - Added a strict **reward whitelist** and **forbidden-source blacklist**.
30
- - Made hypothesis reward explicitly one-time and non-farmable.
31
- - Separated training reward from public deterministic benchmark score.
32
- - Normalized step costs by scenario budget to avoid punishing longer scenarios unfairly.
33
- - Added explicit regression checks for reward/public-score drift.
34
- - Resolved hidden ambiguity: reuse `classify_vulnerability` instead of introducing a new `submit_hypothesis` action.
35
-
36
- ## Work Objectives
37
- ### Core Objective
38
- Refactor the benchmark so the agent learns from state improvement and decision quality, not from authored breadcrumb rewards, while preserving a deterministic public evaluation contract.
39
-
40
- ### Deliverables
41
- - `server/environment.py` returns step rewards based on:
42
- - normalized step cost
43
- - delta incident-health potential
44
- - one-time hypothesis bonus/penalty
45
- - terminal outcome bonus/penalty
46
- - explicit unsafe/redundant action penalties
47
- - `server/grader.py` computes public `final_score` without rewarding evidence discovery, patch-id guessing, or stage progression by itself.
48
- - `server/challenge.py` contains per-scenario critical-path service weights and reward-config metadata.
49
- - `models.py` extends `classify_vulnerability` payload to carry hypothesis scoring fields.
50
- - `trainer/prompts.py` and `inference.py` understand the structured hypothesis payload.
51
- - Tests cover reward decomposition, non-farmable hypothesis scoring, and terminal correctness.
52
-
53
- ### Definition of Done (verifiable conditions with commands)
54
- - `./.venv/bin/pytest unified_incident_env/tests -q` exits 0.
55
- - For a fixed scenario, a pure query action yields only step cost / redundancy effects, not positive breadcrumb reward.
56
- - For a fixed scenario, verified containment/recovery yields positive reward deltas.
57
- - Repeating the same hypothesis does not mint additional bonus.
58
- - Public deterministic score no longer uses `relevant_investigations` or any direct clue-count term.
59
-
60
- ### Must Have
61
- - No direct positive reward for evidence discovery, unlock events, query success, patch-id selection, or stage advancement.
62
- - Incident-health potential derived only from verified/public world state.
63
- - `classify_vulnerability` supports structured hypothesis scoring with cause, services, confidence, and next action.
64
- - Training reward and public score are both documented and distinguishable.
65
-
66
- ### Must NOT Have
67
- - No new `submit_hypothesis` action unless the existing `classify_vulnerability` path proves insufficient during implementation review.
68
- - No hidden proxy breadcrumb reward through internal fields like `matched_evidence_ids`, `unlock_threshold`, or `infra_progress`.
69
- - No reward mutation outside the actual returned `reward` from `step()`.
70
- - No acceptance criteria that depend on human eyeballing logs.
71
-
72
- ## Verification Strategy
73
- > ZERO HUMAN INTERVENTION - all verification is agent-executed.
74
- - Test decision: tests-after with existing `pytest` suite plus new deterministic reward regression tests.
75
- - QA policy: every implementation task includes agent-executed assertions on reward sign/magnitude and action/schema behavior.
76
- - Evidence: `.sisyphus/evidence/task-{N}-{slug}.{ext}`
77
-
78
- ## Execution Strategy
79
- ### Parallel Execution Waves
80
- Wave 1: reward-model foundation and schema decisions
81
- - Task 1: define allowed/forbidden reward sources and scenario reward config
82
- - Task 2: extend action/state schema for structured hypotheses
83
- - Task 3: implement incident-health potential helpers
84
-
85
- Wave 2: core scoring rewrite
86
- - Task 4: replace step reward logic in environment
87
- - Task 5: replace public deterministic score breakdown
88
- - Task 6: update scenario metadata and authored weights
89
-
90
- Wave 3: contract consumers
91
- - Task 7: update prompts, response schema, and parser expectations
92
- - Task 8: update inference fallback/hypothesis generation
93
- - Task 9: update baseline/walkthrough/tests for new hypothesis payload
94
-
95
- Wave 4: regression and training-path hardening
96
- - Task 10: add reward decomposition/regression tests
97
- - Task 11: add reward/public-score drift checks for fixed scenarios
98
- - Task 12: document Colab/Kaggle GRPO usage against the new reward semantics
99
-
100
- ### Dependency Matrix (full, all tasks)
101
- - Task 1 blocks Tasks 3, 4, 5, 6.
102
- - Task 2 blocks Tasks 7, 8, 9.
103
- - Task 3 blocks Task 4.
104
- - Task 4 blocks Tasks 10 and 11.
105
- - Task 5 blocks Task 11.
106
- - Task 6 blocks Task 4 and Task 5.
107
- - Task 7 blocks Task 8 and Task 9.
108
- - Task 8 blocks Task 12.
109
- - Task 9 blocks Task 10.
110
- - Tasks 10 and 11 block final verification wave.
111
-
112
- ### Agent Dispatch Summary
113
- - Wave 1 → 3 tasks → deep / oracle-consulted / quick
114
- - Wave 2 → 3 tasks → deep / unspecified-high
115
- - Wave 3 → 3 tasks → quick / unspecified-high
116
- - Wave 4 → 3 tasks → quick / writing / unspecified-high
117
-
118
- ## TODOs
119
- > Implementation + Test = ONE task. Never separate.
120
- > EVERY task MUST have: Agent Profile + Parallelization + QA Scenarios.
121
-
122
- - [ ] 1. Define reward whitelist, blacklist, and config schema
123
-
124
- **What to do**: Add a single source of truth for reward terms in `server/challenge.py` or a nearby reward-config module. Define which signals are allowed to contribute to training reward and which are forbidden. Add per-scenario `critical_service_weights`, `step_cost_scale`, and hypothesis-bonus constants. Remove authored dependence on clue/evidence counts from the new reward path.
125
- **Must NOT do**: Do not yet rewrite reward logic in `environment.py`; do not add a new action type.
126
-
127
- **Recommended Agent Profile**:
128
- - Category: `deep` - Reason: this is the architecture lock for all later reward logic.
129
- - Skills: `[]` - no special skill required.
130
- - Omitted: `[omarchy]` - unrelated domain.
131
-
132
- **Parallelization**: Can Parallel: NO | Wave 1 | Blocks: 3,4,5,6 | Blocked By: none
133
-
134
- **References**:
135
- - Pattern: `unified_incident_env/server/challenge.py:96-156,284-345,486-546` - current evidence/unlock/verify metadata to replace or augment.
136
- - Pattern: `unified_incident_env/server/environment.py:263-323` - current breadcrumb reward path.
137
- - Pattern: `unified_incident_env/server/grader.py:73-128` - current public score terms.
138
- - External: `https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf` - shaping must preserve the right objective.
139
- - External: `https://github.com/Farama-Foundation/Gymnasium` - step rewards should reflect environment transition quality.
140
-
141
- **Acceptance Criteria**:
142
- - [ ] Reward config defines `critical_service_weights` summing to 1.0 for every scenario.
143
- - [ ] Reward config explicitly lists forbidden reward sources: evidence discovery, clue unlock, patch-id correctness, stage advancement, query success.
144
- - [ ] Existing scenario fixtures still load successfully.
145
-
146
- **QA Scenarios**:
147
- ```
148
- Scenario: Reward config loads for all scenarios
149
- Tool: Bash
150
- Steps: Run a Python one-liner importing all scenarios and validating weight sums and required keys.
151
- Expected: Exit 0; every scenario has complete reward config and valid normalized weights.
152
- Evidence: .sisyphus/evidence/task-1-reward-config.txt
153
-
154
- Scenario: Forbidden-source list is complete
155
- Tool: Bash
156
- Steps: Grep config and associated tests for all banned terms.
157
- Expected: Forbidden-source entries exist and are asserted in tests.
158
- Evidence: .sisyphus/evidence/task-1-reward-config-grep.txt
159
- ```
160
-
161
- **Commit**: YES | Message: `refactor(rewards): define shaping config and forbidden reward sources` | Files: `unified_incident_env/server/challenge.py`, nearby config module, tests
162
-
163
- - [ ] 2. Extend `classify_vulnerability` into a structured hypothesis commit
164
-
165
- **What to do**: Modify `UnifiedIncidentAction` so `classify_vulnerability` carries a structured hypothesis payload: `vulnerability_type`, `affected_services`, `confidence`, and `recommended_next_action`. Update validators, observation/state mirrors if needed, and any schema-generation logic that relies on action fields.
166
- **Must NOT do**: Do not add `submit_hypothesis`; do not break existing parsing for valid old payloads without an explicit migration path.
167
-
168
- **Recommended Agent Profile**:
169
- - Category: `unspecified-high` - Reason: touches schema, parser expectations, and compatibility.
170
- - Skills: `[]`
171
- - Omitted: `[omarchy]`
172
-
173
- **Parallelization**: Can Parallel: YES | Wave 1 | Blocks: 7,8,9 | Blocked By: none
174
-
175
- **References**:
176
- - Pattern: `unified_incident_env/models.py:11-67` - current action schema.
177
- - Pattern: `unified_incident_env/trainer/prompts.py:216-230,385-405` - required-field and example generation.
178
- - Pattern: `unified_incident_env/tests/test_environment.py:333-345` - public action schema lock.
179
- - Pattern: `unified_incident_env/tests/test_trainer.py:45-107` - parser behavior expectations.
180
-
181
- **Acceptance Criteria**:
182
- - [ ] `classify_vulnerability` requires the new structured fields.
183
- - [ ] Existing explicit valid actions with complete fields parse successfully.
184
- - [ ] Tests cover missing `confidence`, malformed `affected_services`, and invalid recommended action values.
185
-
186
- **QA Scenarios**:
187
- ```
188
- Scenario: Structured hypothesis validates
189
- Tool: Bash
190
- Steps: Construct a valid classify_vulnerability action via Python and print model_dump.
191
- Expected: Exit 0; payload includes all structured hypothesis fields.
192
- Evidence: .sisyphus/evidence/task-2-hypothesis-valid.txt
193
-
194
- Scenario: Invalid hypothesis is rejected
195
- Tool: Bash
196
- Steps: Construct invalid actions missing required hypothesis fields.
197
- Expected: Validation raises deterministic errors.
198
- Evidence: .sisyphus/evidence/task-2-hypothesis-invalid.txt
199
- ```
200
-
201
- **Commit**: YES | Message: `feat(schema): structure vulnerability classification as scored hypothesis` | Files: `unified_incident_env/models.py`, parsers, tests
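
A hedged sketch of what validation for the structured hypothesis in Task 2 might look like. It uses a plain dataclass rather than the project's actual `UnifiedIncidentAction` model, and the `[0, 1]` confidence range and the allowed next-action set are assumptions (only `apply_patch` is named elsewhere in this plan).

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical set of legal follow-up actions.
ALLOWED_NEXT_ACTIONS = {"apply_patch", "query_logs", "restart_service"}


@dataclass(frozen=True)
class Hypothesis:
    vulnerability_type: str
    affected_services: tuple[str, ...]
    confidence: float
    recommended_next_action: str

    def __post_init__(self) -> None:
        if not self.vulnerability_type:
            raise ValueError("vulnerability_type is required")
        services = self.affected_services
        # Reject a bare string, an empty collection, or non-string entries.
        if (
            not isinstance(services, (list, tuple))
            or not services
            or not all(isinstance(s, str) and s for s in services)
        ):
            raise ValueError("affected_services must be a non-empty list of service names")
        if isinstance(self.confidence, bool) or not isinstance(self.confidence, (int, float)):
            raise ValueError("confidence must be a number")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        if self.recommended_next_action not in ALLOWED_NEXT_ACTIONS:
            raise ValueError("recommended_next_action is not a known action")


def is_valid(**fields) -> bool:
    """Convenience wrapper: True when the payload passes every check."""
    try:
        Hypothesis(**fields)
    except (TypeError, ValueError):
        return False
    return True
```

Raising the same `ValueError` messages for the same inputs keeps the "deterministic errors" expectation in the QA scenario easy to assert on.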
202
-
203
- - [ ] 3. Implement incident-health potential helpers
204
-
205
- **What to do**: Add helper functions in `server/environment.py` (or a sibling reward helper module) to compute `operational_health`, `security_health`, and `incident_health_potential` from public/verified state only. Use service-status values `healthy=1.0`, `degraded=0.4`, `crashed=0.0`, weighted by scenario-authored critical-path weights.
206
- **Must NOT do**: Do not compute potential from evidence counters, stage names, recovery index, or hidden authored truth labels.
207
-
208
- **Recommended Agent Profile**:
209
- - Category: `quick` - Reason: local pure-function implementation once config is fixed.
210
- - Skills: `[]`
211
- - Omitted: `[omarchy]`
212
-
213
- **Parallelization**: Can Parallel: NO | Wave 1 | Blocks: 4 | Blocked By: 1
214
-
215
- **References**:
216
- - Pattern: `unified_incident_env/server/environment.py:501-516,556-560,692-700,856-947`
217
- - Pattern: `unified_incident_env/models.py:132-164,176-250`
218
- - External: Ng/Harada/Russell shaping paper above.
219
-
220
- **Acceptance Criteria**:
221
- - [ ] Potential helpers are pure and deterministic.
222
- - [ ] Potential increases when critical-path services improve.
223
- - [ ] Potential does not change from evidence-only discoveries when service/security health stays the same.
224
-
225
- **QA Scenarios**:
226
- ```
227
- Scenario: Potential rises on service recovery
228
- Tool: Bash
229
- Steps: Create before/after state fixtures with one critical service moving crashed -> healthy.
230
- Expected: after_potential > before_potential.
231
- Evidence: .sisyphus/evidence/task-3-potential-rise.txt
232
-
233
- Scenario: Evidence-only change has no positive shaping
234
- Tool: Bash
235
- Steps: Compare states that differ only by evidence counters/unlock flags.
236
- Expected: potential delta == 0.
237
- Evidence: .sisyphus/evidence/task-3-potential-no-breadcrumb.txt
238
- ```
239
-
240
- **Commit**: YES | Message: `refactor(rewards): add incident-health potential helpers` | Files: `unified_incident_env/server/environment.py`, tests
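
As an illustration of the weighted health computation described in Task 3, a minimal sketch follows. The status values (`healthy=1.0`, `degraded=0.4`, `crashed=0.0`) come from the task text; the function signature, the service names, and the example weights are assumptions.

```python
from __future__ import annotations

# Status values taken from the task description.
STATUS_VALUE = {"healthy": 1.0, "degraded": 0.4, "crashed": 0.0}


def operational_health(statuses: dict[str, str], weights: dict[str, float]) -> float:
    """Weighted service health in [0, 1]; reads only public service status.

    Evidence counters, unlock flags, and stage names are deliberately not
    inputs, so evidence-only changes cannot move the result.
    """
    return sum(weight * STATUS_VALUE[statuses[svc]] for svc, weight in weights.items())


# Hypothetical critical-path weights for one scenario (they sum to 1.0).
weights = {"api": 0.5, "worker": 0.3, "database": 0.2}
before = {"api": "degraded", "worker": "crashed", "database": "healthy"}
after = {**before, "worker": "healthy"}

# Recovering the worker raises the potential by its weight (0.3).
delta = operational_health(after, weights) - operational_health(before, weights)
```

With these numbers the potential moves from about 0.4 to about 0.7, which is the "crashed -> healthy" rise the first QA scenario checks for.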
241
-
242
- - [ ] 4. Rewrite environment step rewards around delta health + cost + penalties
243
-
244
- **What to do**: Replace per-handler positive breadcrumb rewards with a single post-transition reward computation based on `gamma * Φ(s') - Φ(s)`, normalized step cost, tiny hypothesis bonus/penalty, and explicit unsafe/redundant-action surcharges. Ensure repeated-action penalties flow through returned `reward`, not hidden cumulative mutations.
245
- **Must NOT do**: Do not keep direct `+0.05` query rewards, direct patch-id credit, or verify-button credit.
246
-
247
- **Recommended Agent Profile**:
248
- - Category: `deep` - Reason: central behavior change with many edge cases.
249
- - Skills: `[]`
250
- - Omitted: `[omarchy]`
251
-
252
- **Parallelization**: Can Parallel: NO | Wave 2 | Blocks: 10,11 | Blocked By: 1,3,6
253
-
254
- **References**:
255
- - Pattern: `unified_incident_env/server/environment.py:103-177,263-323,325-554,569-601`
256
- - Pattern: `unified_incident_env/tests/test_environment.py:205-232,307-330`
257
- - Pattern: `unified_incident_env/server/challenge.py` reward-relevant scenario metadata after Task 1.
258
-
259
- **Acceptance Criteria**:
260
- - [ ] Query/evidence actions emit only step cost or redundancy penalty unless the underlying world state improves.
261
- - [ ] Wrong/harmful actions emit negative reward.
262
- - [ ] Verified service recovery and exploit containment emit positive reward due to state improvement.
263
- - [ ] No hidden mutation adjusts cumulative reward independently of returned reward.
264
-
265
- **QA Scenarios**:
266
- ```
267
- Scenario: Investigation no longer gives breadcrumb reward
268
- Tool: Bash
269
- Steps: Run a fixed scenario reset then a single query action that only reveals evidence.
270
- Expected: reward <= 0, with no positive breadcrumb term.
271
- Evidence: .sisyphus/evidence/task-4-no-query-reward.txt
272
-
273
- Scenario: Verified recovery yields positive reward
274
- Tool: Bash
275
- Steps: Execute a known-good mitigation step that improves critical service health.
276
- Expected: reward > 0 and health potential increases.
277
- Evidence: .sisyphus/evidence/task-4-recovery-positive.txt
278
- ```
279
-
280
- **Commit**: YES | Message: `refactor(rewards): score steps by health delta and normalized costs` | Files: `unified_incident_env/server/environment.py`, tests
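
The reward composition in Task 4 can be sketched as a single pure function. The shaping term `gamma * Φ(s') - Φ(s)` is from the task text; the parameter names, the default `gamma`, and the way step cost is normalized by `max_steps` are assumptions for illustration.

```python
def step_reward(
    phi_before: float,
    phi_after: float,
    *,
    gamma: float = 0.99,            # assumed discount; the plan does not fix a value
    max_steps: int = 20,            # hypothetical scenario step budget
    step_cost_scale: float = 1.0,   # per-scenario scale named in Task 1
    hypothesis_bonus: float = 0.0,  # one-time bonus/penalty, zero on most steps
    penalty: float = 0.0,           # unsafe/redundant-action surcharge
) -> float:
    """Post-transition reward: shaping minus normalized cost, plus bonus, minus penalties."""
    shaping = gamma * phi_after - phi_before
    step_cost = step_cost_scale / max_steps   # longer budgets pay less per step
    return shaping - step_cost + hypothesis_bonus - penalty


# A pure query leaves the potential unchanged, so the reward is non-positive.
query_reward = step_reward(0.4, 0.4)

# A mitigation that lifts the potential from 0.4 to 0.7 earns a positive reward.
recovery_reward = step_reward(0.4, 0.7)
```

Computing the reward once, after the transition, is what keeps individual action handlers from adding their own breadcrumb terms.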
281
-
282
- - [ ] 5. Rewrite public deterministic score to remove breadcrumb terms
283
-
284
- **What to do**: Update `server/grader.py` so `final_score` reflects verified operational recovery, verified security completion, efficiency, and postmortem quality without direct investigation-count or patch-id-guess terms. Preserve deterministic scoring/report shape.
285
- **Must NOT do**: Do not make public score depend on hidden health potential internals or trainer-specific gamma.
286
-
287
- **Recommended Agent Profile**:
288
- - Category: `unspecified-high` - Reason: public benchmark semantics change.
289
- - Skills: `[]`
290
- - Omitted: `[omarchy]`
291
-
292
- **Parallelization**: Can Parallel: YES | Wave 2 | Blocks: 11 | Blocked By: 1,6
293
-
294
- **References**:
295
- - Pattern: `unified_incident_env/server/grader.py:68-201`
296
- - Pattern: `unified_incident_env/tests/test_environment.py:349-388`
297
-
298
- **Acceptance Criteria**:
299
- - [ ] `relevant_investigations` is no longer part of `infrastructure_score`.
300
- - [ ] `selected_patch` or `selected_vulnerability` alone do not award public score before verification/completion.
301
- - [ ] Existing report/check structure remains deterministic.
302
-
303
- **QA Scenarios**:
304
- ```
305
- Scenario: Breadcrumb-only progress does not lift public score
306
- Tool: Bash
307
- Steps: Build a grader state with evidence collected but no verified containment/recovery.
308
- Expected: score remains low and below resolved benchmark thresholds.
309
- Evidence: .sisyphus/evidence/task-5-no-breadcrumb-public-score.txt
310
-
311
- Scenario: Verified containment and recovery dominate score
312
- Tool: Bash
313
- Steps: Compare partial state vs fully recovered/verified state in grader.
314
- Expected: fully recovered score > partial score.
315
- Evidence: .sisyphus/evidence/task-5-public-score-compare.txt
316
- ```
317
-
318
- **Commit**: YES | Message: `refactor(grader): remove breadcrumb terms from public score` | Files: `unified_incident_env/server/grader.py`, tests
319
-
320
- - [ ] 6. Add scenario-authored reward metadata and critical-path weights
321
-
322
- **What to do**: Extend each scenario in `server/challenge.py` with deterministic critical-path service weights and reward metadata used by Tasks 3–5. Ensure these weights are scenario-local and normalized.
323
- **Must NOT do**: Do not infer weights dynamically from evidence or runtime guesses.
324
-
325
- **Recommended Agent Profile**:
326
- - Category: `quick`
327
- - Skills: `[]`
328
- - Omitted: `[omarchy]`
329
-
330
- **Parallelization**: Can Parallel: YES | Wave 2 | Blocks: 4,5 | Blocked By: 1
331
-
332
- **References**:
333
- - Pattern: `unified_incident_env/server/challenge.py:96-199,284-403,486-610`
334
-
335
- **Acceptance Criteria**:
336
- - [ ] Every scenario includes valid reward metadata.
337
- - [ ] Hard scenario weights emphasize worker/database path appropriately.
338
- - [ ] Tests verify normalization and required keys.
339
-
340
- **QA Scenarios**:
341
- ```
342
- Scenario: Scenario reward metadata validates
343
- Tool: Bash
344
- Steps: Import all scenarios and validate reward metadata shape.
345
- Expected: Exit 0; all scenarios satisfy schema.
346
- Evidence: .sisyphus/evidence/task-6-scenario-metadata.txt
347
-
348
- Scenario: Weight normalization is enforced
349
- Tool: Bash
350
- Steps: Sum critical_service_weights for each scenario.
351
- Expected: Each sum == 1.0 within tolerance.
352
- Evidence: .sisyphus/evidence/task-6-weight-sums.txt
353
- ```
354
-
355
- **Commit**: YES | Message: `feat(challenge): add critical-path service weights for reward shaping` | Files: `unified_incident_env/server/challenge.py`, tests
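
The normalization check in the QA scenarios above could be sketched like this. `validate_weights` and the example scenario ids and weights are hypothetical; the real scenarios live in `server/challenge.py`.

```python
from __future__ import annotations

import math


def validate_weights(scenarios: dict[str, dict[str, float]], tol: float = 1e-9) -> list[str]:
    """Return the ids of scenarios whose critical_service_weights are not normalized."""
    bad = []
    for scenario_id, weights in scenarios.items():
        ok = (
            bool(weights)
            and all(w >= 0 for w in weights.values())
            and math.isclose(sum(weights.values()), 1.0, abs_tol=tol)
        )
        if not ok:
            bad.append(scenario_id)
    return bad


# Hypothetical metadata: the second entry sums to 0.8 and should be flagged.
example = {
    "easy": {"api": 0.6, "database": 0.4},
    "hard": {"worker": 0.5, "database": 0.3},
}
invalid = validate_weights(example)
```

Comparing with a tolerance rather than `== 1.0` avoids false failures from floating-point rounding when a scenario has many weighted services.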
356
-
357
- - [ ] 7. Update trainer prompt/schema generation for structured hypotheses
358
-
359
- **What to do**: Update `trainer/prompts.py` and parser-adjacent tests so `classify_vulnerability` examples and required fields include `affected_services`, `confidence`, and `recommended_next_action`. Fix the verification-stage mismatch explicitly if still present after schema changes.
360
- **Must NOT do**: Do not leak teacher actions into runtime prompts.
361
-
362
- **Recommended Agent Profile**:
363
- - Category: `quick`
364
- - Skills: `[]`
365
- - Omitted: `[omarchy]`
366
-
367
- **Parallelization**: Can Parallel: YES | Wave 3 | Blocks: 8,9 | Blocked By: 2
368
-
369
- **References**:
370
- - Pattern: `unified_incident_env/trainer/prompts.py:96-148,385-434`
371
- - Pattern: `unified_incident_env/tests/test_trainer.py:229-253`
372
-
373
- **Acceptance Criteria**:
374
- - [ ] Runtime prompt examples for `classify_vulnerability` include the structured hypothesis payload.
375
- - [ ] `strict` and `lenient` behavior remain meaningfully distinct.
376
- - [ ] Verification-stage action table is internally consistent across environment and prompt schema.
377
-
378
- **QA Scenarios**:
379
- ```
380
- Scenario: Prompt shows structured hypothesis example
381
- Tool: Bash
382
- Steps: Build a runtime request in security_subquest stage.
383
- Expected: User prompt contains hypothesis fields and valid JSON example.
384
- Evidence: .sisyphus/evidence/task-7-prompt-hypothesis.txt
385
-
386
- Scenario: Strict mode remains stricter
387
- Tool: Bash
388
- Steps: Compare strict and lenient runtime requests with correction memory text.
389
- Expected: strict omits lenient correction hints.
390
- Evidence: .sisyphus/evidence/task-7-strict-vs-lenient.txt
391
- ```
392
-
393
- **Commit**: YES | Message: `feat(trainer): prompt structured vulnerability hypotheses` | Files: `unified_incident_env/trainer/prompts.py`, tests
394
-
395
- - [ ] 8. Update inference fallback and schema handling for structured hypotheses
396
-
397
- **What to do**: Update `inference.py` so structured hypothesis payloads are generated, parsed, and repaired consistently. Keep the already-fixed verification-failure fallback behavior intact.
398
- **Must NOT do**: Do not reintroduce heuristic loops that bypass the new structured contract.
399
-
400
- **Recommended Agent Profile**:
401
- - Category: `unspecified-high`
402
- - Skills: `[]`
403
- - Omitted: `[omarchy]`
404
-
405
- **Parallelization**: Can Parallel: YES | Wave 3 | Blocks: 12 | Blocked By: 2,7
406
-
407
- **References**:
408
- - Pattern: `inference.py:279-472,475-568,865-905,996-1094,1190-1241`
409
- - Pattern: `unified_incident_env/tests/test_submission_inference.py:99-166,205-355`
410
-
411
- **Acceptance Criteria**:
412
- - [ ] Fallback classification outputs valid structured hypotheses.
413
- - [ ] Repeated verification failures still return to patching.
414
- - [ ] Submission inference tests cover malformed hypothesis payloads.
415
-
416
- **QA Scenarios**:
417
- ```
418
- Scenario: Fallback builds structured hypothesis
419
- Tool: Bash
420
- Steps: Build fallback action in security_subquest before patching.
421
- Expected: classify_vulnerability action includes services, confidence, and next action fields.
422
- Evidence: .sisyphus/evidence/task-8-fallback-hypothesis.txt
423
-
424
- Scenario: Verification failure still re-patches
425
- Tool: Bash
426
- Steps: Reproduce failed verification state.
427
- Expected: narrowed actions and fallback choose apply_patch, not re-verify.
428
- Evidence: .sisyphus/evidence/task-8-repatch-after-failed-verify.txt
429
- ```
430
-
431
- **Commit**: YES | Message: `feat(inference): emit structured hypotheses and preserve safe fallback` | Files: `inference.py`, tests
432
-
433
- - [ ] 9. Update baselines and walkthroughs for new hypothesis payload
434
-
435
- **What to do**: Update `scripts/baseline_agent.py`, walkthroughs, and any deterministic sample flows so they emit the structured `classify_vulnerability` action. Keep exact scenario solutions intact.
436
- **Must NOT do**: Do not alter scenario truth or recovery order here.
437
-
438
- **Recommended Agent Profile**:
439
- - Category: `quick`
440
- - Skills: `[]`
441
- - Omitted: `[omarchy]`
442
-
443
- **Parallelization**: Can Parallel: YES | Wave 3 | Blocks: 10 | Blocked By: 2,7
444
-
445
- **References**:
446
- - Pattern: `unified_incident_env/scripts/baseline_agent.py`
447
- - Pattern: `unified_incident_env/scripts/walkthrough.py`
448
- - Pattern: `unified_incident_env/tests/test_environment.py` happy-path helpers.
449
-
450
- **Acceptance Criteria**:
451
- - [ ] Baseline agent still solves all three scenarios.
452
- - [ ] Structured hypothesis payload appears in the baseline classify step.
453
-
454
- **QA Scenarios**:
455
- ```
456
- Scenario: Baseline still solves preset pack
457
- Tool: Bash
458
- Steps: Run the baseline walkthrough or equivalent deterministic script/tests.
459
- Expected: All scenarios resolve successfully.
460
- Evidence: .sisyphus/evidence/task-9-baseline-solves.txt
461
-
462
- Scenario: Baseline classify step is structured
463
- Tool: Bash
464
- Steps: Print the classify_vulnerability payload from the baseline plan.
465
- Expected: Includes new hypothesis fields.
466
- Evidence: .sisyphus/evidence/task-9-baseline-structured-hypothesis.txt
467
- ```
468
-
469
- **Commit**: YES | Message: `refactor(baseline): emit structured classification hypotheses` | Files: baseline/walkthrough/tests
470
-
471
- - [ ] 10. Add reward decomposition and anti-breadcrumb regression tests
472
-
473
- **What to do**: Add deterministic environment tests proving query/evidence actions no longer receive positive breadcrumb rewards and that repeated hypotheses do not farm reward.
474
- **Must NOT do**: Do not rely on broad “final score looks okay” assertions alone.
475
-
476
- **Recommended Agent Profile**:
477
- - Category: `unspecified-high`
478
- - Skills: `[]`
479
- - Omitted: `[omarchy]`
480
-
481
- **Parallelization**: Can Parallel: YES | Wave 4 | Blocks: Final verification | Blocked By: 4,9
482
-
483
- **References**:
484
- - Pattern: `unified_incident_env/tests/test_environment.py:205-232,235-372`
485
-
486
- **Acceptance Criteria**:
487
- - [ ] Pure evidence gathering has no positive breadcrumb reward.
488
- - [ ] Duplicate hypothesis submissions gain at most one bonus.
489
- - [ ] Harmful actions are negative.
490
-
491
- **QA Scenarios**:
492
- ```
493
- Scenario: Duplicate hypothesis bonus is one-time only
494
- Tool: Bash
495
- Steps: Submit same classify_vulnerability payload twice in a deterministic scenario.
496
- Expected: First bonus sign as designed; second bonus == 0 or negative cost only.
497
- Evidence: .sisyphus/evidence/task-10-hypothesis-dedupe.txt
498
-
499
- Scenario: Evidence-only step is non-positive
500
- Tool: Bash
501
- Steps: Reset then perform one diagnostic query.
502
- Expected: reward <= 0.
503
- Evidence: .sisyphus/evidence/task-10-evidence-nonpositive.txt
504
- ```
505
-
506
- **Commit**: YES | Message: `test(rewards): add anti-breadcrumb and hypothesis-dedupe regressions` | Files: environment tests
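
One way the one-time, non-farmable hypothesis bonus from Task 10 might be enforced is a small per-episode ledger. The class name and the bonus and penalty magnitudes are assumptions, not values from the repository.

```python
class HypothesisLedger:
    """Hypothetical per-episode tracker that pays the hypothesis term at most once."""

    def __init__(self, bonus: float = 0.05, penalty: float = 0.05):
        self._bonus = bonus
        self._penalty = penalty
        self._scored = False

    def score(self, correct: bool) -> float:
        """Return the hypothesis term for this step; zero after the first submission."""
        if self._scored:
            return 0.0   # repeats still pay the ordinary step cost elsewhere
        self._scored = True
        return self._bonus if correct else -self._penalty


ledger = HypothesisLedger()
first = ledger.score(correct=True)    # bonus is granted once
second = ledger.score(correct=True)   # the same payload again earns nothing
```

Resetting the ledger in the environment's `reset()` would keep the bonus one-per-episode rather than one-per-process.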
507
-
508
- - [ ] 11. Add reward/public-score drift regression checks
-
- **What to do**: Create fixed-scenario comparisons proving that policies improving training reward also improve or at least align with public deterministic score ordering. Compare bad, partial, and good trajectories.
- **Must NOT do**: Do not require exact equality between training reward sums and final score.
-
- **Recommended Agent Profile**:
- - Category: `deep`
- - Skills: `[]`
- - Omitted: `[omarchy]`
-
- **Parallelization**: Can Parallel: YES | Wave 4 | Blocks: Final verification | Blocked By: 4,5
-
- **References**:
- - Pattern: `unified_incident_env/server/grader.py`
- - Pattern: `unified_incident_env/server/environment.py`
- - Pattern: existing happy/trap path tests in `tests/test_environment.py`.
-
- **Acceptance Criteria**:
- - [ ] Good trajectory > partial trajectory > harmful trajectory in public score.
- - [ ] Good trajectory accumulates better training reward than harmful trajectory.
- - [ ] No scenario shows breadcrumb-only trajectories outranking true containment/recovery.
-
- **QA Scenarios**:
- ```
- Scenario: Reward/public-score ordering aligns
- Tool: Bash
- Steps: Execute scripted bad, partial, and good trajectories for a fixed scenario.
- Expected: reward/public-score ordering is monotonic in the desired direction.
- Evidence: .sisyphus/evidence/task-11-ordering.txt
-
- Scenario: Breadcrumb trajectory cannot win
- Tool: Bash
- Steps: Run a query-heavy but unrecovered trajectory.
- Expected: Its public score and reward stay below a truly recovered trajectory.
- Evidence: .sisyphus/evidence/task-11-no-breadcrumb-win.txt
- ```
-
- **Commit**: YES | Message: `test(rewards): add reward-vs-public-score ordering checks` | Files: environment/grader tests
-
- - [ ] 12. Document Colab/Kaggle GRPO usage with the new reward semantics
-
- **What to do**: Update docs/runbooks so training happens on Colab/Kaggle while the environment runs locally or via Docker. Explain the separation between training reward and public deterministic benchmark score, and point to the exact verification commands.
- **Must NOT do**: Do not leave the old reward explanation in README/execution docs.
-
- **Recommended Agent Profile**:
- - Category: `writing`
- - Skills: `[]`
- - Omitted: `[omarchy]`
-
- **Parallelization**: Can Parallel: YES | Wave 4 | Blocks: Final verification | Blocked By: 8
-
- **References**:
- - Pattern: `README.md`, `execution.md`, any training docs in repo.
- - External: `https://huggingface.co/docs/trl/en/openenv` - OpenEnv+TRL integration.
-
- **Acceptance Criteria**:
- - [ ] Docs explain training reward vs public score distinction.
- - [ ] Docs list the exact local test commands.
- - [ ] Docs specify Colab/Kaggle training and local/docker env execution.
-
- **QA Scenarios**:
- ```
- Scenario: Docs mention reward/public-score split
- Tool: Bash
- Steps: Grep updated docs for training reward, public score, and verification commands.
- Expected: All required topics present.
- Evidence: .sisyphus/evidence/task-12-doc-grep.txt
-
- Scenario: Docs commands are runnable
- Tool: Bash
- Steps: Execute at least one documented local verification command.
- Expected: Exit 0.
- Evidence: .sisyphus/evidence/task-12-doc-command.txt
- ```
-
- **Commit**: YES | Message: `docs(rewards): document shaping semantics and training workflow` | Files: docs/readme/runbooks
-
- ## Final Verification Wave (MANDATORY — after ALL implementation tasks)
- > 4 review agents run in PARALLEL. ALL must APPROVE. Present consolidated results to user and get explicit "okay" before completing.
- > **Do NOT auto-proceed after verification. Wait for user's explicit approval before marking work complete.**
- > **Never mark F1-F4 as checked before getting user's okay.** Rejection or user feedback -> fix -> re-run -> present again -> wait for okay.
- - [ ] F1. Plan Compliance Audit — oracle
- - [ ] F2. Code Quality Review — unspecified-high
- - [ ] F3. Real Manual QA — unspecified-high (+ playwright if UI)
- - [ ] F4. Scope Fidelity Check — deep
-
- ## Commit Strategy
- - Commit 1: reward config + scenario metadata
- - Commit 2: structured hypothesis schema
- - Commit 3: health potential helpers
- - Commit 4: environment reward rewrite
- - Commit 5: grader rewrite
- - Commit 6: prompt/inference/baseline contract updates
- - Commit 7: regression tests + docs
-
- ## Success Criteria
- - Training reward is driven by world-state improvement, not breadcrumb discovery.
- - Public deterministic benchmark score no longer rewards evidence-count collection or raw patch-id guessing.
- - `classify_vulnerability` supports calibrated, non-farmable hypothesis scoring.
- - Query/evidence/unlock actions are not directly profitable.
- - Verified containment + verified recovery dominate both reward and public score ordering.
- - All tests and deterministic regression checks pass.
README.md CHANGED
@@ -1,74 +1,117 @@
- # SRE Engineer LLM (v2): The Honest SRE Simulator

- `sre-engineer-llm` is a high-fidelity Reinforcement Learning (RL) environment designed to train and evaluate AI agents on **Site Reliability Engineering (SRE)** and **Incident Response**.

- Unlike traditional "scripted" environments, this benchmark uses an honest, world-state-based simulation where agents must diagnose, mitigate, and resolve production outages without "cheating" through prompt oracles or hardcoded rails.

- ## 🚀 Key Features

- - **Honest Simulation:** No stage-locks or hidden oracles. All actions are available at all times.
- - **State-Based Transitions:** Remediation actions (like `rollback_deploy` or `restart_service`) directly affect the health metrics of the simulated services.
- - **Verification Driven:** Agents must explicitly run health checks (`run_check`) to verify recovery before declaring an incident resolved.
- - **Realistic SRE Stack:** Includes queries for logs, metrics, dependencies, and deployment history across a microservices topology.
- - **Deterministic Grading:** A transparent scoring system based on final system health, user impact, and operational efficiency.

- ## 🛠 Action Space

- The agent has access to 11 discrete SRE tools:

- | Action | Description |
- | :--- | :--- |
- | `query_logs` | Inspect service-level error logs and traces. |
- | `query_metrics` | Retrieve CPU, Memory, or Latency data. |
- | `query_dependencies` | Map upstream and downstream service links. |
- | `query_deploys` | Check the deployment history for recent changes. |
- | `rollback_deploy` | Revert a service to its previous stable version. |
- | `restart_service` | Reboot a crashed or degraded service. |
- | `isolate_service` | Cut traffic to a service to contain blast radius. |
- | `submit_hypothesis` | Record a calibrated guess of the root cause. |
- | `run_check` | Execute a health/verification check on the system. |
- | `declare_resolved` | Finalize the incident after recovery is verified. |
- | `escalate` | Request expert attention (no-op in simulation). |

- ## 📁 Project Structure

- - `unified_incident_env/`
-   - `server/`: The FastAPI-based environment server.
-     - `environment.py`: Core simulator logic and world-state transitions.
-     - `challenge.py`: Scenario catalog and baseline definitions.
-     - `grader.py`: Deterministic scoring and reporting logic.
-   - `models.py`: Pydantic schemas for Actions, Observations, and State.
-   - `client.py`: Typed client for interacting with the environment.
- - `inference.py`: Standard entrypoint for LLM-based agent evaluation.
- - `run_demo.py`: End-to-end script to run the server and the baseline agent.

- ## 🚦 Quick Start

- ### 1. Install Dependencies
  ```bash
- uv venv
- source .venv/bin/activate
- uv pip install -e .
  ```

- ### 2. Run the Benchmark Demo
- This script launches the local server and executes the optimal "baseline" trajectory:
  ```bash
- python run_demo.py
  ```

- ### 3. Run Tests
  ```bash
- pytest unified_incident_env/tests -q
  ```

- ## 📊 Scoring Breakdown

- Success is measured across four primary dimensions:
- 1. **Recovery (45%):** Is the end-to-end system healthy and the cause removed?
- 2. **Security/Mitigation (35%):** Was the correct remediation target identified and fixed?
- 3. **Efficiency (10%):** Did the agent solve the incident within the tick budget without wasteful actions?
- 4. **Verification (10%):** Were all health checks passed before resolution?

- ## 📝 License
- This project is licensed under the MIT License.

+ ---
+ title: SRE Gym
+ emoji: 🚨
+ colorFrom: red
+ colorTo: yellow
+ sdk: docker
+ app_port: 8000
+ pinned: false
+ license: apache-2.0
+ ---

+ # sre-gym Fault-injecting SRE training env for OpenEnv

+ Most SRE agent skills are runbooks and good intentions. **sre-gym** is the other half: a fault-injecting environment with deterministic grading where an agent diagnoses a real production-style incident, chooses a safe remediation, verifies recovery, and declares resolved. Every run is scored the same way twice.

+ - Spec-compliant OpenEnv environment (typed Pydantic action / observation / state, `reset` / `step` / `state`, `openenv validate` green).
+ - 3 curriculum scenarios — easy, medium, hard — with decoy services and causal dependencies.
+ - 11 bounded actions. Honest state transitions. No hidden oracles.
+ - 21 tests passing.
+ - Ships a Claude Code skill + verified-runbook loop — successful solves write markdown runbooks that the next run reads back.

+ ## 30-second demo

+ ```bash
+ ./demo/run_demo.sh
+ ```

+ Starts the env, solves each scenario cold, writes a runbook for each, re-solves to prove the loop. Full transcript takes ~10 seconds.

+ ## Curriculum

+ | Difficulty | Scenario | Story | Decoy | Correct path |
+ |---|---|---|---|---|
+ | easy | `worker_deploy_cascade` | Bad worker deploy → DB crash-loop → login 502s | — | rollback worker → restart db → verify → resolve |
+ | medium | `db_config_rollout` | DB config push shrank connection pool from 80→12 | recent worker deploy | rollback **db** → restart db → verify → resolve |
+ | hard | `gateway_auth_rollout` | Gateway auth-middleware rollout rejects valid logins | recent worker deploy | rollback **gateway** → verify → resolve (no restart) |

+ Rolling back the wrong service returns a negative reward and `failure_type="wrong_remediation_target"`. Restarting before the cause is removed re-inherits the bad state. `declare_resolved` is rejected until the scenario's resolution check passes against the actual world model.

+ ## Install

  ```bash
+ # 1. Create a venv and install
+ python3 -m venv .venv && source .venv/bin/activate
+ pip install -e '.[dev]'
+
+ # 2. Start the env
+ uvicorn server.app:app --host 127.0.0.1 --port 8000
+
+ # 3. Run the baseline inference against it
+ export HF_TOKEN="…"; export ENV_BASE_URL=http://127.0.0.1:8000
+ python inference.py
  ```

+ ## Install the Claude Code skill
+
  ```bash
+ ln -s "$PWD/skill" "$HOME/.claude/skills/sre-gym"
+ ```
+
+ Then, in Claude Code, ask: *"Solve the db_config_rollout scenario in sre-gym."* The skill will drive the env via `skill/tools/sre_gym_client.py`, load any existing runbook from `skill/verified-runbooks/`, and append a fresh runbook on any clean solve (score > 0.85).
+
+ ## Architecture
+
+ ```
+ ┌────────────────────┐      HTTP / WS      ┌──────────────────────┐
+ │   Claude Code      │ ──────────────────▶ │   OpenEnv server     │
+ │   (with sre-gym    │ ◀────────────────── │   (FastAPI, uvicorn) │
+ │    skill loaded)   │     obs, reward     │ unified_incident_env │
+ └────────────────────┘                     └──────────────────────┘
+           │                                           ▲
+           ▼  on clean solve (score > 0.85)            │
+ ┌────────────────────┐                                │
+ │ verified-runbooks/ │ ────── loaded at skill load ───┘
+ │       *.md         │
+ └────────────────────┘
  ```

+ ## Scoring
+
+ Deterministic, 5 dimensions, sums to a public score in `[0.01, 0.99]`:
+
+ - **Recovery** (0–0.4): critical-path services healthy
+ - **Containment** (0–0.3): root cause removed or offending service isolated
+ - **Verification** (0–0.35): `database_recovery` + `end_to_end` checks passed
+ - **Impact** (0–0.15): user-impact reduced
+ - **Efficiency** (0–0.10): budget preserved, no wasteful repeats
+
+ Target **> 0.85** for "clean solve." That's also the runbook-record threshold.
+
+ ## Repo layout
+
+ ```
+ unified_incident_env/   # env core: models, environment, grader, challenge, tests
+ server/                 # OpenEnv entrypoint wrapper
+ skill/                  # Claude Code skill: SKILL.md, tools/, verified-runbooks/
+ demo/                   # run_demo.sh + pitch.md
+ inference.py            # OpenAI-client baseline for OpenEnv hackathon submission
+ openenv.yaml            # OpenEnv manifest
+ Dockerfile              # HF Space deployment
+ ```
+
+ ## Verify
+
  ```bash
+ pytest unified_incident_env/tests -q   # 21 tests
+ python -m openenv.cli validate .       # OpenEnv manifest check
+ docker build -t sre-engineer-llm:v2 .  # HF Space image
  ```

+ ## Roadmap v2
+
+ Distill the accumulated `verified-runbooks/` corpus into a local 3B reviewer via [OpenClaw-RL](https://github.com/Gen-Verse/OpenClaw-RL)'s async GRPO-on-next-state loop. Same reward contract (`run_check` passes / `failure_type` absent), same grader, but a compact policy that runs without a frontier API.

+ ## License

+ Apache 2.0
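
The Scoring section of the new README lists five capped dimensions and a public score confined to `[0.01, 0.99]`. The sketch below shows one way such a score could be assembled, assuming each dimension is clipped to its cap and the sum is then clamped. The names `CAPS`, `public_score`, and `is_clean_solve` are hypothetical, and the actual `grader.py` may combine the dimensions differently.

```python
# Hypothetical sketch, not the repo's grader: caps are copied from the
# README's Scoring list; the clip-then-clamp rule is an assumption.
CAPS = {
    "recovery": 0.40,
    "containment": 0.30,
    "verification": 0.35,
    "impact": 0.15,
    "efficiency": 0.10,
}


def public_score(parts: dict[str, float]) -> float:
    """Clip each dimension to [0, cap], sum, then clamp to [0.01, 0.99]."""
    total = 0.0
    for name, cap in CAPS.items():
        value = parts.get(name, 0.0)
        total += min(max(value, 0.0), cap)
    return round(min(max(total, 0.01), 0.99), 4)


def is_clean_solve(score: float, threshold: float = 0.85) -> bool:
    """README threshold: a clean solve needs a score strictly above 0.85."""
    return score > threshold
```

The listed caps add up to more than 0.99, so under this assumption the final clamp is what holds a perfect run at 0.99.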
 
demo/pitch.md ADDED
@@ -0,0 +1,49 @@
+ # sre-gym — 60-second pitch
+
+ > You can't train SRE agents on production. We built the gym.
+
+ ## The story (00:00–01:00)
+
+ **[0:00–0:10 · Hook]** "Most SRE agent skills are prompts — a runbook and a good intention. We built the other half: a fault-injecting environment with deterministic grading, where every run is scored the same way twice."
+
+ **[0:10–0:25 · What it is]**
+ - OpenEnv-compliant. `openenv validate` passes.
+ - Three curriculum scenarios, easy → hard:
+   - **easy** `worker_deploy_cascade` — bad worker deploy cascades to a DB crash.
+   - **medium** `db_config_rollout` — DB config shrank the connection pool; a recent worker deploy is a decoy.
+   - **hard** `gateway_auth_rollout` — bad auth-middleware rollout; two plausible suspects, one right answer.
+ - 11 bounded actions, honest state transitions (rolling back the wrong thing *fails*), deterministic grader across recovery / containment / verification / impact / efficiency.
+ - 21 tests passing. One public Space URL.
+
+ **[0:25–0:55 · Live demo]** `./demo/run_demo.sh`
+ - Env starts. Three scenarios visible in `/tasks`.
+ - Runbook dir cleared; demo starts cold.
+ - Each scenario solves end-to-end (score ≈ 0.99, 8–10 steps).
+ - A markdown runbook is written per scenario from the successful trace.
+ - Re-solve the easy scenario — this time the skill loads the runbook first. Same score, same path, zero wasted investigation.
+ - Point to `skill/verified-runbooks/` — "Every clean solve makes the next one deterministic. No GRPO required for v1."
+
+ **[0:55–1:00 · Close]** "Install the skill by symlinking `skill/` into `~/.claude/skills/sre-gym`. Open source, Apache 2. v2 is the OpenClaw-RL loop — distill this corpus of verified runbooks into a local 3B reviewer."
+
+ ## The one technical claim you should be ready to defend
+
+ > "The env is honest."
+
+ - No hidden oracles. Rolling back the wrong service returns a negative reward and `failure_type="wrong_remediation_target"` — same observation contract as any other action.
+ - `declare_resolved` is rejected until the scenario's `resolution_check` passes, verified by actual service states in the world model, not a flag the grader peeks at.
+ - Rewards reward *effects*, not evidence-gathering — you can't farm the env by spamming `query_logs`.
+ - `restart_service` on the database before the root cause is removed returns a negative reward. Always. Because in the real world, it would crash again.
+
+ ## Judge Q&A cheat sheet
+
+ **"How is this different from running a real staging env?"**
+ Deterministic scoring. Every agent gets graded against the same signatures, same decoys, same tick budget. You can't do that on real infra.
+
+ **"Why only three scenarios?"**
+ Three clears the hackathon DQ gate (`easy/medium/hard`). Each has a decoy + causal chain — building another one is a data-entry exercise, not a design one. Adding scenarios #4–#20 is the v2 data scaling lane.
+
+ **"Why runbooks instead of GRPO?"**
+ For this submission, GRPO means 48 hours of training convergence risk on top of an env we just shipped. Markdown runbooks demonstrate the same loop (verified signal → persisted artefact → next run improves) in an auditable form. The GRPO wiring slots on top of the same traces when we're ready.
+
+ **"What's the skill actually doing at runtime?"**
+ The skill lives in `skill/SKILL.md`. It directs Claude (or any agent) to read `verified-runbooks/{scenario}.md` before the first action, drive the env through `skill/tools/sre_gym_client.py`, and append a fresh runbook on any solve with `final_score > 0.85`.
demo/run_demo.sh ADDED
@@ -0,0 +1,70 @@
+ #!/usr/bin/env bash
+ # sre-gym end-to-end demo.
+ # Spins up the env (or reuses a running one), solves each of the 3 scenarios
+ # with the baseline policy, records runbooks, shows the artefacts.
+ #
+ # Requires: python3.10+, docker (for the HF-Space-equivalent image) OR the
+ # repo's .venv. Defaults to .venv if present.
+
+ set -euo pipefail
+ cd "$(dirname "$0")/.."
+
+ PORT="${PORT:-8013}"
+ URL="http://127.0.0.1:${PORT}"
+ PY="${PYTHON:-.venv/bin/python}"
+ RUNBOOK_DIR="skill/verified-runbooks"
+
+ banner() { printf '\n\033[1;36m== %s ==\033[0m\n' "$*"; }
+ ok() { printf '\033[0;32m ✓ %s\033[0m\n' "$*"; }
+
+ banner "0 / preflight"
+ if [[ ! -x "$PY" ]]; then
+   echo " note: $PY not found, falling back to system python3" >&2
+   PY="python3"
+ fi
+ "$PY" -c "import unified_incident_env" 2>/dev/null || {
+   echo " error: unified_incident_env not importable; run 'pip install -e .' first" >&2
+   exit 1
+ }
+ ok "python + package ready"
+
+ banner "1 / start env"
+ if curl -sf "$URL/health" > /dev/null 2>&1; then
+   ok "env already running on $URL"
+   SERVER_STARTED=0
+ else
+   "$PY" -m uvicorn server.app:app --host 127.0.0.1 --port "$PORT" > /tmp/sre_gym_demo.log 2>&1 &
+   SERVER_PID=$!
+   SERVER_STARTED=1
+   for _ in $(seq 1 20); do
+     if curl -sf "$URL/health" > /dev/null 2>&1; then break; fi
+     sleep 0.3
+   done
+   curl -sf "$URL/health" > /dev/null || { echo " error: env failed to start" >&2; cat /tmp/sre_gym_demo.log >&2; exit 1; }
+   ok "env started on $URL (pid $SERVER_PID)"
+ fi
+ trap '[[ ${SERVER_STARTED:-0} -eq 1 ]] && kill ${SERVER_PID:-0} 2>/dev/null || true' EXIT
+
+ banner "2 / available scenarios"
+ SRE_GYM_URL="$URL" "$PY" skill/tools/sre_gym_client.py list
+
+ banner "3 / clear prior runbooks (demo starts cold)"
+ rm -f "$RUNBOOK_DIR"/*.md
+ ok "runbook directory cleared"
+
+ for scenario in worker_deploy_cascade db_config_rollout gateway_auth_rollout; do
+   banner "4 / solve: $scenario"
+   SRE_GYM_URL="$URL" "$PY" skill/tools/sre_gym_client.py solve "$scenario"
+   SRE_GYM_URL="$URL" "$PY" skill/tools/sre_gym_client.py record-runbook "$scenario"
+ done
+
+ banner "5 / verified runbooks now on disk"
+ ls -1 "$RUNBOOK_DIR"/*.md | sed 's|^| |'
+
+ banner "6 / re-solve easy scenario — runbook is loaded this time"
+ SRE_GYM_URL="$URL" "$PY" skill/tools/sre_gym_client.py solve worker_deploy_cascade | tail -4
+
+ banner "done"
+ echo " install the skill globally: ln -s \"$PWD/skill\" \"\$HOME/.claude/skills/sre-gym\""
+ echo " env log: /tmp/sre_gym_demo.log"
+ echo " runbooks: $RUNBOOK_DIR/"
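
The startup step in `run_demo.sh` polls `/health` up to 20 times with a 0.3 s pause, then makes one final check before giving up. The same bounded-retry pattern could be sketched in Python as below. `wait_until_healthy` is a hypothetical helper, and the probe is injected as a callable (an assumption made so the logic can be exercised without a running server).

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 20,
    delay_s: float = 0.3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` up to `attempts` times, pausing between failed tries.

    Mirrors the shell loop: stop early on the first healthy probe, otherwise
    run one last probe after the loop and report its result.
    """
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay_s)
    return probe()
```

In real use the probe would wrap an HTTP GET against the env's `/health` endpoint. Passing `sleep` as a parameter keeps the helper fast to test.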
deploy/push_to_hf.sh ADDED
@@ -0,0 +1,58 @@
+ #!/usr/bin/env bash
+ # Deploy this repo to a Hugging Face Space (Docker SDK).
+ #
+ # Required:
+ #   HF_TOKEN      write-scoped HF access token
+ #   HF_SPACE_ID   e.g. yourname/sre-gym (create it at huggingface.co/new-space
+ #                 first, SDK=Docker, or let this script try to create it)
+ #
+ # Usage:
+ #   HF_TOKEN=hf_xxx HF_SPACE_ID=yourname/sre-gym ./deploy/push_to_hf.sh
+ #
+ # After a successful push, verify from a different network:
+ #   curl https://${space_subdomain}.hf.space/health
+ #   curl https://${space_subdomain}.hf.space/tasks | jq '.scenarios[].difficulty'
+
+ set -euo pipefail
+ cd "$(dirname "$0")/.."
+
+ : "${HF_TOKEN:?HF_TOKEN is required}"
+ : "${HF_SPACE_ID:?HF_SPACE_ID is required, e.g. yourname/sre-gym}"
+
+ if ! command -v huggingface-cli > /dev/null; then
+   echo "error: huggingface-cli not installed. pip install 'huggingface_hub[cli]'" >&2
+   exit 1
+ fi
+
+ echo "== syncing openenv.yaml with HF_SPACE_ID =="
+ python3 - <<PY
+ import pathlib, re
+ path = pathlib.Path("openenv.yaml")
+ text = path.read_text()
+ text = re.sub(r"^ space_id:.*$", f" space_id: $HF_SPACE_ID", text, flags=re.M)
+ path.write_text(text)
+ print(f"openenv.yaml space_id -> $HF_SPACE_ID")
+ PY
+
+ echo "== ensuring the space exists (idempotent) =="
+ huggingface-cli repo create "$HF_SPACE_ID" \
+   --type space \
+   --space_sdk docker \
+   --token "$HF_TOKEN" \
+   --yes 2>&1 | grep -v "already created" || true
+
+ echo "== uploading repo =="
+ huggingface-cli upload "$HF_SPACE_ID" . \
+   --repo-type space \
+   --token "$HF_TOKEN" \
+   --commit-message "deploy sre-gym v2 (easy/medium/hard scenarios)"
+
+ subdomain="$(echo "$HF_SPACE_ID" | tr '/' '-')"
+ echo
+ echo "== deployment kicked off =="
+ echo " Logs: https://huggingface.co/spaces/$HF_SPACE_ID"
+ echo " Public: https://$subdomain.hf.space"
+ echo
+ echo "== verify from a different network (phone hotspot) =="
+ echo " curl https://$subdomain.hf.space/health"
+ echo " curl https://$subdomain.hf.space/tasks | jq '.scenarios[].difficulty'"
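
The manifest-sync step in `push_to_hf.sh` rewrites the `space_id:` line of `openenv.yaml` with a regex, and the script later derives the Space subdomain by swapping `/` for `-`. A standalone version of both transformations might look like the sketch below. The function names are hypothetical, and matching any leading indentation is an assumption, since the exact whitespace in the original pattern is not recoverable from this listing.

```python
import re

# Matches a whole `space_id:` line, remembering its indentation (assumed).
_SPACE_ID_LINE = re.compile(r"^([ \t]*)space_id:.*$", flags=re.M)


def set_space_id(manifest_text: str, space_id: str) -> str:
    """Return the manifest text with every `space_id:` line pointing at space_id."""
    # A function replacement avoids backslash/group escapes in space_id.
    return _SPACE_ID_LINE.sub(
        lambda m: f"{m.group(1)}space_id: {space_id}", manifest_text
    )


def space_subdomain(space_id: str) -> str:
    """Same mapping as the script's `tr '/' '-'`."""
    return space_id.replace("/", "-")
```

Taking the space id as a function argument, rather than interpolating a shell variable into Python source through an unquoted heredoc, is a design choice of this sketch and not something the script does.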
openclaw_integration/README.md ADDED
@@ -0,0 +1,80 @@
+ # OpenClaw-RL integration — sre-gym shim
+
+ Plugs `sre-gym` into OpenClaw-RL's training loop without forking OpenClaw-RL.
+ Three artifacts:
+
+ - `pool_server.py` — FastAPI HTTP server speaking OpenClaw's lease-based
+   contract (`/allocate /reset /exec_tool /evaluate /close`). Wraps
+   `UnifiedIncidentEnvironment` behind per-lease `asyncio.Lock`s.
+ - `sre_env_client.py` — Drop-in replacement for OpenClaw-RL
+   `terminal-rl/env_client.py`. Same method signatures.
+ - `generate_with_sre.py` — Planned import-patch wrapper for
+   `terminal-rl/generate.py` (stub — filled in Friday when the OpenClaw-RL
+   venv is set up).
+
+ ## Quick start
+
+ ```bash
+ # 1. Launch the pool server
+ source .venv/bin/activate
+ uvicorn openclaw_integration.pool_server:app --host 0.0.0.0 --port 8100
+
+ # 2. Smoke-test the lifecycle from another shell
+ curl -sf http://127.0.0.1:8100/healthz | jq
+ curl -s -X POST http://127.0.0.1:8100/allocate \
+   -H 'content-type: application/json' \
+   -d '{"task_key": "gateway_auth_rollout"}'
+ ```
+
+ ## Wiring into OpenClaw-RL
+
+ In the OpenClaw-RL repo, after creating a fresh venv per their instructions,
+ point the rollout agent at our server:
+
+ ```bash
+ export ENV_SERVER_URL=http://127.0.0.1:8100
+ ```
+
+ Then patch one import in `OpenClaw-RL/terminal-rl/generate.py`:
+
+ ```diff
+ - from env_client import create_env_client
+ + import sys; sys.path.insert(0, "/path/to/sre-enginnerllm")
+ + from openclaw_integration.sre_env_client import create_env_client
+ ```
+
+ No other OpenClaw-RL source files need to change. The
+ `run_qwen35_4b_openclaw_rl.sh` launch script works as-is after that.
+
+ ## Task keys (scenarios)
+
+ - `worker_deploy_cascade` (easy)
+ - `db_config_rollout` (medium)
+ - `gateway_auth_rollout` (hard)
+
+ ## Lifecycle contract
+
+ ```
+ allocate(task_key)                  -> {ok: true, lease_id}
+ reset(lease_id, task_meta, run_ctx) -> {ok: true, observation: "<json>"}
+ exec_tool(lease_id, tool_call)      -> {ok: true, observation: "<json>"}
+ evaluate(lease_id)                  -> {ok: true, score: float}
+ close(lease_id)                     -> {ok: true}
+ ```
+
+ - `task_meta.scenario_id` takes precedence over `task_key` at reset time if
+   set (useful for procgen Friday).
+ - `tool_call.name` maps directly to `UnifiedIncidentAction.action_type`.
+ - `tool_call.arguments` is the kwargs dict (service, metric, check_name,
+   hypothesis).
+ - An invalid action is returned as an observation `{"error": "...",
+   "tool_call": {...}}` rather than raising — training gets the negative
+   signal without crashing the rollout.
+
+ ## Lease TTL / reaper
+
+ - `POOL_SERVER_LEASE_TTL_S` (default 600s) — lease idle timeout.
+ - `POOL_SERVER_REAPER_PERIOD` (default 30s) — reaper tick period.
+
+ Reaper runs in lifespan background task; evicts idle leases so long
+ training runs don't leak env instances.
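
The lifecycle contract in this README says an invalid action comes back as an `{"error": ..., "tool_call": ...}` observation instead of an exception. A minimal sketch of that behaviour follows. It assumes a small illustrative action set and an echo in place of a real env step; `KNOWN_ACTIONS`, `to_action`, and `exec_tool_response` are hypothetical names, and returning `ok: true` on the error path is also an assumption. The real server validates through `UnifiedIncidentAction`.

```python
import json

# Illustrative subset only; the real action set is defined by
# UnifiedIncidentAction.action_type in the repo.
KNOWN_ACTIONS = {
    "query_logs",
    "rollback_deploy",
    "restart_service",
    "run_check",
    "declare_resolved",
}


def to_action(tool_call: dict) -> dict:
    """Map tool_call.name/arguments to an action dict, or raise ValueError."""
    name = tool_call.get("name")
    if name not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action_type {name!r}")
    return {"action_type": name, **tool_call.get("arguments", {})}


def exec_tool_response(tool_call: dict) -> dict:
    """Build the response body; validation failures become observations."""
    try:
        action = to_action(tool_call)
    except ValueError as exc:
        payload = {"error": str(exc), "tool_call": tool_call}
    else:
        # Stand-in for stepping the env: echo the validated action.
        payload = {"action": action}
    return {"ok": True, "observation": json.dumps(payload)}
```

Keeping the error inside the observation string means a rollout loop can treat every response uniformly and let the reward signal penalise the bad call.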
openclaw_integration/__init__.py ADDED
@@ -0,0 +1,6 @@
+ """OpenClaw-RL integration shim for sre-gym.
+
+ This package exposes sre-gym through the lease-based HTTP contract used by
+ OpenClaw-RL's `terminal-rl/` and `swe-rl/` training loops, so the existing
+ OpenClaw-RL rollout+training scripts can target this env without code forks.
+ """
openclaw_integration/generate_with_sre.py ADDED
@@ -0,0 +1,27 @@
+ """Import-patch adapter for OpenClaw-RL's `terminal-rl/generate.py`.
+
+ STUB — filled in Friday when the OpenClaw-RL venv is set up.
+
+ The shape is minimal: OpenClaw-RL's `terminal-rl/generate.py` does
+ `from env_client import create_env_client`. All we need is to redirect that
+ import to our client. Two options, pick one Friday:
+
+ Option A: monkey-patch via PYTHONPATH + shim module
+     export PYTHONPATH="/path/to/sre-enginnerllm:$PYTHONPATH"
+     mkdir -p /tmp/openclaw_shim && cd /tmp/openclaw_shim
+     cat > env_client.py <<'PY'
+     from openclaw_integration.sre_env_client import create_env_client
+     PY
+     export PYTHONPATH="/tmp/openclaw_shim:$PYTHONPATH"
+
+ Option B: patch generate.py directly
+     sed -i 's|from env_client import create_env_client|from openclaw_integration.sre_env_client import create_env_client|' \
+         /path/to/OpenClaw-RL/terminal-rl/generate.py
+
+ Option A is reversible and cleaner. Option B is one line and survives a
+ pip install -e.
+
+ This file is intentionally empty beyond this docstring to keep the shim
+ surface area tiny. When Friday work begins, the actual adapter (if any is
+ needed beyond the import swap) lives here.
+ """
openclaw_integration/pool_server.py ADDED
@@ -0,0 +1,273 @@
+ """FastAPI pool server exposing sre-gym in OpenClaw-RL's lease-based contract.
+
+ OpenClaw-RL's rollout agent drives an env with this lifecycle per episode:
+
+     allocate(task_key) -> {lease_id}
+     reset(lease_id, task_meta, run_ctx)
+     exec_tool(lease_id, tool_call) -> observation_string  # repeated
+     evaluate(lease_id) -> score
+     close(lease_id)
+
+ We wrap a `UnifiedIncidentEnvironment` instance per lease. Lease state is
+ guarded by per-lease `asyncio.Lock` so 8-way concurrent rollouts on the same
+ server stay consistent. Idle leases are reaped after LEASE_TTL_S seconds.
+
+ Run standalone:
+     uvicorn openclaw_integration.pool_server:app --host 0.0.0.0 --port 8100
+
+ Env vars:
+     POOL_SERVER_LEASE_TTL_S    default 600
+     POOL_SERVER_REAPER_PERIOD  default 30
+ """
+
+ from __future__ import annotations
+
+ import asyncio
+ import json
+ import logging
+ import os
+ import sys
+ import time
+ import uuid
+ from contextlib import asynccontextmanager
+ from dataclasses import dataclass, field
+ from pathlib import Path
+ from typing import Any
+
+ from fastapi import FastAPI
+ from pydantic import BaseModel, Field
+
+ # Make the sibling package importable when launched via uvicorn from anywhere.
+ _REPO_ROOT = Path(__file__).resolve().parent.parent
+ if str(_REPO_ROOT) not in sys.path:
+     sys.path.insert(0, str(_REPO_ROOT))
+
+ from unified_incident_env.models import UnifiedIncidentAction  # noqa: E402
+ from unified_incident_env.server.challenge import SCENARIOS  # noqa: E402
+ from unified_incident_env.server.environment import UnifiedIncidentEnvironment  # noqa: E402
+
+ logger = logging.getLogger("sre_gym.pool_server")
+ logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
+
+ LEASE_TTL_S = float(os.getenv("POOL_SERVER_LEASE_TTL_S", "600"))
+ REAPER_PERIOD_S = float(os.getenv("POOL_SERVER_REAPER_PERIOD", "30"))
+
+
+ @dataclass
+ class Lease:
+     lease_id: str
+     task_key: str
+     env: UnifiedIncidentEnvironment
+     lock: asyncio.Lock = field(default_factory=asyncio.Lock)
+     last_touch: float = field(default_factory=time.time)
+     reset_done: bool = False
+     final_score: float | None = None
+
+     def touch(self) -> None:
+         self.last_touch = time.time()
+
+
+ class AllocateRequest(BaseModel):
+     task_key: str
+     request_id: str | None = None
+
+
+ class LeaseRequest(BaseModel):
+     lease_id: str
+
+
+ class ResetRequest(BaseModel):
+     lease_id: str
+     task_meta: dict[str, Any] = Field(default_factory=dict)
+     run_ctx: dict[str, Any] = Field(default_factory=dict)
+     task_timeouts: dict[str, Any] | None = None
+
+
+ class ToolCall(BaseModel):
+     name: str
+     arguments: dict[str, Any] = Field(default_factory=dict)
+
+
+ class ExecToolRequest(BaseModel):
+     lease_id: str
+     tool_call: ToolCall
+
+
+ class LeasePool:
+     def __init__(self) -> None:
+         self._leases: dict[str, Lease] = {}
+         self._dict_lock = asyncio.Lock()
+
+     async def allocate(self, task_key: str) -> Lease:
+         if task_key not in SCENARIOS:
+             raise ValueError(f"Unknown task_key {task_key!r}; known: {list(SCENARIOS)}")
+         env = UnifiedIncidentEnvironment()
+         lease = Lease(lease_id=str(uuid.uuid4()), task_key=task_key, env=env)
+         async with self._dict_lock:
+             self._leases[lease.lease_id] = lease
+         logger.info("allocate: lease=%s task=%s", lease.lease_id, task_key)
+         return lease
+
+     async def get(self, lease_id: str) -> Lease:
+         async with self._dict_lock:
+             lease = self._leases.get(lease_id)
+         if lease is None:
+             raise KeyError(f"Unknown lease {lease_id}")
+         lease.touch()
+         return lease
+
+     async def close(self, lease_id: str) -> bool:
+         async with self._dict_lock:
+             lease = self._leases.pop(lease_id, None)
+         if lease is None:
+             return False
+         logger.info("close: lease=%s task=%s", lease_id, lease.task_key)
+         return True
+
+     async def reap(self) -> int:
+         now = time.time()
+         stale: list[str] = []
+         async with self._dict_lock:
+             for lease_id, lease in list(self._leases.items()):
+                 if now - lease.last_touch > LEASE_TTL_S:
+                     stale.append(lease_id)
+             for lease_id in stale:
+                 self._leases.pop(lease_id, None)
+         if stale:
+             logger.info("reaper: evicted %d stale lease(s)", len(stale))
+         return len(stale)
+
+     def active_count(self) -> int:
+         return len(self._leases)
+
+
144
+ pool = LeasePool()
145
+
146
+
147
+ async def _reaper_loop() -> None:
148
+ while True:
149
+ try:
150
+ await pool.reap()
151
+ except Exception:
152
+ logger.exception("reaper loop tick failed")
153
+ await asyncio.sleep(REAPER_PERIOD_S)
154
+
155
+
156
+ @asynccontextmanager
157
+ async def lifespan(app: FastAPI):
158
+ task = asyncio.create_task(_reaper_loop())
159
+ try:
160
+ yield
161
+ finally:
162
+ task.cancel()
163
+ try:
164
+ await task
165
+ except asyncio.CancelledError:
166
+ pass
167
+
168
+
169
+ app = FastAPI(title="sre-gym OpenClaw pool server", lifespan=lifespan)
170
+
171
+
172
+ def _observation_string(obs: Any, *, reward: float | None = None) -> str:
173
+ """Render a UnifiedIncidentObservation as the single string OpenClaw
174
+ rollout agents expect from exec_tool."""
175
+ payload = {
176
+ "tick": obs.tick_count,
177
+ "workflow_stage": obs.workflow_stage,
178
+ "last_action_result": obs.last_action_result,
179
+ "tool_output": obs.tool_output,
180
+ "failure_type": obs.failure_type,
181
+ "why_failed": obs.why_failed,
182
+ "loop_warning": obs.loop_warning,
183
+ "reward": reward,
184
+ "checks": [{"name": c.name, "passed": c.passed} for c in obs.checks],
185
+ "active_alerts": [{"service": a.service, "severity": a.severity, "message": a.message} for a in obs.active_alerts],
186
+ "noise_alerts": [{"service": a.service, "severity": a.severity, "message": a.message} for a in obs.noise_alerts],
187
+ "service_health": {name: s.status for name, s in obs.service_health.items()},
188
+ "allowed_actions": obs.allowed_actions,
189
+ "required_fields_by_action": obs.required_fields_by_action,
190
+ "blast_radius": obs.blast_radius,
191
+ "final_score": obs.final_score,
192
+ "done": obs.done,
193
+ "prompt_text": obs.prompt_text,
194
+ }
195
+ return json.dumps(payload, separators=(",", ":"))
196
+
197
+
198
+ @app.get("/healthz")
199
+ async def healthz() -> dict[str, Any]:
200
+ return {"ok": True, "active_leases": pool.active_count(), "scenarios": list(SCENARIOS.keys())}
201
+
202
+
203
+ @app.post("/allocate")
204
+ async def allocate(request: AllocateRequest) -> dict[str, Any]:
205
+ try:
206
+ lease = await pool.allocate(request.task_key)
207
+ except ValueError as exc:
208
+ return {"ok": False, "error": str(exc)}
209
+ return {"ok": True, "lease_id": lease.lease_id, "task_key": lease.task_key, "request_id": request.request_id}
210
+
211
+
212
+ @app.post("/heartbeat")
213
+ async def heartbeat(request: LeaseRequest) -> dict[str, Any]:
214
+ try:
215
+ await pool.get(request.lease_id)
216
+ except KeyError as exc:
217
+ return {"ok": False, "error": str(exc)}
218
+ return {"ok": True}
219
+
220
+
221
+ @app.post("/reset")
222
+ async def reset(request: ResetRequest) -> dict[str, Any]:
223
+ try:
224
+ lease = await pool.get(request.lease_id)
225
+ except KeyError as exc:
226
+ return {"ok": False, "error": str(exc)}
227
+ async with lease.lock:
228
+ scenario_id = request.task_meta.get("scenario_id") or lease.task_key
229
+ obs = lease.env.reset(scenario_id=scenario_id)
230
+ lease.reset_done = True
231
+ lease.final_score = None
232
+ return {"ok": True, "observation": _observation_string(obs)}
233
+
234
+
235
+ @app.post("/exec_tool")
236
+ async def exec_tool(request: ExecToolRequest) -> dict[str, Any]:
237
+ try:
238
+ lease = await pool.get(request.lease_id)
239
+ except KeyError as exc:
240
+ return {"ok": False, "error": str(exc)}
241
+ if not lease.reset_done:
242
+ return {"ok": False, "error": "reset has not been called for this lease"}
243
+
244
+ action_kwargs = {"action_type": request.tool_call.name, **request.tool_call.arguments}
245
+ try:
246
+ action = UnifiedIncidentAction(**action_kwargs)
247
+ except Exception as exc:
248
+ # Return the validation error to the rollout agent as a no-op
249
+ # observation so training sees the failure signal without crashing.
250
+ return {"ok": True, "observation": json.dumps({"error": f"invalid action: {exc}", "tool_call": request.tool_call.model_dump()})}
251
+
252
+ async with lease.lock:
253
+ obs = lease.env.step(action)
254
+ lease.final_score = float(obs.final_score)
255
+ return {"ok": True, "observation": _observation_string(obs, reward=float(obs.reward))}
256
+
257
+
258
+ @app.post("/evaluate")
259
+ async def evaluate(request: LeaseRequest) -> dict[str, Any]:
260
+ try:
261
+ lease = await pool.get(request.lease_id)
262
+ except KeyError as exc:
263
+ return {"ok": False, "error": str(exc)}
264
+ score = lease.final_score if lease.final_score is not None else float(lease.env.state.final_score)
265
+ return {"ok": True, "score": score}
266
+
267
+
268
+ @app.post("/close")
269
+ async def close(request: LeaseRequest) -> dict[str, Any]:
270
+ closed = await pool.close(request.lease_id)
271
+ if not closed:
272
+ return {"ok": False, "error": f"Unknown lease {request.lease_id}"}
273
+ return {"ok": True}
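As a reading aid for the lease lifecycle in `pool_server.py` above, here is a minimal standalone sketch of TTL-based eviction. `ToyLease` and `reap_stale` are hypothetical names used only for illustration. Unlike the server code, the clock is passed in as `now` so the behavior is deterministic; the real `LeasePool.reap` reads `time.time()` and holds an `asyncio.Lock` while it mutates the dict.

```python
from dataclasses import dataclass


@dataclass
class ToyLease:
    """Hypothetical stand-in for Lease; only the fields eviction needs."""
    lease_id: str
    last_touch: float


def reap_stale(leases: dict, ttl_s: float, now: float) -> list:
    """Drop leases idle for longer than ttl_s and return their ids."""
    # Collect first, then delete, so the dict is never mutated mid-iteration.
    stale = [lid for lid, lease in leases.items() if now - lease.last_touch > ttl_s]
    for lid in stale:
        leases.pop(lid, None)
    return stale


leases = {
    "a": ToyLease("a", last_touch=0.0),    # idle for 700 s at now=700
    "b": ToyLease("b", last_touch=500.0),  # idle for 200 s at now=700
}
evicted = reap_stale(leases, ttl_s=600.0, now=700.0)
print(evicted, sorted(leases))
```

With the default `POOL_SERVER_LEASE_TTL_S` of 600 seconds, a client that calls `/heartbeat` (or any other lease endpoint) more often than that keeps its lease alive, because `LeasePool.get` refreshes `last_touch` on every lookup.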
openclaw_integration/sre_env_client.py ADDED
@@ -0,0 +1,159 @@
1
+ """Drop-in replacement for OpenClaw-RL `terminal-rl/env_client.py`.
2
+
3
+ Interface matches `TerminalEnvClient` (allocate / heartbeat / reset /
4
+ exec_tool / evaluate / close) so OpenClaw-RL's rollout agent can swap imports
5
+ with one line.
6
+
7
+ Standalone (no slime dep) — uses httpx directly. To use slime's retrying
8
+ post() helper instead, replace `_post` with `slime.utils.http_utils.post`.
9
+
10
+ Env vars:
11
+ ENV_SERVER_URL required, e.g. http://127.0.0.1:8100
12
+ ENV_HTTP_MAX_RETRIES default 10
13
+ ENV_ALLOCATE_MAX_RETRIES default 10
14
+ ENV_EVALUATE_MAX_RETRIES default 1
15
+ ENV_CLOSE_MAX_RETRIES default 3
16
+ ENV_EXEC_TOOL_MAX_RETRIES default 3
17
+ ENV_HTTP_TIMEOUT_S default 30
18
+ """
19
+
20
+ from __future__ import annotations
21
+
22
+ import asyncio
23
+ import logging
24
+ import os
25
+ from typing import Any
26
+
27
+ import httpx
28
+
29
+ logger = logging.getLogger(__name__)
30
+
31
+
32
+ def create_env_client() -> "SreEnvClient":
33
+ env_server_url = os.getenv("ENV_SERVER_URL", "")
34
+ if not env_server_url:
35
+ raise RuntimeError("ENV_SERVER_URL is empty.")
36
+ return SreEnvClient(env_server_url)
37
+
38
+
39
+ async def _post(
40
+ url: str,
41
+ payload: dict[str, Any],
42
+ *,
43
+ max_retries: int,
44
+ timeout_s: float,
45
+ ) -> dict[str, Any]:
46
+ last_exc: Exception | None = None
47
+ for attempt in range(max_retries):
48
+ try:
49
+ async with httpx.AsyncClient(timeout=timeout_s) as client:
50
+ response = await client.post(url, json=payload)
51
+ response.raise_for_status()
52
+ return response.json()
53
+ except Exception as exc: # retries every failure: transport errors, HTTP error statuses, and JSON decode errors
54
+ last_exc = exc
55
+ wait = min(2 ** attempt * 0.25, 5.0)
56
+ logger.debug("POST %s failed (attempt %d/%d): %s", url, attempt + 1, max_retries, exc)
57
+ await asyncio.sleep(wait)
58
+ raise RuntimeError(f"POST {url} failed after {max_retries} retries: {last_exc}")
59
+
60
+
61
+ class SreEnvClient:
62
+ """OpenClaw-RL-shaped client for the sre-gym pool server."""
63
+
64
+ def __init__(self, base_url: str) -> None:
65
+ self.base_url = base_url.rstrip("/")
66
+ self.default_max_retries = int(os.getenv("ENV_HTTP_MAX_RETRIES", "10"))
67
+ self.allocate_max_retries = int(os.getenv("ENV_ALLOCATE_MAX_RETRIES", "10"))
68
+ self.evaluate_max_retries = int(os.getenv("ENV_EVALUATE_MAX_RETRIES", "1"))
69
+ self.close_max_retries = int(os.getenv("ENV_CLOSE_MAX_RETRIES", "3"))
70
+ self.exec_tool_max_retries = int(os.getenv("ENV_EXEC_TOOL_MAX_RETRIES", "3"))
71
+ self.timeout_s = float(os.getenv("ENV_HTTP_TIMEOUT_S", "30"))
72
+
73
+ async def allocate(self, task_key: str, request_id: str | None = None) -> dict[str, Any]:
74
+ out = await _post(
75
+ f"{self.base_url}/allocate",
76
+ {"task_key": task_key, "request_id": request_id},
77
+ max_retries=self.allocate_max_retries,
78
+ timeout_s=self.timeout_s,
79
+ )
80
+ if not out.get("ok", False):
81
+ raise RuntimeError(f"allocate failed: {out}")
82
+ return out
83
+
84
+ async def heartbeat(self, lease_id: str) -> None:
85
+ out = await _post(
86
+ f"{self.base_url}/heartbeat",
87
+ {"lease_id": lease_id},
88
+ max_retries=self.default_max_retries,
89
+ timeout_s=self.timeout_s,
90
+ )
91
+ if not out.get("ok", False):
92
+ raise RuntimeError(f"heartbeat failed: {out}")
93
+
94
+ async def reset(
95
+ self,
96
+ lease_id: str,
97
+ task_meta: dict[str, Any],
98
+ run_ctx: dict[str, Any],
99
+ task_timeouts: dict[str, Any] | None = None,
100
+ ) -> dict[str, Any]:
101
+ out = await _post(
102
+ f"{self.base_url}/reset",
103
+ {
104
+ "lease_id": lease_id,
105
+ "task_meta": task_meta,
106
+ "run_ctx": run_ctx,
107
+ "task_timeouts": task_timeouts,
108
+ },
109
+ max_retries=self.default_max_retries,
110
+ timeout_s=self.timeout_s,
111
+ )
112
+ if not out.get("ok", False):
113
+ raise RuntimeError(f"reset failed: {out}")
114
+ return out
115
+
116
+ async def exec_tool(self, lease_id: str, tool_name: str, arguments: dict[str, Any]) -> str:
117
+ out = await _post(
118
+ f"{self.base_url}/exec_tool",
119
+ {
120
+ "lease_id": lease_id,
121
+ "tool_call": {"name": tool_name, "arguments": arguments},
122
+ },
123
+ max_retries=self.exec_tool_max_retries,
124
+ timeout_s=self.timeout_s,
125
+ )
126
+ if not out.get("ok", False):
127
+ raise RuntimeError(f"exec_tool failed: {out}")
128
+ return str(out.get("observation", ""))
129
+
130
+ async def evaluate(self, lease_id: str) -> float:
131
+ out = await _post(
132
+ f"{self.base_url}/evaluate",
133
+ {"lease_id": lease_id},
134
+ max_retries=self.evaluate_max_retries,
135
+ timeout_s=self.timeout_s,
136
+ )
137
+ if not out.get("ok", False):
138
+ raise RuntimeError(f"evaluate failed: {out}")
139
+ return float(out.get("score", 0.0))
140
+
141
+ async def close(self, lease_id: str) -> None:
142
+ try:
143
+ out = await _post(
144
+ f"{self.base_url}/close",
145
+ {"lease_id": lease_id},
146
+ max_retries=self.close_max_retries,
147
+ timeout_s=self.timeout_s,
148
+ )
149
+ except Exception as exc:
150
+ if "Unknown lease" in str(exc):
151
+ logger.debug("close(%s): lease already gone", lease_id)
152
+ return
153
+ raise
154
+ if not out.get("ok", False):
155
+ error_msg = str(out.get("error", ""))
156
+ if "Unknown" in error_msg and "lease" in error_msg.lower():
157
+ logger.debug("close(%s): lease already gone", lease_id)
158
+ return
159
+ raise RuntimeError(f"close failed: {out}")
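The call order a rollout agent would follow against this client (allocate, reset, exec_tool, evaluate, close) can be sketched as follows. This is an illustration under stated assumptions, not the OpenClaw-RL rollout code: `FakeClient` and `run_episode` are hypothetical names, `FakeClient` only mimics the method shape of `SreEnvClient` so the sketch runs without a pool server, and the returned score is a placeholder rather than a real grader value.

```python
import asyncio
import json
from typing import Any, Optional


class FakeClient:
    """Hypothetical in-memory stand-in with the same method shape as SreEnvClient."""

    def __init__(self) -> None:
        self.calls = []

    async def allocate(self, task_key: str, request_id: Optional[str] = None) -> dict:
        self.calls.append("allocate")
        return {"ok": True, "lease_id": "lease-1", "task_key": task_key}

    async def reset(self, lease_id: str, task_meta: dict, run_ctx: dict,
                    task_timeouts: Optional[dict] = None) -> dict:
        self.calls.append("reset")
        return {"ok": True, "observation": json.dumps({"done": False})}

    async def exec_tool(self, lease_id: str, tool_name: str, arguments: dict) -> str:
        self.calls.append(f"exec_tool:{tool_name}")
        # The pool server returns the observation as a JSON string with a "done" flag.
        return json.dumps({"done": tool_name == "declare_resolved"})

    async def evaluate(self, lease_id: str) -> float:
        self.calls.append("evaluate")
        return 0.5  # placeholder score, not produced by the real grader

    async def close(self, lease_id: str) -> None:
        self.calls.append("close")


async def run_episode(client: Any, task_key: str, actions: list) -> float:
    """Drive one episode and always release the lease, even on failure."""
    lease = await client.allocate(task_key)
    lease_id = lease["lease_id"]
    try:
        await client.reset(lease_id, task_meta={}, run_ctx={})
        for name, arguments in actions:
            observation = json.loads(await client.exec_tool(lease_id, name, arguments))
            if observation.get("done"):
                break
        return await client.evaluate(lease_id)
    finally:
        await client.close(lease_id)
```

Closing in a `finally` block matters here because the server only evicts abandoned leases after the TTL expires; an explicit `close` frees the environment immediately.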
openenv.yaml CHANGED
@@ -12,10 +12,10 @@ environment:
12
  observation_type: UnifiedIncidentObservation
13
  state_type: UnifiedIncidentState
14
  max_steps: 12
15
- difficulties: [easy]
16
  reward_type: dense
17
 
18
  huggingface:
19
- space_id: gylder/my-env
20
  sdk: docker
21
  hardware: cpu-basic
 
12
  observation_type: UnifiedIncidentObservation
13
  state_type: UnifiedIncidentState
14
  max_steps: 12
15
+ difficulties: [easy, medium, hard]
16
  reward_type: dense
17
 
18
  huggingface:
19
+ space_id: dakshdoesdev/sre-gym
20
  sdk: docker
21
  hardware: cpu-basic
skill/SKILL.md ADDED
@@ -0,0 +1,100 @@
1
+ ---
2
+ name: sre-gym
3
+ description: SRE incident-response training environment with fault injection and deterministic grading. Use when the user wants to practice SRE skills, solve an injected production incident, or run one of three scenarios (worker_deploy_cascade / db_config_rollout / gateway_auth_rollout) against the sre-gym HTTP server. Invokes scripts in skill/tools/ to query the env and records verified runbooks after clean solves.
4
+ ---
5
+
6
+ # SRE Gym — Incident Response Skill
7
+
8
+ You are an SRE agent connected to a running sre-gym environment (HTTP, default `http://127.0.0.1:8000`). The env simulates production incidents with decoy services, deterministic grading, and explicit resolution checks. Your job is to diagnose from evidence, pick the correct remediation, verify recovery, then declare resolved.
9
+
10
+ ## When to use this skill
11
+
12
+ - The user names a scenario (`worker_deploy_cascade`, `db_config_rollout`, `gateway_auth_rollout`) or says "solve an incident / run SRE scenario"
13
+ - The user asks you to practice, benchmark, or demo incident response
14
+ - The user points you at an sre-gym URL
15
+
16
+ ## Core rules (never break these)
17
+
18
+ 1. **Never guess at remediation.** Query evidence (`query_logs`, `query_deploys`, `query_metrics`) before `rollback_deploy` / `restart_service`.
19
+ 2. **Root cause before restart.** Restarting a service before rolling back the triggering change re-inherits the bad state.
20
+ 3. **Never call `declare_resolved` before the scenario's resolution check passes.** Each scenario specifies which check is required; read it from `observation.checks` and from any loaded runbook.
21
+ 4. **Watch for decoys.** Each scenario has a plausible-looking wrong answer. Example: `db_config_rollout` has a recent worker deploy that is *not* the cause. Read logs before committing to a target.
22
+ 5. **Repeating the same no-progress action wastes ticks.** The env emits `loop_warning` when you do this — treat it as a hard signal to try a different evidence source.
23
+
24
+ ## Workflow
25
+
26
+ ### 1. Load prior knowledge
27
+
28
+ Before your first action, check `skill/verified-runbooks/{scenario_id}.md`. If it exists, read it: it is a log of previously successful solves for this exact scenario, written by earlier runs of this skill. Use the winning path and the decoy list.
29
+
30
+ ### 2. Drive the env
31
+
32
+ Use `skill/tools/sre_gym_client.py` to call the env:
33
+
34
+ ```bash
35
+ python skill/tools/sre_gym_client.py list # show available scenarios
36
+ python skill/tools/sre_gym_client.py solve <scenario_id> # run the scripted baseline end-to-end
37
+ python skill/tools/sre_gym_client.py interactive <scenario_id> # stdin: one JSON action per line
38
+ python skill/tools/sre_gym_client.py record-runbook <scenario_id> # append a verified runbook entry
39
+ ```
40
+
41
+ Action JSON matches the env's `UnifiedIncidentAction` model. Examples:
42
+ ```json
43
+ {"action_type": "query_logs", "service": "database"}
44
+ {"action_type": "query_deploys", "service": "worker"}
45
+ {"action_type": "rollback_deploy", "service": "database"}
46
+ {"action_type": "run_check", "check_name": "end_to_end"}
47
+ {"action_type": "declare_resolved"}
48
+ ```
49
+
50
+ ### 3. Investigation loop (per tick)
51
+
52
+ 1. Read `observation.prompt_text` — services, alerts, last result, failure_type, why_failed.
53
+ 2. If `observation.failure_type` is set, your previous action was rejected — **do not repeat it**, read `why_failed` and pick a different evidence source or remediation.
54
+ 3. Form a hypothesis with `submit_hypothesis` once you have enough evidence (usually 2–4 queries). Calibrate `confidence`: ≥0.7 only if you're sure.
55
+ 4. Remediate (`rollback_deploy` → `restart_service` if scenario requires → `run_check`).
56
+ 5. `declare_resolved` only after the required check passes.
57
+
58
+ ### 4. Record the runbook
59
+
60
+ If the episode finishes with `incident_resolved=true` and `final_score > 0.85`, run:
61
+
62
+ ```bash
63
+ python skill/tools/sre_gym_client.py record-runbook <scenario_id>
64
+ ```
65
+
66
+ This appends a new entry to `skill/verified-runbooks/{scenario_id}.md`. Future runs of this skill (yours or another Claude's) load it automatically.
67
+
68
+ ## Action reference (11 actions)
69
+
70
+ | Action | Required fields | Purpose |
71
+ |---|---|---|
72
+ | `query_logs` | `service` | Read service-level error logs |
73
+ | `query_metrics` | `service`, `metric` (cpu/error_rate/latency) | Read quantitative signals |
74
+ | `query_dependencies` | `service` | Map upstream/downstream |
75
+ | `query_deploys` | `service` | Recent deploy history |
76
+ | `rollback_deploy` | `service` | Revert last deploy — SCENARIO-SPECIFIC TARGET |
77
+ | `restart_service` | `service` | Reboot a service (usually after rollback) |
78
+ | `run_check` | `check_name` (`database_recovery` / `end_to_end`) | Objective recovery check |
79
+ | `isolate_service` | `service` | Containment only, does not resolve |
80
+ | `escalate` | — | Record escalation note |
81
+ | `submit_hypothesis` | `hypothesis` object | Commit RCA with confidence calibration |
82
+ | `declare_resolved` | — | Finalize; rejected if required check has not passed |
83
+
84
+ ## Scoring rubric (deterministic from the env)
85
+
86
+ - **Recovery (0–0.4):** services healthy on the critical path
87
+ - **Containment (0–0.3):** root cause removed OR offending service isolated
88
+ - **Verification (0–0.35):** both checks passed
89
+ - **Impact (0–0.15):** user_impact reduced
90
+ - **Efficiency (0–0.10):** budget preserved, no wasteful repeats
91
+
92
+ Clean solve target: **> 0.85**. That's the runbook-record threshold.
93
+
94
+ ## Decoy knowledge (read before hypothesizing)
95
+
96
+ - `worker_deploy_cascade`: the only true cause; no decoys.
97
+ - `db_config_rollout`: the recent worker deploy is a **decoy**. Rolling back worker yields `wrong_remediation_target`.
98
+ - `gateway_auth_rollout`: the recent worker deploy (`worker@...-hotfix` — log-format tweak) is a **decoy**. The gateway auth rollout is the cause.
99
+
100
+ If you take a wrong remediation, the env returns `failure_type="wrong_remediation_target"` and a negative reward — **do not retry the same wrong target**, re-read the logs.
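The action reference table in `SKILL.md` can be mirrored as a small pre-flight check before an action is sent. This is a sketch under assumptions: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names transcribed from the table, not part of the env. The env's own `required_fields_by_action` field on the observation remains the authoritative source.

```python
# Required fields per action, copied from the SKILL.md action reference table.
REQUIRED_FIELDS = {
    "query_logs": ["service"],
    "query_metrics": ["service", "metric"],
    "query_dependencies": ["service"],
    "query_deploys": ["service"],
    "rollback_deploy": ["service"],
    "restart_service": ["service"],
    "run_check": ["check_name"],
    "isolate_service": ["service"],
    "escalate": [],
    "submit_hypothesis": ["hypothesis"],
    "declare_resolved": [],
}


def missing_fields(action: dict) -> list:
    """Return the required fields that are absent from an action dict."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [name for name in REQUIRED_FIELDS[action_type] if action.get(name) is None]


print(missing_fields({"action_type": "query_metrics", "service": "database"}))
print(missing_fields({"action_type": "declare_resolved"}))
```

Catching a missing field locally avoids spending a tick on an action the server would reject as invalid.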
skill/tools/sre_gym_client.py ADDED
@@ -0,0 +1,238 @@
1
+ #!/usr/bin/env python3
2
+ """CLI client for the sre-gym skill.
3
+
4
+ Usage:
5
+ sre_gym_client.py list
6
+ sre_gym_client.py solve <scenario_id> [policy] # positional; only 'baseline' is implemented
7
+ sre_gym_client.py interactive <scenario_id> # stdin: one JSON action per line
8
+ sre_gym_client.py record-runbook <scenario_id> [session.json]
9
+
10
+ Because OpenEnv's HTTP /reset and /step handlers create a fresh environment per
11
+ call, episode state only persists within a single client session. This CLI wraps
12
+ one episode inside one Python process so the session is preserved.
13
+
14
+ SRE_GYM_URL env var overrides the base URL (default http://127.0.0.1:8000).
15
+ """
16
+
17
+ from __future__ import annotations
18
+
19
+ import datetime as _dt
20
+ import json
21
+ import os
22
+ import sys
23
+ from pathlib import Path
24
+ from typing import Any
25
+
26
+ # Make the sibling package importable whether the script is invoked from the
27
+ # repo root or from the skill/ directory directly.
28
+ _REPO_ROOT = Path(__file__).resolve().parent.parent.parent
29
+ if str(_REPO_ROOT) not in sys.path:
30
+ sys.path.insert(0, str(_REPO_ROOT))
31
+
32
+ from unified_incident_env.client import UnifiedIncidentEnv # noqa: E402
33
+ from unified_incident_env.models import UnifiedIncidentAction, UnifiedIncidentObservation # noqa: E402
34
+ from unified_incident_env.server.challenge import SCENARIOS, list_baselines # noqa: E402
35
+
36
+ BASE_URL = os.environ.get("SRE_GYM_URL", "http://127.0.0.1:8000").rstrip("/")
37
+ RUNBOOK_DIR = Path(__file__).resolve().parent.parent / "verified-runbooks"
38
+ SCORE_THRESHOLD = 0.85
39
+
40
+
41
+ def _clean_action(action: UnifiedIncidentAction) -> dict[str, Any]:
42
+ data = action.model_dump(exclude_none=True)
43
+ if data.get("metadata") == {}:
44
+ data.pop("metadata")
45
+ hypothesis = data.get("hypothesis")
46
+ if isinstance(hypothesis, dict) and hypothesis.get("metadata") == {}:
47
+ hypothesis.pop("metadata", None)
48
+ return data
49
+
50
+
51
+ def _summarize_obs(obs: UnifiedIncidentObservation) -> dict[str, Any]:
52
+ return {
53
+ "tick": obs.tick_count,
54
+ "workflow_stage": obs.workflow_stage,
55
+ "last_action_result": obs.last_action_result,
56
+ "tool_output": obs.tool_output,
57
+ "failure_type": obs.failure_type,
58
+ "why_failed": obs.why_failed,
59
+ "loop_warning": obs.loop_warning,
60
+ "checks": [{"name": c.name, "passed": c.passed} for c in obs.checks],
61
+ "final_score": obs.final_score,
62
+ "incident_resolved": obs.incident_resolved,
63
+ }
64
+
65
+
66
+ def _session_path(scenario_id: str) -> Path:
67
+ return Path(f"/tmp/sre_gym_session.{scenario_id}.json")
68
+
69
+
70
+ def cmd_list() -> None:
71
+ for scenario in SCENARIOS.values():
72
+ print(f" {scenario['difficulty']:<6} {scenario['id']:<25} {scenario['name']}")
73
+
74
+
75
+ def cmd_solve(scenario_id: str, policy: str = "baseline") -> None:
76
+ """Run an entire episode end-to-end inside one process."""
77
+ if scenario_id not in SCENARIOS:
78
+ print(f"error: unknown scenario {scenario_id!r}", file=sys.stderr)
79
+ sys.exit(2)
80
+ if policy != "baseline":
81
+ print(f"error: unknown policy {policy!r} (only 'baseline' available)", file=sys.stderr)
82
+ sys.exit(2)
83
+
84
+ trace: list[dict[str, Any]] = []
85
+ with UnifiedIncidentEnv(base_url=BASE_URL).sync() as env:
86
+ obs = env.reset(scenario_id=scenario_id).observation
87
+ print(f"[reset] scenario={scenario_id} difficulty={obs.difficulty}")
88
+ for step in list_baselines(scenario_id).baselines[0].actions:
89
+ result = env.step(step.action)
90
+ obs = result.observation
91
+ record = {
92
+ "step": obs.tick_count,
93
+ "action": _clean_action(step.action),
94
+ "rationale": step.rationale,
95
+ "reward": result.reward,
96
+ **_summarize_obs(obs),
97
+ }
98
+ trace.append(record)
99
+ action_repr = json.dumps(record["action"], separators=(",", ":"))
100
+ print(f"[step {obs.tick_count}] action={action_repr} reward={result.reward:+.2f} score={obs.final_score:.2f}")
101
+ if result.done:
102
+ break
103
+ final = _summarize_obs(obs)
104
+
105
+ _session_path(scenario_id).write_text(
106
+ json.dumps({"scenario_id": scenario_id, "trace": trace, "final": final}, indent=2),
107
+ encoding="utf-8",
108
+ )
109
+ print(
110
+ f"[done] resolved={final['incident_resolved']} score={final['final_score']:.2f} "
111
+ f"steps={final['tick']} session={_session_path(scenario_id)}"
112
+ )
113
+
114
+
115
+ def cmd_interactive(scenario_id: str) -> None:
116
+ """One JSON action per stdin line. Preserves session for the whole process lifetime."""
117
+ if scenario_id not in SCENARIOS:
118
+ print(f"error: unknown scenario {scenario_id!r}", file=sys.stderr)
119
+ sys.exit(2)
120
+
121
+ trace: list[dict[str, Any]] = []
122
+ with UnifiedIncidentEnv(base_url=BASE_URL).sync() as env:
123
+ obs = env.reset(scenario_id=scenario_id).observation
124
+ print(json.dumps({"event": "reset", "scenario_id": scenario_id, "obs": _summarize_obs(obs)}), flush=True)
125
+ for line in sys.stdin:
126
+ line = line.strip()
127
+ if not line:
128
+ continue
129
+ try:
130
+ data = json.loads(line)
131
+ action = UnifiedIncidentAction(**data)
132
+ except Exception as exc:
133
+ print(json.dumps({"event": "error", "detail": str(exc)}), flush=True)
134
+ continue
135
+ result = env.step(action)
136
+ obs = result.observation
137
+ record = {"step": obs.tick_count, "action": _clean_action(action), "reward": result.reward, **_summarize_obs(obs)}
138
+ trace.append(record)
139
+ print(json.dumps({"event": "step", **record}), flush=True)
140
+ if result.done:
141
+ print(json.dumps({"event": "done", "final": _summarize_obs(obs)}), flush=True)
142
+ break
143
+
144
+ _session_path(scenario_id).write_text(
145
+ json.dumps({"scenario_id": scenario_id, "trace": trace, "final": _summarize_obs(obs)}, indent=2),
146
+ encoding="utf-8",
147
+ )
148
+
149
+
150
+ def cmd_record_runbook(scenario_id: str, session_file: str | None = None) -> None:
151
+ """Append a new runbook entry if the referenced session cleared the threshold."""
152
+ path = Path(session_file) if session_file else _session_path(scenario_id)
153
+ if not path.exists():
154
+ print(f"error: no session file at {path}", file=sys.stderr)
155
+ sys.exit(2)
156
+ session = json.loads(path.read_text(encoding="utf-8"))
157
+ final = session.get("final", {})
158
+ score = float(final.get("final_score", 0.0))
159
+
160
+ if not final.get("incident_resolved"):
161
+ print(f"skip: session not resolved (resolved={final.get('incident_resolved')})", file=sys.stderr)
162
+ sys.exit(1)
163
+ if score <= SCORE_THRESHOLD:
164
+ print(f"skip: score {score:.2f} does not exceed runbook threshold {SCORE_THRESHOLD:.2f}", file=sys.stderr)
165
+ sys.exit(1)
166
+
167
+ RUNBOOK_DIR.mkdir(parents=True, exist_ok=True)
168
+ runbook_path = RUNBOOK_DIR / f"{scenario_id}.md"
169
+
170
+ timestamp = _dt.datetime.now(_dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
171
+ steps = int(final.get("tick", 0))
172
+ checks_passed = [c["name"] for c in final.get("checks", []) if c.get("passed")]
173
+ trace = session.get("trace", [])
174
+
175
+ header = (
176
+ f"# verified-runbooks/{scenario_id}.md\n\n"
177
+ "Runbook entries are written by the sre-gym skill after a successful solve "
178
+ f"(incident_resolved=true and final_score > {SCORE_THRESHOLD:.2f}).\n"
179
+ "Each entry is immutable evidence — treat it as ground truth for the winning path.\n\n---\n"
180
+ )
181
+ lines = [f"\n## Run {timestamp} — Score {score:.2f}\n"]
182
+ lines.append(f"- Steps: **{steps}**")
183
+ lines.append(f"- Checks passed: {', '.join(checks_passed) or 'none'}")
184
+ lines.append("")
185
+ lines.append("**Winning path:**")
186
+ for entry in trace:
187
+ act = entry["action"]
188
+ action_type = act.get("action_type")
189
+ extras = ", ".join(
190
+ f"{k}={v if not isinstance(v, dict) else v.get('root_cause', v)}"
191
+ for k, v in act.items()
192
+ if k != "action_type" and v not in (None, {})
193
+ )
194
+ extra_str = f" ({extras})" if extras else ""
195
+ rationale = entry.get("rationale", "").rstrip(".")
196
+ lines.append(f"{entry['step']}. `{action_type}{extra_str}` — {rationale}")
197
+ lines.append("")
198
+ entry_text = "\n".join(lines)
199
+
200
+ if not runbook_path.exists():
201
+ runbook_path.write_text(header + entry_text, encoding="utf-8")
202
+ else:
203
+ with runbook_path.open("a", encoding="utf-8") as f:
204
+ f.write(entry_text)
205
+ print(f"recorded runbook entry → {runbook_path} (score {score:.2f}, {steps} steps)")
206
+
207
+
208
+ def main() -> None:
209
+ argv = sys.argv[1:]
210
+ if not argv:
211
+ print(__doc__, file=sys.stderr)
212
+ sys.exit(2)
213
+ cmd, *rest = argv
214
+ if cmd == "list":
215
+ cmd_list()
216
+ elif cmd == "solve":
217
+ if not rest:
218
+ print("error: solve requires <scenario_id>", file=sys.stderr)
219
+ sys.exit(2)
220
+ cmd_solve(rest[0], rest[1] if len(rest) > 1 else "baseline")
221
+ elif cmd == "interactive":
222
+ if not rest:
223
+ print("error: interactive requires <scenario_id>", file=sys.stderr)
224
+ sys.exit(2)
225
+ cmd_interactive(rest[0])
226
+ elif cmd == "record-runbook":
227
+ if not rest:
228
+ print("error: record-runbook requires <scenario_id>", file=sys.stderr)
229
+ sys.exit(2)
230
+ cmd_record_runbook(rest[0], rest[1] if len(rest) > 1 else None)
231
+ else:
232
+ print(f"error: unknown command {cmd!r}", file=sys.stderr)
233
+ print(__doc__, file=sys.stderr)
234
+ sys.exit(2)
235
+
236
+
237
+ if __name__ == "__main__":
238
+ main()
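The gate that `record-runbook` applies to a saved session can be summarized in a few lines. This is a sketch, assuming the rule as documented in `SKILL.md` and in the runbook header (`incident_resolved=true` and `final_score > 0.85`); `is_recordable` is a hypothetical helper, and the session dict mirrors the JSON layout written by `cmd_solve` and `cmd_interactive`.

```python
SCORE_THRESHOLD = 0.85  # same value as the constant in sre_gym_client.py


def is_recordable(session: dict) -> bool:
    """True when a saved session qualifies for a verified-runbook entry."""
    final = session.get("final", {})
    resolved = bool(final.get("incident_resolved"))
    score = float(final.get("final_score", 0.0))
    # Strictly greater than the threshold, following the documented rule.
    return resolved and score > SCORE_THRESHOLD


clean = {"final": {"incident_resolved": True, "final_score": 0.99}}
weak = {"final": {"incident_resolved": True, "final_score": 0.74}}
unresolved = {"final": {"incident_resolved": False, "final_score": 0.99}}
print(is_recordable(clean), is_recordable(weak), is_recordable(unresolved))
```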
skill/verified-runbooks/.gitkeep ADDED
File without changes
skill/verified-runbooks/db_config_rollout.md ADDED
@@ -0,0 +1,23 @@
1
+ # verified-runbooks/db_config_rollout.md
2
+
3
+ Runbook entries are written by the sre-gym skill after a successful solve (incident_resolved=true and final_score > 0.85).
4
+ Each entry is immutable evidence — treat it as ground truth for the winning path.
5
+
6
+ ---
7
+
8
+ ## Run 2026-04-23T22:01:33Z — Score 0.99
9
+
10
+ - Steps: **10**
11
+ - Checks passed: database_recovery, end_to_end
12
+
13
+ **Winning path:**
14
+ 1. `query_logs (service=database)` — Database is the loudest alert; inspect logs for the actual error signature
15
+ 2. `query_deploys (service=database)` — Pool-acquire errors suggest a config change; check recent database rollouts
16
+ 3. `query_metrics (service=database, metric=error_rate)` — Confirm the error pattern is pool exhaustion rather than compute overload
17
+ 4. `query_logs (service=worker)` — Rule out the decoy worker deploy by reading worker logs directly
18
+ 5. `submit_hypothesis (hypothesis=database_only_failure)` — Localize the fault to the database config before remediating
19
+ 6. `rollback_deploy (service=database)` — Roll back the offending database config rollout
20
+ 7. `restart_service (service=database)` — Restart the database cleanly against the restored pool config
21
+ 8. `run_check (check_name=database_recovery)` — Verify database pool health and write latency are back within SLO
22
+ 9. `run_check (check_name=end_to_end)` — Verify gateway write-path traffic succeeds end-to-end
23
+ 10. `declare_resolved` — Declare resolved only after objective checks pass
skill/verified-runbooks/gateway_auth_rollout.md ADDED
@@ -0,0 +1,21 @@
1
+ # verified-runbooks/gateway_auth_rollout.md
2
+
3
+ Runbook entries are written by the sre-gym skill after a successful solve (incident_resolved=true and final_score > 0.85).
4
+ Each entry is immutable evidence — treat it as ground truth for the winning path.
5
+
6
+ ---
7
+
8
+ ## Run 2026-04-23T22:01:37Z — Score 0.99
9
+
10
+ - Steps: **8**
11
+ - Checks passed: database_recovery, end_to_end
12
+
13
+ **Winning path:**
14
+ 1. `query_logs (service=api-gateway)` — Gateway is rejecting logins; read gateway logs to localize the rejection class
15
+ 2. `query_deploys (service=api-gateway)` — Login rejection aligns with a recent auth middleware rollout; confirm deploy timing
16
+ 3. `query_deploys (service=worker)` — Rule out the worker deploy explicitly rather than assuming
17
+ 4. `submit_hypothesis (hypothesis=api_gateway_fault)` — Commit a calibrated hypothesis localizing to the gateway auth rollout
18
+ 5. `rollback_deploy (service=api-gateway)` — Roll back the bad auth middleware rollout; no restart needed
19
+ 6. `run_check (check_name=end_to_end)` — Verify that gateway login traffic now succeeds end-to-end
20
+ 7. `run_check (check_name=database_recovery)` — Confirm the database is (and stayed) healthy throughout
21
+ 8. `declare_resolved` — Declare resolved only after objective checks pass
skill/verified-runbooks/worker_deploy_cascade.md ADDED
@@ -0,0 +1,23 @@
+ # verified-runbooks/worker_deploy_cascade.md
+
+ Runbook entries are written by the sre-gym skill after a successful solve (incident_resolved=true and final_score > 0.85).
+ Each entry is immutable evidence — treat it as ground truth for the winning path.
+
+ ---
+
+ ## Run 2026-04-23T22:01:29Z — Score 0.99
+
+ - Steps: **10**
+ - Checks passed: database_recovery, end_to_end
+
+ **Winning path:**
+ 1. `query_deploys (service=worker)` — Check whether any recent deploy aligns with the incident start
+ 2. `query_logs (service=worker)` — Inspect worker logs because deploy timing and queue pressure suggest worker-originated harm
+ 3. `query_metrics (service=database, metric=cpu)` — Confirm that the database is overloaded as a downstream effect
+ 4. `query_dependencies (service=api-gateway)` — Verify the gateway depends on the worker and database path
+ 5. `submit_hypothesis (hypothesis=bad_worker_deploy)` — Commit a calibrated hypothesis before taking an invasive mitigation step
+ 6. `rollback_deploy (service=worker)` — Remove the triggering change before restarting downstream services
+ 7. `restart_service (service=database)` — Bring the database back cleanly after the root cause is removed
+ 8. `run_check (check_name=database_recovery)` — Verify the database is no longer crashing
+ 9. `run_check (check_name=end_to_end)` — Verify gateway traffic succeeds end-to-end
+ 10. `declare_resolved` — Declare resolved only after objective checks pass
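
Reviewer sketch (not part of the commit): the "Winning path" lines in these runbooks follow a regular shape, a step number followed by a backticked action with optional `key=value` arguments. Assuming that shape holds for every entry, a small parser could turn a runbook back into replayable steps. The names `STEP_RE`, `parse_step`, and `parse_winning_path` are hypothetical and do not appear in the repository.

```python
import re

# Matches "N. `action (key=value, key=value)`". Anything after the closing
# backtick (the rationale text) is deliberately ignored.
STEP_RE = re.compile(r"^\s*(\d+)\.\s+`(\w+)(?:\s*\(([^)]*)\))?`")


def parse_step(line):
    """Return (index, action_type, args) for one winning-path line, else None."""
    match = STEP_RE.match(line)
    if match is None:
        return None
    index, action_type, raw_args = match.groups()
    args = {}
    for pair in (raw_args or "").split(","):
        if "=" in pair:
            key, value = pair.split("=", 1)
            args[key.strip()] = value.strip()
    return int(index), action_type, args


def parse_winning_path(markdown):
    """Collect every parsable step line, in document order."""
    steps = [parse_step(line) for line in markdown.splitlines()]
    return [step for step in steps if step is not None]
```

Note that the runbook abbreviates `submit_hypothesis` to a single `hypothesis=` value, whereas the collection harness in this commit sends a nested hypothesis object, so a replay tool would probably need to expand that step by hand.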
train/collect_trajectories.py ADDED
@@ -0,0 +1,471 @@
+ """Parallel async harness for collecting Claude-driven sre-gym trajectories.
+
+ Example:
+
+     python train/collect_trajectories.py \
+         --env-url https://dakshdoesdev-sre-gym.hf.space \
+         --scenarios worker_deploy_cascade,db_config_rollout,gateway_auth_rollout \
+         --models claude-sonnet-4-6,claude-haiku-4-5-20251001 \
+         --episodes-per-model 1000 \
+         --parallelism 20 \
+         --output data/trajectories.jsonl
+
+ `--episodes-per-model` is total episodes per model across the resolved scenario
+ set. Scenario assignment is round-robin so every requested scenario receives
+ coverage over a long run.
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import asyncio
+ import json
+ import os
+ import sys
+ import time
+ import uuid
+ from dataclasses import dataclass
+ from pathlib import Path
+ from typing import Any
+
+ import httpx
+
+ REPO_ROOT = Path(__file__).resolve().parents[1]
+ if str(REPO_ROOT) not in sys.path:
+     sys.path.insert(0, str(REPO_ROOT))
+
+ try:
+     from anthropic import AsyncAnthropic
+ except ImportError:  # pragma: no cover - handled at runtime in anthropic mode
+     AsyncAnthropic = None  # type: ignore[assignment]
+
+ from unified_incident_env.client import UnifiedIncidentEnv
+ from unified_incident_env.models import UnifiedIncidentAction, UnifiedIncidentObservation
+ from unified_incident_env.server.challenge import SCENARIOS, SUPPORTED_DIFFICULTIES
+
+ SYSTEM_PROMPT = (
+     "You are collecting trajectories for a deterministic SRE incident benchmark.\n"
+     "Return exactly one JSON object and nothing else.\n"
+     "Choose only from the allowed action types shown in the prompt.\n"
+     "Use only the required fields for the chosen action.\n"
+     "Do not include markdown, prose, or code fences."
+ )
+ METRIC_OPTIONS = ("cpu", "error_rate", "latency")
+ CHECK_OPTIONS = ("database_recovery", "end_to_end")
+ ROOT_CAUSE_OPTIONS = (
+     "bad_worker_deploy",
+     "database_only_failure",
+     "api_gateway_fault",
+ )
+
+
+ @dataclass(frozen=True)
+ class EpisodeJob:
+     model: str
+     scenario_id: str
+     ordinal: int
+
+
+ def parse_args() -> argparse.Namespace:
+     parser = argparse.ArgumentParser(description=__doc__.splitlines()[0])
+     parser.add_argument("--env-url", required=True, help="sre-gym server base URL")
+     parser.add_argument("--scenarios", required=True, help="comma-separated scenario ids, difficulties, or all")
+     parser.add_argument("--models", required=True, help="comma-separated Anthropic model ids")
+     parser.add_argument("--episodes-per-model", type=int, default=1000)
+     parser.add_argument("--parallelism", type=int, default=20)
+     parser.add_argument("--output", required=True, help="output JSONL path")
+     parser.add_argument("--driver", choices=("anthropic", "heuristic"), default="anthropic")
+     parser.add_argument("--anthropic-api-key", default=os.getenv("ANTHROPIC_API_KEY"))
+     parser.add_argument("--anthropic-base-url", default=os.getenv("ANTHROPIC_BASE_URL"))
+     parser.add_argument("--max-tokens", type=int, default=320)
+     parser.add_argument("--env-timeout-s", type=float, default=45.0)
+     parser.add_argument("--anthropic-timeout-s", type=float, default=90.0)
+     parser.add_argument("--max-retries", type=int, default=3)
+     return parser.parse_args()
+
+
+ def _split_csv(raw: str) -> list[str]:
+     return [token.strip() for token in raw.split(",") if token.strip()]
+
+
+ def _resolve_scenarios(raw: str) -> list[str]:
+     # Each token may be a scenario id, a difficulty name, or the literal "all".
+     scenario_ids: list[str] = []
+     for token in _split_csv(raw):
+         if token == "all":
+             scenario_ids.extend(SCENARIOS.keys())
+             continue
+         if token in SUPPORTED_DIFFICULTIES:
+             scenario_ids.extend(
+                 scenario_id
+                 for scenario_id, scenario in SCENARIOS.items()
+                 if scenario["difficulty"] == token
+             )
+             continue
+         if token not in SCENARIOS:
+             raise SystemExit(f"Unknown scenario selector: {token}")
+         scenario_ids.append(token)
+     # De-duplicate while preserving first-seen order.
+     deduped: list[str] = []
+     seen: set[str] = set()
+     for scenario_id in scenario_ids:
+         if scenario_id not in seen:
+             deduped.append(scenario_id)
+             seen.add(scenario_id)
+     if not deduped:
+         raise SystemExit("No scenarios resolved from --scenarios")
+     return deduped
+
+
+ def _resolve_models(raw: str) -> list[str]:
+     models = _split_csv(raw)
+     if not models:
+         raise SystemExit("No models resolved from --models")
+     return models
+
+
+ def _service_order(observation: UnifiedIncidentObservation) -> list[str]:
+     services = list(observation.service_health.items())
+     # Most suspicious first: degraded/crashed, then isolated, then healthy,
+     # with higher error rate and latency breaking ties. The key uses `!=`
+     # because `reverse=True` would otherwise rank healthy services first.
+     services.sort(
+         key=lambda item: (
+             item[1].status != "healthy",
+             item[1].status != "isolated",
+             item[1].error_rate_pct,
+             item[1].latency_ms,
+         ),
+         reverse=True,
+     )
+     return [name for name, _payload in services]
+
+
+ def _default_action_for_type(action_type: str, observation: UnifiedIncidentObservation) -> dict[str, Any]:
+     services = _service_order(observation)
+     service = services[0] if services else "database"
+     if action_type in {"query_logs", "query_dependencies", "query_deploys", "rollback_deploy", "restart_service", "isolate_service"}:
+         return {"action_type": action_type, "service": service}
+     if action_type == "query_metrics":
+         return {"action_type": action_type, "service": service, "metric": "cpu"}
+     if action_type == "run_check":
+         pending_checks = [check.name for check in observation.checks if not check.passed]
+         check_name = pending_checks[0] if pending_checks else "end_to_end"
+         return {"action_type": action_type, "check_name": check_name}
+     if action_type == "submit_hypothesis":
+         return {
+             "action_type": "submit_hypothesis",
+             "hypothesis": {
+                 "root_cause": ROOT_CAUSE_OPTIONS[0],
+                 "affected_services": services[:2] or ["database"],
+                 "confidence": 0.5,
+                 "recommended_next_action": "query_logs",
+             },
+         }
+     return {"action_type": action_type}
+
+
+ def _build_fallback_action(observation: UnifiedIncidentObservation) -> UnifiedIncidentAction:
+     pending_checks = [check.name for check in observation.checks if not check.passed]
+     if observation.workflow_stage == "validation" and pending_checks:
+         return UnifiedIncidentAction(action_type="run_check", check_name=pending_checks[0])
+     if observation.workflow_stage == "validation" and not pending_checks:
+         return UnifiedIncidentAction(action_type="declare_resolved")
+     if observation.workflow_stage == "mitigation":
+         services = _service_order(observation)
+         service = services[0] if services else "database"
+         if "rollback_deploy" in observation.allowed_actions:
+             return UnifiedIncidentAction(action_type="rollback_deploy", service=service)
+         if "restart_service" in observation.allowed_actions:
+             return UnifiedIncidentAction(action_type="restart_service", service=service)
+     if "query_logs" in observation.allowed_actions:
+         services = _service_order(observation)
+         service = services[0] if services else "database"
+         return UnifiedIncidentAction(action_type="query_logs", service=service)
+     if "query_deploys" in observation.allowed_actions:
+         services = _service_order(observation)
+         service = services[0] if services else "database"
+         return UnifiedIncidentAction(action_type="query_deploys", service=service)
+     action_type = observation.allowed_actions[0]
+     return UnifiedIncidentAction(**_default_action_for_type(action_type, observation))
+
+
+ def _extract_json_object(raw_text: str) -> str:
+     text = raw_text.strip()
+     # Tolerate a fenced reply even though the system prompt forbids fences.
+     if "```" in text:
+         parts = text.split("```")
+         if len(parts) >= 2:
+             text = parts[1]
+             if text.startswith("json"):
+                 text = text[4:]
+     start = text.find("{")
+     end = text.rfind("}")
+     if start != -1 and end != -1 and start < end:
+         return text[start : end + 1].strip()
+     return text
+
+
+ def _parse_action(raw_text: str, observation: UnifiedIncidentObservation) -> UnifiedIncidentAction | None:
+     candidate = _extract_json_object(raw_text)
+     if not candidate:
+         return None
+     try:
+         payload = json.loads(candidate)
+     except Exception:
+         return None
+     if not isinstance(payload, dict):
+         return None
+     # Accept "action" as an alias for "action_type".
+     if "action" in payload and "action_type" not in payload and isinstance(payload["action"], str):
+         payload["action_type"] = payload.pop("action")
+     if payload.get("action_type") not in observation.allowed_actions:
+         return None
+     try:
+         return UnifiedIncidentAction(**payload)
+     except Exception:
+         return None
+
+
+ def _build_user_prompt(observation: UnifiedIncidentObservation) -> str:
+     required_lines = []
+     for action_name, fields in observation.required_fields_by_action.items():
+         required_lines.append(
+             f"- {action_name}: {', '.join(fields) if fields else '(no extra fields)'}"
+         )
+     service_names = ", ".join(sorted(observation.service_health))
+     return (
+         f"{observation.prompt_text}\n\n"
+         "JSON_RESPONSE_RULES:\n"
+         "- Return exactly one JSON object.\n"
+         "- Use only an allowed action_type.\n"
+         "- Include only the fields required for that action.\n"
+         f"- service must be one of: {service_names}\n"
+         f"- metric must be one of: {', '.join(METRIC_OPTIONS)}\n"
+         f"- check_name must be one of: {', '.join(CHECK_OPTIONS)}\n"
+         f"- hypothesis.root_cause must be one of: {', '.join(ROOT_CAUSE_OPTIONS)}\n"
+         "- hypothesis must include root_cause, affected_services, confidence, and recommended_next_action.\n"
+         "- Noise alerts are decoys; querying them hurts score.\n\n"
+         "REQUIRED_FIELDS_BY_ACTION:\n"
+         + "\n".join(required_lines)
+     )
+
+
+ def _extract_text_response(message: Any) -> str:
+     parts = []
+     for block in getattr(message, "content", []):
+         if getattr(block, "type", "") == "text":
+             parts.append(getattr(block, "text", ""))
+     return "".join(parts).strip()
+
+
+ async def _request_model_output(
+     *,
+     driver: str,
+     anthropic_client: Any,
+     model: str,
+     prompt: str,
+     fallback_action: UnifiedIncidentAction,
+     max_tokens: int,
+     max_retries: int,
+ ) -> tuple[str, str | None]:
+     if driver == "heuristic":
+         return json.dumps(fallback_action.model_dump(exclude_none=True), separators=(",", ":")), "heuristic_driver"
+     last_error: str | None = None
+     for attempt in range(1, max_retries + 1):
+         try:
+             message = await anthropic_client.messages.create(
+                 model=model,
+                 max_tokens=max_tokens,
+                 temperature=0.0,
+                 system=SYSTEM_PROMPT,
+                 messages=[{"role": "user", "content": prompt}],
+             )
+             text = _extract_text_response(message)
+             if text:
+                 return text, None
+             last_error = "empty_text_response"
+         except Exception as exc:  # pragma: no cover - exercised in real collection runs
+             last_error = f"{type(exc).__name__}: {exc}"
+         # Linear backoff between attempts, capped at 5 seconds.
+         if attempt < max_retries:
+             await asyncio.sleep(min(2.0 * attempt, 5.0))
+     # All attempts failed: return the fallback action plus the last error as a note.
+     return json.dumps(fallback_action.model_dump(exclude_none=True), separators=(",", ":")), last_error or "model_request_failed"
+
+
+ async def _collect_episode(
+     job: EpisodeJob,
+     *,
+     anthropic_client: Any,
+     args: argparse.Namespace,
+ ) -> dict[str, Any]:
+     trajectory: list[dict[str, Any]] = []
+     started = time.perf_counter()
+     steps = 0
+     # Use one id for both the env session and the output record so the two can be joined.
+     episode_id = str(uuid.uuid4())
+     async with UnifiedIncidentEnv(base_url=args.env_url) as env:
+         observation = (await env.reset(scenario_id=job.scenario_id, episode_id=episode_id)).observation
+         while not observation.done:
+             prompt = _build_user_prompt(observation)
+             fallback_action = _build_fallback_action(observation)
+             response_text, driver_note = await _request_model_output(
+                 driver=args.driver,
+                 anthropic_client=anthropic_client,
+                 model=job.model,
+                 prompt=prompt,
+                 fallback_action=fallback_action,
+                 max_tokens=args.max_tokens,
+                 max_retries=args.max_retries,
+             )
+             parsed_action = _parse_action(response_text, observation)
+             action = parsed_action or fallback_action
+             next_step = await env.step(action)
+             next_observation = next_step.observation
+             step_failure = next_observation.failure_type
+             if parsed_action is None and driver_note is None:
+                 driver_note = "invalid_model_output"
+             if driver_note is not None and action == fallback_action:
+                 step_failure = step_failure or driver_note
+             trajectory.append(
+                 {
+                     "tick": observation.tick_count,
+                     "prompt": prompt,
+                     "response_text": response_text,
+                     "action": action.model_dump(exclude_none=True),
+                     "reward": float(next_observation.reward),
+                     "tool_output": next_observation.tool_output,
+                     "failure_type": step_failure,
+                     "workflow_stage": next_observation.workflow_stage,
+                 }
+             )
+             observation = next_observation
+             steps += 1
+     return {
+         "episode_id": episode_id,
+         "scenario_id": job.scenario_id,
+         "model": job.model,
+         "final_score": float(observation.final_score),
+         "incident_resolved": bool(observation.incident_resolved),
+         "steps": steps,
+         "elapsed_s": round(time.perf_counter() - started, 4),
+         "trajectory": trajectory,
+     }
+
+
+ async def _worker(
+     *,
+     name: str,
+     jobs: asyncio.Queue[EpisodeJob],
+     anthropic_client: Any,
+     args: argparse.Namespace,
+     write_lock: asyncio.Lock,
+     output_path: Path,
+     counters: dict[str, int],
+ ) -> None:
+     while True:
+         job = await jobs.get()
+         try:
+             record = await _collect_episode(
+                 job,
+                 anthropic_client=anthropic_client,
+                 args=args,
+             )
+             async with write_lock:
+                 with output_path.open("a", encoding="utf-8") as handle:
+                     handle.write(json.dumps(record))
+                     handle.write("\n")
+                 counters["completed"] += 1
+                 if record["incident_resolved"]:
+                     counters["resolved"] += 1
+                 print(
+                     f"[{counters['completed']}/{counters['total']}] worker={name} model={job.model} "
+                     f"scenario={job.scenario_id} score={record['final_score']:.3f} "
+                     f"resolved={str(record['incident_resolved']).lower()} steps={record['steps']}",
+                     file=sys.stderr,
+                     flush=True,
+                 )
+         except Exception as exc:
+             # Log and keep going: an uncaught error would end this worker, and
+             # if every worker died, jobs.join() would wait forever.
+             counters["failed"] = counters.get("failed", 0) + 1
+             print(
+                 f"worker={name} model={job.model} scenario={job.scenario_id} "
+                 f"failed: {type(exc).__name__}: {exc}",
+                 file=sys.stderr,
+                 flush=True,
+             )
+         finally:
+             jobs.task_done()
+
+
+ async def _run_collection(args: argparse.Namespace) -> None:
+     scenario_ids = _resolve_scenarios(args.scenarios)
+     models = _resolve_models(args.models)
+     if args.driver == "anthropic":
+         if AsyncAnthropic is None:
+             raise SystemExit("anthropic is not installed. Add it via train/requirements-train.txt before running.")
+         if not args.anthropic_api_key:
+             raise SystemExit("ANTHROPIC_API_KEY is required when --driver=anthropic")
+
+     # Start from an empty output file; any previous run at this path is discarded.
+     output_path = Path(args.output)
+     output_path.parent.mkdir(parents=True, exist_ok=True)
+     if output_path.exists():
+         output_path.unlink()
+
+     # Round-robin scenario assignment per model.
+     jobs: asyncio.Queue[EpisodeJob] = asyncio.Queue()
+     for model in models:
+         for ordinal in range(args.episodes_per_model):
+             scenario_id = scenario_ids[ordinal % len(scenario_ids)]
+             jobs.put_nowait(EpisodeJob(model=model, scenario_id=scenario_id, ordinal=ordinal))
+
+     # Fail fast if the environment server is unreachable.
+     probe_client = httpx.AsyncClient(
+         base_url=args.env_url.rstrip("/"),
+         timeout=httpx.Timeout(args.env_timeout_s),
+         follow_redirects=True,
+     )
+     health = await probe_client.get("/health")
+     health.raise_for_status()
+     await probe_client.aclose()
+
+     anthropic_http_client = httpx.AsyncClient(
+         timeout=httpx.Timeout(args.anthropic_timeout_s),
+         limits=httpx.Limits(
+             max_connections=max(args.parallelism * 2, 20),
+             max_keepalive_connections=max(args.parallelism, 10),
+         ),
+         follow_redirects=True,
+     )
+     anthropic_client = None
+     if args.driver == "anthropic":
+         anthropic_client = AsyncAnthropic(
+             api_key=args.anthropic_api_key,
+             base_url=args.anthropic_base_url or None,
+             http_client=anthropic_http_client,
+         )
+
+     write_lock = asyncio.Lock()
+     counters = {
+         "completed": 0,
+         "resolved": 0,
+         "total": jobs.qsize(),
+     }
+     workers = [
+         asyncio.create_task(
+             _worker(
+                 name=f"w{index + 1}",
+                 jobs=jobs,
+                 anthropic_client=anthropic_client,
+                 args=args,
+                 write_lock=write_lock,
+                 output_path=output_path,
+                 counters=counters,
+             )
+         )
+         for index in range(min(args.parallelism, counters["total"]))
+     ]
+
+     try:
+         await jobs.join()
+     finally:
+         for worker in workers:
+             worker.cancel()
+         await asyncio.gather(*workers, return_exceptions=True)
+         await anthropic_http_client.aclose()
+
+     success_rate = counters["resolved"] / counters["total"] if counters["total"] else 0.0
+     print(
+         f"completed={counters['completed']} resolved={counters['resolved']} "
+         f"success_rate={success_rate:.3f} output={output_path}",
+         file=sys.stderr,
+         flush=True,
+     )
+
+
+ def main() -> None:
+     args = parse_args()
+     asyncio.run(_run_collection(args))
+
+
+ if __name__ == "__main__":
+     main()
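
Reviewer sketch (not part of the commit): the harness above writes one JSON object per line with `model`, `scenario_id`, `final_score`, and `incident_resolved` fields. Assuming that record shape, a short post-processing pass could report per-model, per-scenario resolve rates before the data is used for SFT. The function name `summarize` is hypothetical.

```python
import json
from collections import defaultdict


def summarize(lines):
    """Aggregate episode count, resolve rate, and mean score per (model, scenario)."""
    buckets = defaultdict(lambda: {"episodes": 0, "resolved": 0, "score_sum": 0.0})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(line)
        bucket = buckets[(record["model"], record["scenario_id"])]
        bucket["episodes"] += 1
        bucket["resolved"] += int(record["incident_resolved"])
        bucket["score_sum"] += record["final_score"]
    return {
        key: {
            "episodes": b["episodes"],
            "resolve_rate": b["resolved"] / b["episodes"],
            "mean_score": b["score_sum"] / b["episodes"],
        }
        for key, b in buckets.items()
    }
```

Because `summarize` accepts any iterable of lines, it can be fed an open file handle directly, for example `with open("data/trajectories.jsonl", encoding="utf-8") as handle: summarize(handle)`.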
train/requirements-train.txt ADDED
@@ -0,0 +1,18 @@
+ # Pinned training-stack deps for the sanity_run.ipynb Colab notebook.
+ #
+ # Qwen3.5 4B support is still maturing in Unsloth; the version range below
+ # reflects what landed in their main branch as of 2026-04. If Qwen3.5 4B
+ # fails to load tonight, fall back to Qwen3 4B by changing MODEL_NAME in the
+ # notebook — no other change needed.
+
+ unsloth>=2025.12,<2026.06
+ unsloth_zoo>=2025.12,<2026.06
+ trl>=0.12.0,<0.16.0
+ transformers>=4.48.0,<4.60.0
+ accelerate>=1.2.0,<2.0.0
+ peft>=0.14.0,<0.20.0
+ datasets>=3.0.0,<4.0.0
+ wandb>=0.18.0,<1.0.0
+ bitsandbytes>=0.45.0
+ httpx>=0.27.0
+ anthropic>=0.97.0,<1.0.0
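
Reviewer sketch (not part of the commit): since the notebook's first success criterion is that all dependencies install without conflicts, it may help to print what actually got installed. The snippet below uses only the standard-library `importlib.metadata`; the `PINNED` tuple simply mirrors the package names in this requirements file, and `installed_versions` is a hypothetical helper name. It reports versions only and does not check them against the pinned ranges.

```python
from importlib import metadata

# Distribution names copied from train/requirements-train.txt.
PINNED = (
    "unsloth", "unsloth_zoo", "trl", "transformers", "accelerate",
    "peft", "datasets", "wandb", "bitsandbytes", "httpx", "anthropic",
)


def installed_versions(names=PINNED):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:<14} {version or 'NOT INSTALLED'}")
```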
train/sanity_run.ipynb ADDED
@@ -0,0 +1,326 @@
+ {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# sre-gym — Training pipeline sanity run\n",
+     "\n",
+     "Purpose: verify the Colab+Unsloth+TRL+wandb pipeline compiles and runs end-to-end on an A100 *before* the hackathon. This notebook is not meant to train anything useful. It runs 200 SFT steps on a tiny hand-made dataset and saves a checkpoint.\n",
+     "\n",
+     "What a successful run looks like:\n",
+     "1. All deps install without version conflicts\n",
+     "2. `Qwen3.5-4B-Instruct` (or `Qwen3-4B-Instruct` fallback) loads in 4-bit via Unsloth\n",
+     "3. 200 steps of LoRA SFT run without OOM on A100 40GB\n",
+     "4. `wandb` logs show loss decreasing\n",
+     "5. Checkpoint is saved to `/content/sanity_ckpt/`\n",
+     "\n",
+     "Friday work: real dataset (2000+ Claude-driven trajectories), 2000+ SFT steps, then GRPO."
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "## 0. Colab runtime sanity"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "!nvidia-smi\n",
+     "!python -c 'import torch; print(\"torch\", torch.__version__, \"cuda\", torch.cuda.is_available())'"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "## 1. Install deps"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "%%bash\n",
+     "# Unsloth's Colab install idiom (handles torch/xformers version pinning):\n",
+     "pip install -q --upgrade pip\n",
+     "pip install -q \"unsloth[colab-new]>=2025.12,<2026.06\"\n",
+     "pip install -q \"unsloth_zoo>=2025.12,<2026.06\"\n",
+     "pip install -q \"trl>=0.12,<0.16\" \"transformers>=4.48,<4.60\" \"peft>=0.14,<0.20\" \"accelerate>=1.2,<2.0\"\n",
+     "pip install -q \"datasets>=3.0,<4.0\" \"wandb>=0.18,<1.0\" \"bitsandbytes>=0.45\" httpx"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "## 2. Config\n",
+     "\n",
+     "If Qwen3.5 4B fails to load, swap `MODEL_NAME` to the Qwen3 4B fallback — no other change needed."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import os\n",
+     "\n",
+     "# Primary target (user-selected).\n",
+     "MODEL_NAME = \"unsloth/Qwen3.5-4B-Instruct-bnb-4bit\"\n",
+     "# Fallback if Unsloth can't load Qwen3.5 on Colab tonight.\n",
+     "FALLBACK_MODEL_NAME = \"unsloth/Qwen3-4B-Instruct-bnb-4bit\"\n",
+     "\n",
+     "MAX_SEQ_LENGTH = 4096\n",
+     "LORA_R = 32\n",
+     "LORA_ALPHA = 32\n",
+     "LEARNING_RATE = 2e-4\n",
+     "NUM_STEPS = 200\n",
+     "BATCH_SIZE = 2\n",
+     "GRAD_ACCUM = 4\n",
+     "OUT_DIR = \"/content/sanity_ckpt\"\n",
+     "\n",
+     "WANDB_PROJECT = os.environ.get(\"WANDB_PROJECT\", \"sre-gym-sanity\")\n",
+     "WANDB_RUN_NAME = os.environ.get(\"WANDB_RUN_NAME\", \"qwen35-4b-sft-toy-200\")\n",
+     "\n",
+     "os.environ.setdefault(\"WANDB_MODE\", \"online\")  # flip to \"offline\" if no wandb login\n",
+     "print(f\"Primary model: {MODEL_NAME}\")"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "## 3. Load model via Unsloth (with fallback)"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "from unsloth import FastLanguageModel\n",
+     "import torch\n",
+     "\n",
+     "model = None\n",
+     "tokenizer = None\n",
+     "errors = []\n",
+     "\n",
+     "for candidate in (MODEL_NAME, FALLBACK_MODEL_NAME):\n",
+     "    try:\n",
+     "        print(f\"Attempting to load: {candidate}\")\n",
+     "        model, tokenizer = FastLanguageModel.from_pretrained(\n",
+     "            model_name=candidate,\n",
+     "            max_seq_length=MAX_SEQ_LENGTH,\n",
+     "            dtype=None,  # let Unsloth pick\n",
+     "            load_in_4bit=True,\n",
+     "        )\n",
+     "        MODEL_NAME = candidate\n",
+     "        print(f\"Loaded {candidate} ok\")\n",
+     "        break\n",
+     "    except Exception as exc:\n",
+     "        errors.append((candidate, repr(exc)))\n",
+     "        print(f\"Load failed for {candidate}: {exc}\")\n",
+     "\n",
+     "if model is None:\n",
+     "    raise RuntimeError(\n",
+     "        \"Both Qwen3.5 4B and Qwen3 4B failed to load via Unsloth. \"\n",
+     "        \"Investigate Unsloth version mismatch before Friday. Errors: \" + str(errors)\n",
+     "    )\n",
+     "\n",
+     "model = FastLanguageModel.get_peft_model(\n",
+     "    model,\n",
+     "    r=LORA_R,\n",
+     "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\", \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
+     "    lora_alpha=LORA_ALPHA,\n",
+     "    lora_dropout=0.0,\n",
+     "    bias=\"none\",\n",
+     "    use_gradient_checkpointing=\"unsloth\",\n",
+     "    random_state=42,\n",
+     ")"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "## 4. Toy training dataset (hand-made, 10 examples)\n",
+     "\n",
+     "These are derived from the 3 deterministic baseline trajectories. Purpose: exercise the tokenize+forward+backward+optimizer path. Not intended to generalize."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import json\n",
+     "from datasets import Dataset\n",
+     "\n",
+     "SYSTEM = 'You are an SRE agent. Respond with one UnifiedIncidentAction JSON object on each turn.'\n",
+     "\n",
+     "TOY_EXAMPLES = [\n",
+     "    (\"worker_deploy_cascade tick 1 — DB crashed, worker degraded, recent worker deploy\",\n",
+     "     '{\"action_type\":\"query_deploys\",\"service\":\"worker\"}'),\n",
+     "    (\"worker_deploy_cascade tick 2 — saw worker@2026.04.23-bad 12m ago\",\n",
+     "     '{\"action_type\":\"query_logs\",\"service\":\"worker\"}'),\n",
+     "    (\"worker_deploy_cascade tick 3 — confirmed worker-originated harm\",\n",
+     "     '{\"action_type\":\"rollback_deploy\",\"service\":\"worker\"}'),\n",
+     "    (\"worker_deploy_cascade tick 4 — worker healthy, DB still crashed\",\n",
+     "     '{\"action_type\":\"restart_service\",\"service\":\"database\"}'),\n",
+     "    (\"worker_deploy_cascade tick 5 — all services healthy, checks pending\",\n",
+     "     '{\"action_type\":\"run_check\",\"check_name\":\"end_to_end\"}'),\n",
+     "    (\"db_config_rollout tick 1 — db degraded, worker decoy, pool-acquire errors\",\n",
+     "     '{\"action_type\":\"query_deploys\",\"service\":\"database\"}'),\n",
+     "    (\"db_config_rollout tick 2 — saw db@2026.04.24-cfg lowering pool to 12\",\n",
+     "     '{\"action_type\":\"rollback_deploy\",\"service\":\"database\"}'),\n",
+     "    (\"gateway_auth_rollout tick 1 — gateway 40% 401s, auth rollout 9m ago\",\n",
+     "     '{\"action_type\":\"query_deploys\",\"service\":\"api-gateway\"}'),\n",
+     "    (\"gateway_auth_rollout tick 2 — confirmed gateway@2026.04.24-auth is cause\",\n",
+     "     '{\"action_type\":\"rollback_deploy\",\"service\":\"api-gateway\"}'),\n",
+     "    (\"gateway_auth_rollout tick 3 — gateway healthy, verify end-to-end\",\n",
+     "     '{\"action_type\":\"run_check\",\"check_name\":\"end_to_end\"}'),\n",
+     "]\n",
+     "\n",
+     "def _format(example):\n",
+     "    prompt, action = example\n",
+     "    messages = [\n",
+     "        {\"role\": \"system\", \"content\": SYSTEM},\n",
+     "        {\"role\": \"user\", \"content\": prompt},\n",
+     "        {\"role\": \"assistant\", \"content\": action},\n",
+     "    ]\n",
+     "    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)\n",
+     "    return {\"text\": text}\n",
+     "\n",
+     "raw = [_format(ex) for ex in TOY_EXAMPLES]\n",
+     "dataset = Dataset.from_list(raw)\n",
+     "print(f\"toy dataset: {len(dataset)} rows\")\n",
+     "print(\"sample text (first 400 chars):\")\n",
+     "print(dataset[0]['text'][:400])"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "## 5. SFT training — 200 steps"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "from trl import SFTTrainer, SFTConfig\n",
+     "\n",
+     "cfg = SFTConfig(\n",
+     "    output_dir=OUT_DIR,\n",
+     "    per_device_train_batch_size=BATCH_SIZE,\n",
+     "    gradient_accumulation_steps=GRAD_ACCUM,\n",
+     "    warmup_steps=10,\n",
+     "    max_steps=NUM_STEPS,\n",
+     "    learning_rate=LEARNING_RATE,\n",
+     "    fp16=not torch.cuda.is_bf16_supported(),\n",
+     "    bf16=torch.cuda.is_bf16_supported(),\n",
+     "    logging_steps=10,\n",
+     "    save_steps=100,\n",
+     "    save_total_limit=2,\n",
+     "    optim=\"adamw_8bit\",\n",
+     "    weight_decay=0.01,\n",
+     "    lr_scheduler_type=\"linear\",\n",
+     "    seed=42,\n",
+     "    report_to=\"wandb\",\n",
+     "    run_name=WANDB_RUN_NAME,\n",
+     "    max_seq_length=MAX_SEQ_LENGTH,\n",
+     "    dataset_text_field=\"text\",\n",
+     "    packing=False,\n",
+     ")\n",
+     "\n",
+     "os.environ.setdefault(\"WANDB_PROJECT\", WANDB_PROJECT)\n",
+     "\n",
+     "trainer = SFTTrainer(\n",
+     "    model=model,\n",
+     "    tokenizer=tokenizer,\n",
+     "    train_dataset=dataset,\n",
+     "    args=cfg,\n",
+     ")\n",
+     "\n",
+     "trainer_stats = trainer.train()\n",
+     "print(trainer_stats)"
+    ]
+   },
263
+ {
264
+ "cell_type": "markdown",
265
+ "metadata": {},
266
+ "source": [
267
+ "## 6. Save LoRA adapter + sanity-check inference"
268
+ ]
269
+ },
270
+ {
271
+ "cell_type": "code",
272
+ "execution_count": null,
273
+ "metadata": {},
274
+ "outputs": [],
275
+ "source": [
276
+ "model.save_pretrained(OUT_DIR)\n",
277
+ "tokenizer.save_pretrained(OUT_DIR)\n",
278
+ "\n",
279
+ "from unsloth import FastLanguageModel\n",
280
+ "FastLanguageModel.for_inference(model)\n",
281
+ "\n",
282
+ "test_prompt = 'worker_deploy_cascade tick 1 — DB crashed, worker degraded, recent worker deploy'\n",
283
+ "messages = [\n",
284
+ " {\"role\": \"system\", \"content\": SYSTEM},\n",
285
+ " {\"role\": \"user\", \"content\": test_prompt},\n",
286
+ "]\n",
287
+ "inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors=\"pt\").to(\"cuda\")\n",
288
+ "out = model.generate(input_ids=inputs, max_new_tokens=64, do_sample=False)\n",
289
+ "print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))"
290
+ ]
291
+ },
292
+ {
293
+ "cell_type": "markdown",
294
+ "metadata": {},
295
+ "source": [
296
+ "## Verification checklist\n",
297
+ "\n",
298
+ "- [ ] Cell 3 loaded a model without OOM or import errors\n",
299
+ "- [ ] Cell 4 produced a chat-formatted dataset (no tokenizer errors)\n",
300
+ "- [ ] Cell 5 ran 200 steps, wandb logged a decreasing loss curve\n",
301
+ "- [ ] Cell 6 generated a JSON-ish action for the test prompt\n",
302
+ "- [ ] `/content/sanity_ckpt/adapter_model.safetensors` exists\n",
303
+ "\n",
304
+ "If any box is unchecked, debug tonight — do not enter Friday with an unknown failure mode."
305
+ ]
306
+ }
307
+ ],
308
+ "metadata": {
309
+ "accelerator": "GPU",
310
+ "colab": {
311
+ "gpuType": "A100",
312
+ "provenance": []
313
+ },
314
+ "kernelspec": {
315
+ "display_name": "Python 3",
316
+ "language": "python",
317
+ "name": "python3"
318
+ },
319
+ "language_info": {
320
+ "name": "python",
321
+ "version": "3.10"
322
+ }
323
+ },
324
+ "nbformat": 4,
325
+ "nbformat_minor": 4
326
+ }
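The last item of the verification checklist above (the adapter file exists) can be checked from a cell instead of the Colab file browser. This is a minimal sketch, assuming the adapter was saved to `/content/sanity_ckpt` as the checklist states. The helper name `adapter_saved` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def adapter_saved(out_dir: str, filename: str = "adapter_model.safetensors") -> bool:
    """Return True if the adapter weights file exists in out_dir and is non-empty.

    `adapter_saved` is an illustrative helper, not a function from the notebook.
    """
    path = Path(out_dir) / filename
    return path.is_file() and path.stat().st_size > 0
```

In the notebook this would be called as `adapter_saved("/content/sanity_ckpt")` after cell 6. A zero-byte file counts as a failure here, on the assumption that an interrupted save should not tick the box.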
unified_incident_env/models.py CHANGED
@@ -21,9 +21,23 @@ ActionType = Literal[
  "submit_hypothesis",
  "declare_resolved",
  ]
- Difficulty = Literal["easy"]
+ Difficulty = Literal["easy", "medium", "hard"]
  MetricName = Literal["cpu", "error_rate", "latency"]
- ServiceName = Literal["api-gateway", "cache", "database", "worker"]
+ ServiceName = Literal[
+ "api-gateway",
+ "cache",
+ "database",
+ "worker",
+ # Noise-service pool surfaced by scenario.difficulty_knobs. These never
+ # appear in service_health (so agents can't query them through the
+ # action schema), but they do appear in alerts as distractors.
+ "stripe-webhook",
+ "email-queue",
+ "sessions-redis",
+ "image-cdn",
+ "feature-flags",
+ "analytics",
+ ]
  ServiceStatus = Literal["healthy", "degraded", "crashed", "isolated"]
  WorkflowStage = Literal["triage", "mitigation", "validation", "resolved"]
  CheckName = Literal["database_recovery", "end_to_end"]
@@ -180,10 +194,13 @@ class UnifiedIncidentObservation(Observation):
  difficulty: Difficulty
  workflow_stage: WorkflowStage
  active_alerts: list[Alert] = Field(default_factory=list)
+ noise_alerts: list[Alert] = Field(default_factory=list)
  service_health: dict[str, ServiceHealth] = Field(default_factory=dict)
  discovered_evidence: list[str] = Field(default_factory=list)
  recent_deploys: list[str] = Field(default_factory=list)
  checks: list[CheckResult] = Field(default_factory=list)
+ blast_radius: int = 0
+ noise_queries: int = 0
  user_impact: float = Field(ge=0.0, le=1.0)
  slo_burn_rate: float = Field(ge=0.0, le=1.0)
  incident_resolved: bool = False
@@ -222,10 +239,13 @@ class UnifiedIncidentState(State):
  max_ticks: int
  workflow_stage: WorkflowStage
  active_alerts: list[Alert] = Field(default_factory=list)
+ noise_alerts: list[Alert] = Field(default_factory=list)
  service_health: dict[str, ServiceHealth] = Field(default_factory=dict)
  discovered_evidence: list[str] = Field(default_factory=list)
  recent_deploys: list[str] = Field(default_factory=list)
  checks: list[CheckResult] = Field(default_factory=list)
+ blast_radius: int = 0
+ noise_queries: int = 0
  user_impact: float = Field(ge=0.0, le=1.0)
  slo_burn_rate: float = Field(ge=0.0, le=1.0)
  incident_resolved: bool = False
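The extended `ServiceName` literal mixes the four core services with the noise pool. A client that wants to tell the two apart at runtime could derive the split from the literal itself. The sketch below copies the member names from the diff; `CORE_SERVICES`, `NOISE_SERVICES`, and `is_noise_service` are hypothetical names that do not appear in `models.py`.

```python
from typing import Literal, get_args

# Member names copied from the ServiceName literal in the diff above.
ServiceName = Literal[
    "api-gateway", "cache", "database", "worker",
    "stripe-webhook", "email-queue", "sessions-redis",
    "image-cdn", "feature-flags", "analytics",
]

# Assumption: the first four entries are the queryable core services and
# everything else is the distractor pool described in the code comment.
CORE_SERVICES = frozenset({"api-gateway", "cache", "database", "worker"})
NOISE_SERVICES = frozenset(get_args(ServiceName)) - CORE_SERVICES


def is_noise_service(name: str) -> bool:
    """Classify a service name, rejecting names outside the literal."""
    if name not in get_args(ServiceName):
        raise ValueError(f"Unknown service {name!r}")
    return name in NOISE_SERVICES
```

Deriving the noise set from `get_args` means a service added to the literal later is treated as noise by default, which may or may not be the desired failure mode.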
unified_incident_env/server/app.py CHANGED
@@ -68,7 +68,7 @@ def create_compatible_app():
  env_factory,
  UnifiedIncidentAction,
  UnifiedIncidentObservation,
- max_concurrent_envs=1,
+ max_concurrent_envs=int(os.environ.get("MAX_CONCURRENT_ENVS", "32")),
  )
 
  @app.get("/", include_in_schema=False)
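The new `max_concurrent_envs` expression raises a bare `ValueError` from `int()` if `MAX_CONCURRENT_ENVS` is set to a non-numeric string, and it accepts zero or negative values. A slightly more defensive parser might look like the sketch below. The function name `read_max_concurrent_envs` and the lower bound of 1 are assumptions, not part of `app.py`.

```python
import os
from typing import Mapping, Optional


def read_max_concurrent_envs(
    default: int = 32,
    environ: Optional[Mapping[str, str]] = None,
) -> int:
    """Parse MAX_CONCURRENT_ENVS, falling back to `default` when unset.

    Hypothetical helper: the commit inlines int(os.environ.get(...)) instead.
    """
    source = os.environ if environ is None else environ
    raw = source.get("MAX_CONCURRENT_ENVS", str(default))
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            f"MAX_CONCURRENT_ENVS must be an integer, got {raw!r}"
        ) from None
    if value < 1:
        # Assumed constraint: a pool of zero environments cannot serve sessions.
        raise ValueError(f"MAX_CONCURRENT_ENVS must be >= 1, got {value}")
    return value
```

Passing `environ` explicitly keeps the function testable without mutating the process environment.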
unified_incident_env/server/challenge.py CHANGED
@@ -3,6 +3,9 @@
3
  from __future__ import annotations
4
 
5
  from copy import deepcopy
6
  from typing import Any
7
 
8
  from ..models import (
@@ -15,8 +18,11 @@ from ..models import (
15
  )
16
 
17
  DEFAULT_SCENARIO_ID = "worker_deploy_cascade"
18
 
19
- SCENARIOS: dict[str, dict[str, Any]] = {
20
  "worker_deploy_cascade": {
21
  "id": "worker_deploy_cascade",
22
  "difficulty": "easy",
@@ -143,9 +149,525 @@ SCENARIOS: dict[str, dict[str, Any]] = {
143
  "affected_services": ["worker", "database", "api-gateway"],
144
  "best_next_action": "rollback_deploy",
145
  },
146
- }
147
  }
148

149
  _RUNTIME_PROGRESS: dict[str, Any] | None = None
150
 
151
 
@@ -155,15 +677,26 @@ def get_scenario(scenario_id: str) -> dict[str, Any]:
155
  return deepcopy(SCENARIOS[scenario_id])
156
 
157
 
158
- def scenario_for_difficulty(difficulty: str) -> dict[str, Any]:
159
- for scenario in SCENARIOS.values():
160
- if scenario["difficulty"] == difficulty:
161
- return deepcopy(scenario)
162
  raise ValueError(f"Unknown difficulty {difficulty!r}")
163
 
164
 
165
- def list_scenarios(difficulty: str | None = None) -> ScenarioCatalog:
166
- if difficulty is not None and difficulty != "easy":
167
  raise ValueError(f"Unknown difficulty {difficulty!r}")
168
  scenarios = [
169
  ScenarioSummary(
@@ -175,19 +708,18 @@ def list_scenarios(difficulty: str | None = None) -> ScenarioCatalog:
175
  optimal_ticks=scenario["optimal_ticks"],
176
  )
177
  for scenario in SCENARIOS.values()
178
- if difficulty is None or scenario["difficulty"] == difficulty
 
179
  ]
180
  return ScenarioCatalog(
181
  default_scenario_id=DEFAULT_SCENARIO_ID,
182
- available_difficulties=["easy"],
183
  filtered_difficulty=difficulty,
184
  scenarios=scenarios,
185
  )
186
 
187
 
188
- def _baseline_actions(scenario_id: str) -> list[BaselineStep]:
189
- if scenario_id != DEFAULT_SCENARIO_ID:
190
- raise ValueError(f"No baseline for scenario_id {scenario_id!r}")
191
  return [
192
  BaselineStep(
193
  action=UnifiedIncidentAction(action_type="query_deploys", service="worker"),
@@ -240,13 +772,135 @@ def _baseline_actions(scenario_id: str) -> list[BaselineStep]:
240
  ]
241
 
242
 
243
- def list_baselines(scenario_id: str | None = None) -> BaselineCatalog:
244
- scenario_ids = [scenario_id] if scenario_id is not None else [DEFAULT_SCENARIO_ID]
245
  baselines = [
246
  BaselineDefinition(
247
  scenario_id=current_id,
248
  name="deterministic-remediation-baseline",
249
- description="Minimal honest baseline that diagnoses from evidence, rolls back the worker, restarts the database, verifies recovery, and then declares resolved.",
250
  optimal_ticks=SCENARIOS[current_id]["optimal_ticks"],
251
  actions=_baseline_actions(current_id),
252
  )
 
3
  from __future__ import annotations
4
 
5
  from copy import deepcopy
6
+ import hashlib
7
+ import random
8
+ import re
9
  from typing import Any
10
 
11
  from ..models import (
 
18
  )
19
 
20
  DEFAULT_SCENARIO_ID = "worker_deploy_cascade"
21
+ PROCGEN_VARIANTS_PER_TEMPLATE = 4
22
+ _MINUTES_AGO_RE = re.compile(r"(\d+)\s+minutes ago")
23
+ _ROLLOUT_VERSION_RE = re.compile(r"(@\d{4}\.\d{2}\.\d{2}-)([a-z0-9-]+)")
24
 
25
+ _BASE_SCENARIOS: dict[str, dict[str, Any]] = {
26
  "worker_deploy_cascade": {
27
  "id": "worker_deploy_cascade",
28
  "difficulty": "easy",
 
149
  "affected_services": ["worker", "database", "api-gateway"],
150
  "best_next_action": "rollback_deploy",
151
  },
152
+ "remediation_recipe": {
153
+ "rollback_target": "worker",
154
+ "restart_target": "database",
155
+ "isolate_target": "worker",
156
+ "restart_requires_cause_removed": True,
157
+ "incident_driver": "worker",
158
+ "resolution_check": "end_to_end",
159
+ },
160
+ "post_rollback_services": {
161
+ "worker": {"status": "healthy", "cpu_pct": 32.0, "memory_pct": 37.0, "error_rate_pct": 2.0, "latency_ms": 40.0},
162
+ },
163
+ "post_rollback_user_impact": 0.55,
164
+ "post_rollback_slo_burn": 0.58,
165
+ "post_restart_services": {
166
+ "database": {"status": "healthy", "cpu_pct": 34.0, "memory_pct": 39.0, "error_rate_pct": 0.0, "latency_ms": 22.0},
167
+ "api-gateway": {"status": "healthy", "cpu_pct": 28.0, "memory_pct": 31.0, "error_rate_pct": 0.0, "latency_ms": 38.0},
168
+ },
169
+ "post_restart_user_impact": 0.14,
170
+ "post_restart_slo_burn": 0.18,
171
+ "post_isolate_services": {
172
+ "worker": {"status": "isolated", "cpu_pct": 8.0, "memory_pct": 18.0, "error_rate_pct": 0.0, "latency_ms": 0.0},
173
+ "database": {"status": "healthy", "cpu_pct": 41.0, "memory_pct": 46.0, "error_rate_pct": 0.0, "latency_ms": 26.0},
174
+ "api-gateway": {"status": "degraded", "cpu_pct": 34.0, "memory_pct": 33.0, "error_rate_pct": 7.0, "latency_ms": 91.0},
175
+ },
176
+ "post_isolate_user_impact": 0.45,
177
+ "post_isolate_slo_burn": 0.47,
178
+ "degraded_services": {
179
+ "worker": {"status": "degraded", "cpu_pct": 88.0, "memory_pct": 71.0, "error_rate_pct": 19.0, "latency_ms": 420.0},
180
+ "database": {"status": "crashed", "cpu_pct": 99.0, "memory_pct": 97.0, "error_rate_pct": 100.0, "latency_ms": 0.0},
181
+ "api-gateway": {"status": "degraded", "cpu_pct": 61.0, "memory_pct": 38.0, "error_rate_pct": 24.0, "latency_ms": 640.0},
182
+ },
183
+ "degraded_user_impact": 0.82,
184
+ "degraded_slo_burn": 0.91,
185
+ "failure_messages": {
186
+ "wrong_rollback_target": "Rolling back a service without a causal link wastes time and risk.",
187
+ "low_value_restart": "Restarting that service is not the safe next remediation step for this incident.",
188
+ "premature_restart": "Restarting before removing the trigger only causes another crash loop.",
189
+ "wrong_isolation_target": "Isolating that service does not contain the dominant failure path.",
190
+ },
191
+ "difficulty_knobs": {
192
+ "noise_services": ["stripe-webhook", "email-queue"],
193
+ "noise_alerts": [
194
+ {"service": "stripe-webhook", "severity": "warning", "message": "Stripe webhook retry volume slightly elevated (unrelated noise)."},
195
+ {"service": "email-queue", "severity": "warning", "message": "Email queue depth up 15% on a recurring 6h cycle (unrelated noise)."},
196
+ ],
197
+ "noise_logs": {
198
+ "stripe-webhook": "Webhook retries are within normal diurnal bounds; no payment-path regression.",
199
+ "email-queue": "Queue depth tracks the usual Monday-evening marketing batch; no regression.",
200
+ },
201
+ "blast_radius_budget": 2,
202
+ },
203
+ },
204
+ "db_config_rollout": {
205
+ "id": "db_config_rollout",
206
+ "difficulty": "medium",
207
+ "name": "Database Config Rollout Regression",
208
+ "description": (
209
+ "A database config push cut connection pool size and write requests now time out. "
210
+ "A separate worker deploy landed around the same time and looks suspicious but is not the cause. "
211
+ "The agent must avoid the decoy, roll back the database config, restart it, and verify recovery."
212
+ ),
213
+ "root_cause": "A bad database config rollout shrank the connection pool and is dropping writes.",
214
+ "optimal_ticks": 10,
215
+ "max_ticks": 12,
216
+ "critical_service_weights": {
217
+ "worker": 0.2,
218
+ "database": 0.5,
219
+ "api-gateway": 0.3,
220
+ "cache": 0.0,
221
+ },
222
+ "reward_config": {
223
+ "step_cost": 0.01,
224
+ "redundant_action_penalty": 0.02,
225
+ "unsafe_action_penalty": 0.08,
226
+ "premature_resolution_penalty": 0.2,
227
+ "successful_resolution_bonus": 0.25,
228
+ "hypothesis_bonus_scale": 0.12,
229
+ "forbidden_reward_sources": [
230
+ "evidence_discovery",
231
+ "query_success",
232
+ "unlock_events",
233
+ "stage_advancement",
234
+ "patch_id_selection",
235
+ ],
236
+ },
237
+ "initial_services": {
238
+ "api-gateway": {
239
+ "status": "degraded",
240
+ "cpu_pct": 44.0,
241
+ "memory_pct": 36.0,
242
+ "error_rate_pct": 17.0,
243
+ "latency_ms": 520.0,
244
+ },
245
+ "cache": {
246
+ "status": "healthy",
247
+ "cpu_pct": 20.0,
248
+ "memory_pct": 26.0,
249
+ "error_rate_pct": 0.0,
250
+ "latency_ms": 15.0,
251
+ },
252
+ "database": {
253
+ "status": "degraded",
254
+ "cpu_pct": 62.0,
255
+ "memory_pct": 54.0,
256
+ "error_rate_pct": 48.0,
257
+ "latency_ms": 880.0,
258
+ },
259
+ "worker": {
260
+ "status": "degraded",
261
+ "cpu_pct": 51.0,
262
+ "memory_pct": 44.0,
263
+ "error_rate_pct": 12.0,
264
+ "latency_ms": 310.0,
265
+ },
266
+ },
267
+ "initial_alerts": [
268
+ {
269
+ "service": "database",
270
+ "severity": "critical",
271
+ "message": "Database connection acquire timeouts at 48% and climbing.",
272
+ },
273
+ {
274
+ "service": "api-gateway",
275
+ "severity": "warning",
276
+ "message": "Write-path requests are returning sustained 5xx.",
277
+ },
278
+ {
279
+ "service": "worker",
280
+ "severity": "warning",
281
+ "message": "Worker write latency is elevated; retries are climbing.",
282
+ },
283
+ ],
284
+ "logs": {
285
+ "api-gateway": (
286
+ "Gateway upstream errors are downstream-driven: writes to the worker path return pool-exhaustion "
287
+ "errors originating from the database. No gateway deploys recorded in the last 24h."
288
+ ),
289
+ "cache": "Cache reads are healthy and unrelated to the current write-path failures.",
290
+ "database": (
291
+ "Database logs show 'could not acquire connection' errors immediately after config rollout "
292
+ "db@2026.04.24-cfg lowered max_connections from 80 to 12."
293
+ ),
294
+ "worker": (
295
+ "Worker logs show retries driven by downstream database pool exhaustion, not local faults. "
296
+ "Worker code deploy worker@2026.04.24-refactor is unrelated to the pool error signature."
297
+ ),
298
+ },
299
+ "metrics": {
300
+ "api-gateway": {
301
+ "error_rate": "Gateway 5xx rate is 17% and matches the database pool-exhaustion windows one-for-one.",
302
+ "latency": "Gateway p95 climbed to 520ms waiting on database connection acquire.",
303
+ },
304
+ "database": {
305
+ "cpu": "Database CPU is moderate (~62%), so this is not a compute overload pattern.",
306
+ "error_rate": "Database error rate is 48% and dominated by 'connection acquire timeout'.",
307
+ "latency": "Database write latency jumped to 880ms after the config rollout.",
308
+ },
309
+ "worker": {
310
+ "cpu": "Worker CPU is 51% — no local overload; retries are reactive.",
311
+ "error_rate": "Worker errors are retries against the saturated database pool.",
312
+ },
313
+ },
314
+ "dependencies": {
315
+ "api-gateway": "api-gateway -> worker -> database",
316
+ "worker": "worker -> database",
317
+ "database": "database is the terminal dependency; pool exhaustion here starves all upstream writers",
318
+ },
319
+ "deploy_history": {
320
+ "api-gateway": "No gateway deploys in the last 24h.",
321
+ "cache": "No cache deploys in the last 24h.",
322
+ "database": "Applied config db@2026.04.24-cfg 15 minutes ago (max_connections 80 -> 12).",
323
+ "worker": "Rolled out worker@2026.04.24-refactor 22 minutes ago (unrelated code cleanup).",
324
+ },
325
+ "checks": {
326
+ "database_recovery": "Confirms database write latency and pool health are back within SLO.",
327
+ "end_to_end": "Confirms gateway write-path traffic succeeds end-to-end.",
328
+ },
329
+ "truth": {
330
+ "root_cause": "database_only_failure",
331
+ "affected_services": ["database", "api-gateway", "worker"],
332
+ "best_next_action": "rollback_deploy",
333
+ },
334
+ "remediation_recipe": {
335
+ "rollback_target": "database",
336
+ "restart_target": "database",
337
+ "isolate_target": None,
338
+ "restart_requires_cause_removed": True,
339
+ "incident_driver": "database",
340
+ "resolution_check": "end_to_end",
341
+ },
342
+ "post_rollback_services": {
343
+ "database": {"status": "degraded", "cpu_pct": 48.0, "memory_pct": 42.0, "error_rate_pct": 6.0, "latency_ms": 120.0},
344
+ },
345
+ "post_rollback_user_impact": 0.40,
346
+ "post_rollback_slo_burn": 0.45,
347
+ "post_restart_services": {
348
+ "database": {"status": "healthy", "cpu_pct": 36.0, "memory_pct": 40.0, "error_rate_pct": 0.0, "latency_ms": 26.0},
349
+ "api-gateway": {"status": "healthy", "cpu_pct": 29.0, "memory_pct": 30.0, "error_rate_pct": 0.0, "latency_ms": 44.0},
350
+ "worker": {"status": "healthy", "cpu_pct": 33.0, "memory_pct": 36.0, "error_rate_pct": 1.0, "latency_ms": 48.0},
351
+ },
352
+ "post_restart_user_impact": 0.10,
353
+ "post_restart_slo_burn": 0.14,
354
+ "post_isolate_services": {},
355
+ "post_isolate_user_impact": 0.70,
356
+ "post_isolate_slo_burn": 0.75,
357
+ "degraded_services": {
358
+ "database": {"status": "degraded", "cpu_pct": 62.0, "memory_pct": 54.0, "error_rate_pct": 48.0, "latency_ms": 880.0},
359
+ "api-gateway": {"status": "degraded", "cpu_pct": 44.0, "memory_pct": 36.0, "error_rate_pct": 17.0, "latency_ms": 520.0},
360
+ "worker": {"status": "degraded", "cpu_pct": 51.0, "memory_pct": 44.0, "error_rate_pct": 12.0, "latency_ms": 310.0},
361
+ },
362
+ "degraded_user_impact": 0.70,
363
+ "degraded_slo_burn": 0.78,
364
+ "failure_messages": {
365
+ "wrong_rollback_target": "The worker deploy is a decoy; worker errors are reactive to database pool exhaustion.",
366
+ "low_value_restart": "Restarting that service does not address a database-config regression.",
367
+ "premature_restart": "Restarting the database before rolling back the config will re-inherit the 12-connection pool and fail again.",
368
+ "wrong_isolation_target": "Isolation is not useful here: the cause is a config regression, not a runaway service.",
369
+ },
370
+ "difficulty_knobs": {
371
+ "noise_services": ["sessions-redis", "analytics"],
372
+ "noise_alerts": [
373
+ {"service": "sessions-redis", "severity": "warning", "message": "Sessions-redis p99 latency nudged up 8ms (unrelated noise)."},
374
+ {"service": "analytics", "severity": "warning", "message": "Analytics consumer lag up to 45s from baseline 30s (unrelated noise)."},
375
+ ],
376
+ "noise_logs": {
377
+ "sessions-redis": "No errors on sessions-redis; hit ratio stable.",
378
+ "analytics": "Analytics consumer lag fluctuation consistent with upstream Kafka producer batching, unrelated to current incident.",
379
+ },
380
+ "blast_radius_budget": 2,
381
+ },
382
+ },
383
+ "gateway_auth_rollout": {
384
+ "id": "gateway_auth_rollout",
385
+ "difficulty": "hard",
386
+ "name": "Gateway Auth Rollout Regression",
387
+ "description": (
388
+ "A new api-gateway auth-middleware rollout is rejecting ~40% of valid logins. "
389
+ "A recent worker deploy and elevated worker queue depth make the worker look like a plausible suspect. "
390
+ "The agent must localize to the gateway, roll back its deploy, and verify recovery without unnecessary restarts."
391
+ ),
392
+ "root_cause": "A bad api-gateway auth-middleware rollout is rejecting valid logins.",
393
+ "optimal_ticks": 8,
394
+ "max_ticks": 10,
395
+ "critical_service_weights": {
396
+ "worker": 0.15,
397
+ "database": 0.15,
398
+ "api-gateway": 0.70,
399
+ "cache": 0.0,
400
+ },
401
+ "reward_config": {
402
+ "step_cost": 0.01,
403
+ "redundant_action_penalty": 0.02,
404
+ "unsafe_action_penalty": 0.12,
405
+ "premature_resolution_penalty": 0.3,
406
+ "successful_resolution_bonus": 0.3,
407
+ "hypothesis_bonus_scale": 0.12,
408
+ "forbidden_reward_sources": [
409
+ "evidence_discovery",
410
+ "query_success",
411
+ "unlock_events",
412
+ "stage_advancement",
413
+ "patch_id_selection",
414
+ ],
415
+ },
416
+ "initial_services": {
417
+ "api-gateway": {
418
+ "status": "degraded",
419
+ "cpu_pct": 38.0,
420
+ "memory_pct": 42.0,
421
+ "error_rate_pct": 41.0,
422
+ "latency_ms": 180.0,
423
+ },
424
+ "cache": {
425
+ "status": "healthy",
426
+ "cpu_pct": 17.0,
427
+ "memory_pct": 23.0,
428
+ "error_rate_pct": 0.0,
429
+ "latency_ms": 12.0,
430
+ },
431
+ "database": {
432
+ "status": "healthy",
433
+ "cpu_pct": 38.0,
434
+ "memory_pct": 41.0,
435
+ "error_rate_pct": 1.0,
436
+ "latency_ms": 28.0,
437
+ },
438
+ "worker": {
439
+ "status": "degraded",
440
+ "cpu_pct": 63.0,
441
+ "memory_pct": 48.0,
442
+ "error_rate_pct": 4.0,
443
+ "latency_ms": 220.0,
444
+ },
445
+ },
446
+ "initial_alerts": [
447
+ {
448
+ "service": "api-gateway",
449
+ "severity": "critical",
450
+ "message": "Gateway is returning 401 on ~40% of valid login attempts.",
451
+ },
452
+ {
453
+ "service": "worker",
454
+ "severity": "warning",
455
+ "message": "Worker queue depth is elevated from the retry storm upstream.",
456
+ },
457
+ ],
458
+ "logs": {
459
+ "api-gateway": (
460
+ "Gateway logs show auth-middleware rejecting tokens with valid signatures. "
461
+ "Rejection rate started exactly at the gateway@2026.04.24-auth rollout boundary."
462
+ ),
463
+ "cache": "Cache hit ratio stable and unrelated.",
464
+ "database": "Database logs are clean; no increase in errors or latency.",
465
+ "worker": (
466
+ "Worker logs show client-side retry storms triggered by upstream 401s, not local faults. "
467
+ "Worker deploy worker@2026.04.24-hotfix is a log-format tweak and does not touch auth."
468
+ ),
469
+ },
470
+ "metrics": {
471
+ "api-gateway": {
472
+ "error_rate": "Gateway error rate is 41%, dominated by 401 responses (auth failures).",
473
+ "latency": "Gateway latency is normal — errors are fast rejections, not timeouts.",
474
+ },
475
+ "database": {
476
+ "cpu": "Database CPU is 38% (normal).",
477
+ "error_rate": "Database error rate is ~1% and flat.",
478
+ },
479
+ "worker": {
480
+ "cpu": "Worker CPU is 63% from retry volume, not workload.",
481
+ "error_rate": "Worker errors are reactive retries, not primary failures.",
482
+ },
483
+ },
484
+ "dependencies": {
485
+ "api-gateway": "api-gateway -> (auth) -> worker -> database",
486
+ "worker": "worker -> database",
487
+ "database": "database is healthy; it is not on the fault path",
488
+ },
489
+ "deploy_history": {
490
+ "api-gateway": "Rolled out gateway@2026.04.24-auth 9 minutes ago (auth middleware rewrite).",
491
+ "cache": "No cache deploys in the last 24h.",
492
+ "database": "No database deploys in the last 24h.",
493
+ "worker": "Rolled out worker@2026.04.24-hotfix 18 minutes ago (log-format tweak, no auth changes).",
494
+ },
495
+ "checks": {
496
+ "database_recovery": "Confirms the database is healthy (always healthy in this scenario).",
497
+ "end_to_end": "Confirms gateway login traffic succeeds end-to-end.",
498
+ },
499
+ "truth": {
500
+ "root_cause": "api_gateway_fault",
501
+ "affected_services": ["api-gateway", "worker"],
502
+ "best_next_action": "rollback_deploy",
503
+ },
504
+ "remediation_recipe": {
505
+ "rollback_target": "api-gateway",
506
+ "restart_target": None,
507
+ "isolate_target": "api-gateway",
508
+ "restart_requires_cause_removed": True,
509
+ "incident_driver": "api-gateway",
510
+ "resolution_check": "end_to_end",
511
+ },
512
+ "post_rollback_services": {
513
+ "api-gateway": {"status": "healthy", "cpu_pct": 30.0, "memory_pct": 34.0, "error_rate_pct": 1.0, "latency_ms": 38.0},
514
+ "worker": {"status": "healthy", "cpu_pct": 34.0, "memory_pct": 36.0, "error_rate_pct": 1.0, "latency_ms": 52.0},
515
+ },
516
+ "post_rollback_user_impact": 0.12,
517
+ "post_rollback_slo_burn": 0.18,
518
+ "post_restart_services": {},
519
+ "post_restart_user_impact": 0.12,
520
+ "post_restart_slo_burn": 0.18,
521
+ "post_isolate_services": {
522
+ "api-gateway": {"status": "isolated", "cpu_pct": 6.0, "memory_pct": 14.0, "error_rate_pct": 0.0, "latency_ms": 0.0},
523
+ },
524
+ "post_isolate_user_impact": 0.55,
525
+ "post_isolate_slo_burn": 0.60,
526
+ "degraded_services": {
527
+ "api-gateway": {"status": "degraded", "cpu_pct": 38.0, "memory_pct": 42.0, "error_rate_pct": 41.0, "latency_ms": 180.0},
528
+ "worker": {"status": "degraded", "cpu_pct": 63.0, "memory_pct": 48.0, "error_rate_pct": 4.0, "latency_ms": 220.0},
529
+ },
530
+ "degraded_user_impact": 0.65,
531
+ "degraded_slo_burn": 0.72,
532
+ "failure_messages": {
533
+ "wrong_rollback_target": "The worker deploy is a log-format tweak and is not on the auth fault path.",
534
+ "low_value_restart": "Restarting a service does not fix a config/middleware regression rolled out as a deploy.",
535
+ "premature_restart": "Restarting before rolling back the gateway auth change just restarts the same bad middleware.",
536
+ "wrong_isolation_target": "Isolating workers or database cuts healthy traffic without fixing the gateway auth fault.",
537
+ },
538
+ "difficulty_knobs": {
539
+ "noise_services": ["stripe-webhook", "image-cdn", "feature-flags"],
540
+ "noise_alerts": [
541
+ {"service": "stripe-webhook", "severity": "warning", "message": "Stripe webhook signing drift warning — known benign noise from clock skew."},
542
+ {"service": "image-cdn", "severity": "warning", "message": "Image CDN purge lag on asia-east1 edge (unrelated noise)."},
543
+ {"service": "feature-flags", "severity": "warning", "message": "Feature-flags subscriber reconnected after routine rotation (unrelated noise)."},
544
+ ],
545
+ "noise_logs": {
546
+ "stripe-webhook": "Webhook signature log shows no delivery failures; flagged warnings are clock-skew benign.",
547
+ "image-cdn": "CDN purge lag is within published SLA; no customer-visible impact.",
548
+ "feature-flags": "Feature-flags consumer reconnect logs are routine rotation; no delivery loss.",
549
+ },
550
+ "blast_radius_budget": 1,
551
+ },
552
+ },
553
  }
554
 
555
+
556
+ def _stable_rng(*parts: object) -> random.Random:
557
+ seed_material = "::".join(str(part) for part in parts)
558
+ digest = hashlib.sha256(seed_material.encode("utf-8")).hexdigest()
559
+ return random.Random(int(digest[:16], 16))
560
+
561
+
562
+ def _clamp(value: float, lower: float, upper: float) -> float:
563
+ return max(lower, min(upper, value))
564
+
565
+
566
+ def _jitter_metric(value: float, *, rng: random.Random, spread: float, floor: float = 0.0, ceil: float = 100.0) -> float:
567
+ if value == 0.0:
568
+ return 0.0
569
+ delta = value * rng.uniform(-spread, spread)
570
+ return round(_clamp(value + delta, floor, ceil), 1)
571
+
572
+
573
+ def _jitter_latency(value: float, *, rng: random.Random, spread: float) -> float:
574
+ if value == 0.0:
575
+ return 0.0
576
+ delta = value * rng.uniform(-spread, spread)
577
+ return round(max(0.0, value + delta), 1)
578
+
579
+
580
+ def _mutate_service_table(table: dict[str, dict[str, Any]], *, rng: random.Random, spread: float) -> dict[str, dict[str, Any]]:
581
+ mutated: dict[str, dict[str, Any]] = {}
582
+ for service_name, payload in table.items():
583
+ item = dict(payload)
584
+ item["cpu_pct"] = _jitter_metric(float(item["cpu_pct"]), rng=rng, spread=spread)
585
+ item["memory_pct"] = _jitter_metric(float(item["memory_pct"]), rng=rng, spread=spread)
586
+ item["error_rate_pct"] = _jitter_metric(float(item["error_rate_pct"]), rng=rng, spread=spread)
587
+ item["latency_ms"] = _jitter_latency(float(item["latency_ms"]), rng=rng, spread=spread)
588
+ mutated[service_name] = item
589
+ return mutated
590
+
591
+
592
+ def _mutate_deploy_text(text: str, *, rng: random.Random, service: str) -> str:
593
+ age_minutes = rng.randint(6, 28)
594
+ rollout_suffix = f"{service[:3]}{rng.randint(11, 98)}"
595
+ updated = _MINUTES_AGO_RE.sub(f"{age_minutes} minutes ago", text, count=1)
596
+ return _ROLLOUT_VERSION_RE.sub(rf"\1{rollout_suffix}", updated, count=1)
597
+
598
+
599
+ def _mutate_noise_knobs(knobs: dict[str, Any], *, rng: random.Random, variant_index: int) -> dict[str, Any]:
600
+ mutated = deepcopy(knobs)
601
+ noise_services = list(mutated.get("noise_services", []))
602
+ if not noise_services:
603
+ return mutated
604
+ rotation = variant_index % len(noise_services)
605
+ rotated_services = noise_services[rotation:] + noise_services[:rotation]
606
+ alert_pool = {item["service"]: dict(item) for item in mutated.get("noise_alerts", [])}
607
+ log_pool = dict(mutated.get("noise_logs", {}))
608
+ selected_count = min(len(rotated_services), max(1, 1 + (variant_index % len(rotated_services))))
609
+ selected_services = rotated_services[:selected_count]
610
+ mutated["noise_services"] = selected_services
611
+ mutated["noise_alerts"] = [alert_pool[service] for service in selected_services if service in alert_pool]
612
+ mutated["noise_logs"] = {service: log_pool[service] for service in selected_services if service in log_pool}
613
+ return mutated
614
+
615
+
+ def _procgen_variant_id(template_id: str, variant_index: int) -> str:
+     return f"{template_id}__p{variant_index + 1:02d}"
+
+
+ def _materialize_procgen_variant(template_id: str, template: dict[str, Any], *, variant_index: int) -> dict[str, Any]:
+     rng = _stable_rng(template_id, variant_index)
+     spread_by_difficulty = {
+         "easy": 0.05,
+         "medium": 0.08,
+         "hard": 0.10,
+     }
+     spread = spread_by_difficulty.get(template["difficulty"], 0.06)
+     scenario = deepcopy(template)
+     scenario["id"] = _procgen_variant_id(template_id, variant_index)
+     scenario["template_id"] = template_id
+     scenario["is_procgen"] = True
+     scenario["name"] = f"{template['name']} [procgen {variant_index + 1}]"
+     scenario["description"] = (
+         f"{template['description']} "
+         f"Variant {variant_index + 1} reshuffles timing and distractor noise."
+     )
+     for key in (
+         "initial_services",
+         "degraded_services",
+         "post_rollback_services",
+         "post_restart_services",
+         "post_isolate_services",
+     ):
+         scenario[key] = _mutate_service_table(template.get(key, {}), rng=rng, spread=spread)
+     scenario["deploy_history"] = {
+         service: _mutate_deploy_text(text, rng=rng, service=service)
+         for service, text in template.get("deploy_history", {}).items()
+     }
+     scenario["difficulty_knobs"] = _mutate_noise_knobs(template.get("difficulty_knobs", {}), rng=rng, variant_index=variant_index)
+     return scenario
+
+
+ def _build_scenarios() -> dict[str, dict[str, Any]]:
+     catalog: dict[str, dict[str, Any]] = {}
+     for template_id, scenario in _BASE_SCENARIOS.items():
+         catalog[template_id] = deepcopy(scenario)
+         catalog[template_id]["template_id"] = template_id
+         catalog[template_id]["is_procgen"] = False
+         for variant_index in range(PROCGEN_VARIANTS_PER_TEMPLATE):
+             variant = _materialize_procgen_variant(
+                 template_id,
+                 catalog[template_id],
+                 variant_index=variant_index,
+             )
+             catalog[variant["id"]] = variant
+     return catalog
+
+
+ SCENARIOS: dict[str, dict[str, Any]] = _build_scenarios()
+
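Review note: a minimal sketch of the catalog keys that `_build_scenarios` would produce, assuming `PROCGEN_VARIANTS_PER_TEMPLATE` is 4 (the commit message mentions four jittered variants per template) and assuming the three template IDs are the keys that appear in `_BASELINE_BUILDERS`. The helper names `variant_id` and `catalog_ids` are hypothetical.

```python
PROCGEN_VARIANTS_PER_TEMPLATE = 4  # assumption taken from the commit message


def variant_id(template_id: str, variant_index: int) -> str:
    # Same format string as _procgen_variant_id: 1-based, zero-padded to two digits.
    return f"{template_id}__p{variant_index + 1:02d}"


def catalog_ids(template_ids: list[str]) -> list[str]:
    """Hypothetical helper: list catalog keys in insertion order."""
    ids: list[str] = []
    for template_id in template_ids:
        ids.append(template_id)  # the hand-written template comes first
        ids.extend(variant_id(template_id, i) for i in range(PROCGEN_VARIANTS_PER_TEMPLATE))
    return ids


templates = ["worker_deploy_cascade", "db_config_rollout", "gateway_auth_rollout"]
ids = catalog_ids(templates)
print(len(ids))
print(ids[:3])
```

Under these assumptions the catalog holds 15 entries, matching the "15 scenarios total" figure in the commit message, and each template key precedes its own variants.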
  _RUNTIME_PROGRESS: dict[str, Any] | None = None



      return deepcopy(SCENARIOS[scenario_id])


+ SUPPORTED_DIFFICULTIES: tuple[str, ...] = ("easy", "medium", "hard")
+
+
+ def scenario_for_difficulty(difficulty: str, seed: int | None = None) -> dict[str, Any]:
+     matches = [
+         scenario
+         for scenario in SCENARIOS.values()
+         if scenario["difficulty"] == difficulty
+     ]
+     if seed is None:
+         for scenario in matches:
+             if not scenario.get("is_procgen", False):
+                 return deepcopy(scenario)
+     if matches:
+         return deepcopy(matches[(seed or 0) % len(matches)])
      raise ValueError(f"Unknown difficulty {difficulty!r}")


+ def list_scenarios(difficulty: str | None = None, include_procgen: bool = True) -> ScenarioCatalog:
+     if difficulty is not None and difficulty not in SUPPORTED_DIFFICULTIES:
          raise ValueError(f"Unknown difficulty {difficulty!r}")
      scenarios = [
          ScenarioSummary(

              optimal_ticks=scenario["optimal_ticks"],
          )
          for scenario in SCENARIOS.values()
+         if (difficulty is None or scenario["difficulty"] == difficulty)
+         and (include_procgen or not scenario.get("is_procgen", False))
      ]
      return ScenarioCatalog(
          default_scenario_id=DEFAULT_SCENARIO_ID,
+         available_difficulties=list(SUPPORTED_DIFFICULTIES),
          filtered_difficulty=difficulty,
          scenarios=scenarios,
      )


+ def _worker_cascade_baseline() -> list[BaselineStep]:
      return [
          BaselineStep(
              action=UnifiedIncidentAction(action_type="query_deploys", service="worker"),

      ]


+ def _db_config_rollout_baseline() -> list[BaselineStep]:
+     return [
+         BaselineStep(
+             action=UnifiedIncidentAction(action_type="query_logs", service="database"),
+             rationale="Database is the loudest alert; inspect logs for the actual error signature.",
+         ),
+         BaselineStep(
+             action=UnifiedIncidentAction(action_type="query_deploys", service="database"),
+             rationale="Pool-acquire errors suggest a config change; check recent database rollouts.",
+         ),
+         BaselineStep(
+             action=UnifiedIncidentAction(action_type="query_metrics", service="database", metric="error_rate"),
+             rationale="Confirm the error pattern is pool exhaustion rather than compute overload.",
+         ),
+         BaselineStep(
+             action=UnifiedIncidentAction(action_type="query_logs", service="worker"),
+             rationale="Rule out the decoy worker deploy by reading worker logs directly.",
+         ),
+         BaselineStep(
+             action=UnifiedIncidentAction(
+                 action_type="submit_hypothesis",
+                 hypothesis={
+                     "root_cause": "database_only_failure",
+                     "affected_services": ["database", "api-gateway", "worker"],
+                     "confidence": 0.8,
+                     "recommended_next_action": "rollback_deploy",
+                 },
+             ),
+             rationale="Localize the fault to the database config before remediating.",
+         ),
+         BaselineStep(
+             action=UnifiedIncidentAction(action_type="rollback_deploy", service="database"),
+             rationale="Roll back the offending database config rollout.",
+         ),
+         BaselineStep(
+             action=UnifiedIncidentAction(action_type="restart_service", service="database"),
+             rationale="Restart the database cleanly against the restored pool config.",
+         ),
+         BaselineStep(
+             action=UnifiedIncidentAction(action_type="run_check", check_name="database_recovery"),
+             rationale="Verify database pool health and write latency are back within SLO.",
+         ),
+         BaselineStep(
+             action=UnifiedIncidentAction(action_type="run_check", check_name="end_to_end"),
+             rationale="Verify gateway write-path traffic succeeds end-to-end.",
+         ),
+         BaselineStep(
+             action=UnifiedIncidentAction(action_type="declare_resolved"),
+             rationale="Declare resolved only after objective checks pass.",
+         ),
+     ]
+
+
+ def _gateway_auth_rollout_baseline() -> list[BaselineStep]:
+     return [
+         BaselineStep(
+             action=UnifiedIncidentAction(action_type="query_logs", service="api-gateway"),
+             rationale="Gateway is rejecting logins; read gateway logs to localize the rejection class.",
+         ),
+         BaselineStep(
+             action=UnifiedIncidentAction(action_type="query_deploys", service="api-gateway"),
+             rationale="Login rejection aligns with a recent auth middleware rollout; confirm deploy timing.",
+         ),
+         BaselineStep(
+             action=UnifiedIncidentAction(action_type="query_deploys", service="worker"),
+             rationale="Rule out the worker deploy explicitly rather than assuming.",
+         ),
+         BaselineStep(
+             action=UnifiedIncidentAction(
+                 action_type="submit_hypothesis",
+                 hypothesis={
+                     "root_cause": "api_gateway_fault",
+                     "affected_services": ["api-gateway", "worker"],
+                     "confidence": 0.85,
+                     "recommended_next_action": "rollback_deploy",
+                 },
+             ),
+             rationale="Commit a calibrated hypothesis localizing to the gateway auth rollout.",
+         ),
+         BaselineStep(
+             action=UnifiedIncidentAction(action_type="rollback_deploy", service="api-gateway"),
+             rationale="Roll back the bad auth middleware rollout; no restart needed.",
+         ),
+         BaselineStep(
+             action=UnifiedIncidentAction(action_type="run_check", check_name="end_to_end"),
+             rationale="Verify that gateway login traffic now succeeds end-to-end.",
+         ),
+         BaselineStep(
+             action=UnifiedIncidentAction(action_type="run_check", check_name="database_recovery"),
+             rationale="Confirm the database is (and stayed) healthy throughout.",
+         ),
+         BaselineStep(
+             action=UnifiedIncidentAction(action_type="declare_resolved"),
+             rationale="Declare resolved only after objective checks pass.",
+         ),
+     ]
+
+
+ _BASELINE_BUILDERS = {
+     "worker_deploy_cascade": _worker_cascade_baseline,
+     "db_config_rollout": _db_config_rollout_baseline,
+     "gateway_auth_rollout": _gateway_auth_rollout_baseline,
+ }
+
+
+ def _baseline_actions(scenario_id: str) -> list[BaselineStep]:
+     template_id = SCENARIOS[scenario_id].get("template_id", scenario_id)
+     builder = _BASELINE_BUILDERS.get(template_id)
+     if builder is None:
+         raise ValueError(f"No baseline for scenario_id {scenario_id!r}")
+     return builder()
+
+
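Review note: the `template_id` dispatch in `_baseline_actions` is what lets a procgen variant reuse its template's scripted baseline. The sketch below is a hypothetical reduction: `SCENARIOS` and `BUILDERS` are toy stand-ins, the `orphan` entry is invented to show the error path, and the builders return plain strings rather than `BaselineStep` objects.

```python
from typing import Callable

# Toy stand-ins for the real catalog and builder table (assumed shapes).
SCENARIOS = {
    "db_config_rollout": {"template_id": "db_config_rollout"},
    "db_config_rollout__p02": {"template_id": "db_config_rollout"},
    "orphan": {},  # no template_id and no builder registered under its own id
}
BUILDERS: dict[str, Callable[[], list[str]]] = {
    "db_config_rollout": lambda: ["query_logs", "rollback_deploy", "declare_resolved"],
}


def baseline_actions(scenario_id: str) -> list[str]:
    # Fall back to the scenario's own id when no template_id is recorded.
    template_id = SCENARIOS[scenario_id].get("template_id", scenario_id)
    builder = BUILDERS.get(template_id)
    if builder is None:
        raise ValueError(f"No baseline for scenario_id {scenario_id!r}")
    return builder()


print(baseline_actions("db_config_rollout__p02"))
```

Note that an ID missing from the catalog would surface as a `KeyError` from the dictionary lookup rather than the `ValueError`; in the diff, `list_baselines` validates the ID before calling `_baseline_actions`.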
+ def list_baselines(scenario_id: str | None = None, include_procgen: bool = True) -> BaselineCatalog:
+     if scenario_id is not None:
+         if scenario_id not in SCENARIOS:
+             raise ValueError(f"Unknown scenario_id {scenario_id!r}")
+         scenario_ids = [scenario_id]
+     else:
+         scenario_ids = [
+             current_id
+             for current_id, scenario in SCENARIOS.items()
+             if include_procgen or not scenario.get("is_procgen", False)
+         ]
      baselines = [
          BaselineDefinition(
              scenario_id=current_id,
              name="deterministic-remediation-baseline",
+             description=SCENARIOS[current_id]["description"],
              optimal_ticks=SCENARIOS[current_id]["optimal_ticks"],
              actions=_baseline_actions(current_id),
          )
unified_incident_env/server/environment.py CHANGED
@@ -58,7 +58,7 @@ STATUS_VALUES = {
58
  class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncidentObservation, UnifiedIncidentState]):
59
  """A bounded-action incident diagnosis and safe remediation environment."""
60
 
61
- SUPPORTS_CONCURRENT_SESSIONS = False
62
 
63
  def __init__(self) -> None:
64
  super().__init__()
@@ -78,13 +78,12 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
78
  )
79
 
80
  def reset(self, seed: int | None = None, episode_id: str | None = None, **kwargs: Any) -> UnifiedIncidentObservation:
81
- del seed
82
  scenario_id = kwargs.get("scenario_id")
83
  difficulty = kwargs.get("difficulty")
84
  if scenario_id:
85
  scenario = get_scenario(scenario_id)
86
  elif difficulty:
87
- scenario = scenario_for_difficulty(difficulty)
88
  else:
89
  scenario = get_scenario(DEFAULT_SCENARIO_ID)
90
  self._episode = self._make_episode(scenario, episode_id=episode_id)
@@ -204,6 +203,11 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
204
  "database_recovery": CheckResult(name="database_recovery", passed=False, detail="Database recovery has not been verified yet."),
205
  "end_to_end": CheckResult(name="end_to_end", passed=False, detail="End-to-end health has not been verified yet."),
206
  }
 
 
 
 
 
207
  return {
208
  "episode_id": episode_id or str(uuid.uuid4()),
209
  "scenario": scenario,
@@ -213,16 +217,16 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
213
  "difficulty": scenario["difficulty"],
214
  "services": services,
215
  "alerts": [Alert(**payload) for payload in scenario["initial_alerts"]],
 
216
  "discovered_evidence": [],
217
  "evidence_seen": set(),
218
- "recent_deploys": [scenario["deploy_history"]["worker"]],
219
  "checks": checks,
220
- "user_impact": 0.82,
221
- "slo_burn_rate": 0.91,
222
  "containment_applied": False,
223
  "cause_removed": False,
224
- "worker_isolated": False,
225
- "worker_version": "worker@2026.04.23-bad",
226
  "hypothesis_seen": set(),
227
  "failure_type": None,
228
  "why_failed": None,
@@ -233,12 +237,16 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
233
  "workflow_stage": "triage",
234
  "cumulative_reward": 0.0,
235
  "wasteful_ticks": 0,
 
 
236
  "score_breakdown": {
237
  "recovery_score": 0.0,
238
  "containment_score": 0.0,
239
  "verification_score": 0.0,
240
  "impact_score": 0.0,
241
- "efficiency_score": 0.10,
 
 
242
  "final_score": 0.10,
243
  },
244
  "final_score": 0.10,
@@ -246,20 +254,44 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
246
  "done": False,
247
  }
248
 
 
 
 
 
 
 
 
 
 
 
249
  def _query_logs(self, service: str | None) -> str:
250
  assert service is not None
 
 
 
 
 
251
  return self._episode["scenario"]["logs"][service]
252
 
253
  def _query_metrics(self, service: str | None, metric: str | None) -> str:
254
  assert service is not None and metric is not None
 
 
 
255
  return self._episode["scenario"]["metrics"][service][metric]
256
 
257
  def _query_dependencies(self, service: str | None) -> str:
258
  assert service is not None
 
 
 
259
  return self._episode["scenario"]["dependencies"][service]
260
 
261
  def _query_deploys(self, service: str | None) -> str:
262
  assert service is not None
 
 
 
263
  return self._episode["scenario"]["deploy_history"][service]
264
 
265
  def _submit_hypothesis(self, action: UnifiedIncidentAction) -> tuple[float, bool, str]:
@@ -280,108 +312,112 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
280
  reward = (0.04 * cause_match) + (0.03 * service_match) + (0.03 * action_quality) + (0.02 * calibration)
281
  return round(reward, 4), True, "Hypothesis recorded. Reward reflects root-cause accuracy, service localization, confidence calibration, and next-action quality."
282
 
 
 
 
 
 
 
 
 
 
 
 
 
 
283
  def _rollback_deploy(self, service: str | None) -> tuple[bool, float, str]:
284
  assert service is not None
285
- if service != "worker":
286
- self._set_failure("wrong_remediation_target", "Rolling back a service without a causal link wastes time and risk.")
 
 
 
 
 
287
  return False, self._unsafe_penalty(), f"Rollback on {service} did not address the incident."
288
  if self._episode["cause_removed"]:
289
- return False, 0.0, "Worker deploy is already rolled back."
290
  self._episode["cause_removed"] = True
291
  self._episode["containment_applied"] = True
292
- self._episode["worker_version"] = "worker@2026.04.23-good"
293
- self._episode["services"]["worker"] = ServiceHealth(
294
- name="worker",
295
- status="healthy",
296
- cpu_pct=32.0,
297
- memory_pct=37.0,
298
- error_rate_pct=2.0,
299
- latency_ms=40.0,
300
- )
301
- self._episode["user_impact"] = min(self._episode["user_impact"], 0.55)
302
- self._episode["slo_burn_rate"] = min(self._episode["slo_burn_rate"], 0.58)
303
- return True, 0.0, "Rolled back the worker deploy; downstream load should now stabilize after dependent services recover."
304
 
305
  def _restart_service(self, service: str | None) -> tuple[bool, float, str]:
306
  assert service is not None
307
- if service == "database":
308
- if not self._episode["cause_removed"]:
309
- self._set_failure("premature_restart", "Restarting the database before removing the trigger only causes another crash loop.")
310
- return False, self._unsafe_penalty(), "Database restart failed because the worker is still driving overload."
311
- self._episode["services"]["database"] = ServiceHealth(
312
- name="database",
313
- status="healthy",
314
- cpu_pct=34.0,
315
- memory_pct=39.0,
316
- error_rate_pct=0.0,
317
- latency_ms=22.0,
318
  )
319
- self._episode["services"]["api-gateway"] = ServiceHealth(
320
- name="api-gateway",
321
- status="healthy",
322
- cpu_pct=28.0,
323
- memory_pct=31.0,
324
- error_rate_pct=0.0,
325
- latency_ms=38.0,
326
  )
327
- self._episode["user_impact"] = 0.14
328
- self._episode["slo_burn_rate"] = 0.18
329
- return True, 0.0, "Database restarted cleanly after the bad deploy was rolled back."
330
- self._set_failure("low_value_restart", f"Restarting {service} is not the safe next remediation step for this incident.")
331
- return False, self._unsafe_penalty() / 2, f"Restarting {service} had little or no positive effect."
 
 
332
 
333
  def _isolate_service(self, service: str | None) -> tuple[bool, float, str]:
334
  assert service is not None
335
- if service != "worker":
336
- self._set_failure("wrong_isolation_target", f"Isolating {service} does not contain the dominant failure path.")
 
 
 
 
 
337
  return False, self._unsafe_penalty() / 2, f"Isolation of {service} did not materially reduce blast radius."
338
- if self._episode["worker_isolated"]:
339
- return False, 0.0, "Worker is already isolated."
340
- self._episode["worker_isolated"] = True
341
  self._episode["containment_applied"] = True
342
- self._episode["services"]["worker"] = ServiceHealth(
343
- name="worker",
344
- status="isolated",
345
- cpu_pct=8.0,
346
- memory_pct=18.0,
347
- error_rate_pct=0.0,
348
- latency_ms=0.0,
349
- )
350
- self._episode["services"]["database"] = ServiceHealth(
351
- name="database",
352
- status="healthy",
353
- cpu_pct=41.0,
354
- memory_pct=46.0,
355
- error_rate_pct=0.0,
356
- latency_ms=26.0,
357
- )
358
- self._episode["services"]["api-gateway"] = ServiceHealth(
359
- name="api-gateway",
360
- status="degraded",
361
- cpu_pct=34.0,
362
- memory_pct=33.0,
363
- error_rate_pct=7.0,
364
- latency_ms=91.0,
365
- )
366
- self._episode["user_impact"] = 0.45
367
- self._episode["slo_burn_rate"] = 0.47
368
- return True, 0.0, "Worker isolated. Blast radius shrank, but end-to-end service remains degraded until the worker path is restored safely."
369
 
370
  def _run_check(self, check_name: str | None) -> tuple[str, bool, str]:
371
  assert check_name is not None
 
 
 
 
372
  if check_name == "database_recovery":
373
- passed = self._episode["services"]["database"].status == "healthy" and self._episode["cause_removed"]
 
 
 
 
 
374
  detail = (
375
- "Database is healthy and no longer crashing."
376
  if passed
377
  else "Database is still unstable or the triggering cause is still present."
378
  )
379
  else:
 
 
 
380
  passed = (
381
- self._episode["services"]["database"].status == "healthy"
382
- and self._episode["services"]["api-gateway"].status == "healthy"
383
- and self._episode["cause_removed"]
384
- and not self._episode["worker_isolated"]
 
385
  )
386
  detail = (
387
  "End-to-end login traffic is healthy."
@@ -394,7 +430,8 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
394
 
395
  def _declare_resolved(self) -> tuple[bool, float, float, str]:
396
  checks = self._episode["checks"]
397
- safe_to_resolve = checks["database_recovery"].passed and checks["end_to_end"].passed
 
398
  if not safe_to_resolve:
399
  self._set_failure("premature_resolution", "The incident is not verified as resolved yet.")
400
  return False, self._episode["scenario"]["reward_config"]["premature_resolution_penalty"], 0.0, "Resolution declaration rejected: required checks have not passed."
@@ -417,34 +454,14 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
417
  self._episode["why_failed"] = why_failed
418
 
419
  def _advance_world(self) -> None:
420
- if not self._episode["cause_removed"] and not self._episode["worker_isolated"]:
421
- self._episode["services"]["worker"] = ServiceHealth(
422
- name="worker",
423
- status="degraded",
424
- cpu_pct=88.0,
425
- memory_pct=71.0,
426
- error_rate_pct=19.0,
427
- latency_ms=420.0,
428
- )
429
- self._episode["services"]["database"] = ServiceHealth(
430
- name="database",
431
- status="crashed",
432
- cpu_pct=99.0,
433
- memory_pct=97.0,
434
- error_rate_pct=100.0,
435
- latency_ms=0.0,
436
- )
437
- self._episode["services"]["api-gateway"] = ServiceHealth(
438
- name="api-gateway",
439
- status="degraded",
440
- cpu_pct=61.0,
441
- memory_pct=38.0,
442
- error_rate_pct=24.0,
443
- latency_ms=640.0,
444
- )
445
- self._episode["user_impact"] = max(self._episode["user_impact"], 0.82)
446
- self._episode["slo_burn_rate"] = max(self._episode["slo_burn_rate"], 0.91)
447
- if self._episode["worker_isolated"] and not self._episode["cause_removed"]:
448
  self._episode["containment_applied"] = True
449
  self._episode["workflow_stage"] = self._workflow_stage()
450
 
@@ -480,7 +497,7 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
480
  checks = self._episode["checks"]
481
  if checks["database_recovery"].passed or checks["end_to_end"].passed:
482
  return "validation"
483
- if self._episode["containment_applied"] or self._episode["cause_removed"] or self._episode["worker_isolated"]:
484
  return "mitigation"
485
  return "triage"
486
 
@@ -498,12 +515,16 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
498
  "database_recovery": checks["database_recovery"].passed,
499
  "end_to_end": checks["end_to_end"].passed,
500
  "incident_resolved": self._episode["incident_resolved"],
 
501
  }
502
 
503
  def _incident_summary(self) -> str:
 
 
 
504
  return (
505
- "Gateway login traffic is failing because the worker is overloading the database after a recent worker deploy. "
506
- "Use evidence-gathering actions to diagnose, then choose a safe remediation and verify with explicit checks."
507
  )
508
 
509
  def _prompt_text(self, tool_output: str | None) -> str:
@@ -520,6 +541,10 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
520
  lines.extend(f"- [{alert.severity.upper()}] {alert.service}: {alert.message}" for alert in self._episode["alerts"])
521
  else:
522
  lines.append("- none")
 
 
 
 
523
  lines.extend([
524
  "",
525
  "SERVICES:",
@@ -568,6 +593,7 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
568
  "max_ticks": self._episode["max_ticks"],
569
  "workflow_stage": self._episode["workflow_stage"],
570
  "active_alerts": [alert.model_dump() for alert in self._episode["alerts"]],
 
571
  "service_health": {name: service.model_dump() for name, service in self._episode["services"].items()},
572
  "discovered_evidence": list(self._episode["discovered_evidence"]),
573
  "recent_deploys": list(self._episode["recent_deploys"]),
@@ -584,6 +610,8 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
584
  "score_breakdown": dict(self._episode["score_breakdown"]),
585
  "cumulative_reward": self._episode["cumulative_reward"],
586
  "wasteful_ticks": self._episode["wasteful_ticks"],
 
 
587
  "last_action_result": self._episode["last_action_result"],
588
  "failure_type": self._episode["failure_type"],
589
  "why_failed": self._episode["why_failed"],
@@ -598,6 +626,7 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
598
  difficulty=self._episode["difficulty"],
599
  workflow_stage=self._episode["workflow_stage"],
600
  active_alerts=list(self._episode["alerts"]),
 
601
  service_health=dict(self._episode["services"]),
602
  discovered_evidence=list(self._episode["discovered_evidence"]),
603
  recent_deploys=list(self._episode["recent_deploys"]),
@@ -625,4 +654,6 @@ class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncid
625
  score_breakdown=dict(self._episode["score_breakdown"]),
626
  reward=round(reward, 4),
627
  done=done,
 
 
628
  )
 
  class UnifiedIncidentEnvironment(Environment[UnifiedIncidentAction, UnifiedIncidentObservation, UnifiedIncidentState]):
      """A bounded-action incident diagnosis and safe remediation environment."""

+     SUPPORTS_CONCURRENT_SESSIONS = True

      def __init__(self) -> None:
          super().__init__()

          )

      def reset(self, seed: int | None = None, episode_id: str | None = None, **kwargs: Any) -> UnifiedIncidentObservation:
          scenario_id = kwargs.get("scenario_id")
          difficulty = kwargs.get("difficulty")
          if scenario_id:
              scenario = get_scenario(scenario_id)
          elif difficulty:
+             scenario = scenario_for_difficulty(difficulty, seed=seed)
          else:
              scenario = get_scenario(DEFAULT_SCENARIO_ID)
          self._episode = self._make_episode(scenario, episode_id=episode_id)

              "database_recovery": CheckResult(name="database_recovery", passed=False, detail="Database recovery has not been verified yet."),
              "end_to_end": CheckResult(name="end_to_end", passed=False, detail="End-to-end health has not been verified yet."),
          }
+         recipe = scenario.get("remediation_recipe", {})
+         rollback_target = recipe.get("rollback_target", "worker")
+         recent_deploy_service = rollback_target if rollback_target in scenario["deploy_history"] else "worker"
+         knobs = scenario.get("difficulty_knobs", {})
+         noise_alerts = [Alert(**payload) for payload in knobs.get("noise_alerts", [])]
          return {
              "episode_id": episode_id or str(uuid.uuid4()),
              "scenario": scenario,

              "difficulty": scenario["difficulty"],
              "services": services,
              "alerts": [Alert(**payload) for payload in scenario["initial_alerts"]],
+             "noise_alerts": noise_alerts,
              "discovered_evidence": [],
              "evidence_seen": set(),
+             "recent_deploys": [scenario["deploy_history"].get(recent_deploy_service, "")],
              "checks": checks,
+             "user_impact": scenario.get("degraded_user_impact", 0.82),
+             "slo_burn_rate": scenario.get("degraded_slo_burn", 0.91),
              "containment_applied": False,
              "cause_removed": False,
+             "isolated_service": None,
              "hypothesis_seen": set(),
              "failure_type": None,
              "why_failed": None,

              "workflow_stage": "triage",
              "cumulative_reward": 0.0,
              "wasteful_ticks": 0,
+             "blast_radius": 0,
+             "noise_queries": 0,
              "score_breakdown": {
                  "recovery_score": 0.0,
                  "containment_score": 0.0,
                  "verification_score": 0.0,
                  "impact_score": 0.0,
+                 "efficiency_score": 0.05,
+                 "speed_bonus": 0.0,
+                 "noise_handling_score": 0.05 if knobs.get("noise_services") else 0.0,
                  "final_score": 0.10,
              },
              "final_score": 0.10,

              "done": False,
          }

+     def _noise_knobs(self) -> dict[str, Any]:
+         return self._episode["scenario"].get("difficulty_knobs", {})
+
+     def _is_noise_service(self, service: str) -> bool:
+         return service in set(self._noise_knobs().get("noise_services", []))
+
+     def _record_noise_query(self, service: str) -> None:
+         if self._is_noise_service(service):
+             self._episode["noise_queries"] = self._episode.get("noise_queries", 0) + 1
+
      def _query_logs(self, service: str | None) -> str:
          assert service is not None
+         if self._is_noise_service(service):
+             self._record_noise_query(service)
+             noise_logs = self._noise_knobs().get("noise_logs", {})
+             detail = noise_logs.get(service, f"{service} logs show no incident-correlated regression.")
+             return f"{service}: {detail}"
          return self._episode["scenario"]["logs"][service]

      def _query_metrics(self, service: str | None, metric: str | None) -> str:
          assert service is not None and metric is not None
+         if self._is_noise_service(service):
+             self._record_noise_query(service)
+             return f"{service} {metric} metrics are within ordinary background variance and unrelated to the active incident."
          return self._episode["scenario"]["metrics"][service][metric]

      def _query_dependencies(self, service: str | None) -> str:
          assert service is not None
+         if self._is_noise_service(service):
+             self._record_noise_query(service)
+             return f"{service} is off the primary user-impact path and is not driving the incident."
          return self._episode["scenario"]["dependencies"][service]

      def _query_deploys(self, service: str | None) -> str:
          assert service is not None
+         if self._is_noise_service(service):
+             self._record_noise_query(service)
+             return f"No recent {service} deploy correlates with the active incident timeline."
          return self._episode["scenario"]["deploy_history"][service]
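Review note: each query handler above short-circuits for noise services and bumps a counter before returning a canned reply. A minimal sketch of that pattern for the logs path follows; the class `NoiseAwareQueries`, its constructor arguments, and the sample service names are hypothetical and only approximate how the environment stores this state in `self._episode`.

```python
class NoiseAwareQueries:
    """Hypothetical stand-in for the environment's noise-aware log query path."""

    def __init__(self, logs: dict[str, str], noise_services: list[str], noise_logs: dict[str, str]) -> None:
        self.logs = logs
        self.noise_services = set(noise_services)
        self.noise_logs = noise_logs
        self.noise_queries = 0

    def query_logs(self, service: str) -> str:
        if service in self.noise_services:
            # Count the distractor query, then answer without touching real logs.
            self.noise_queries += 1
            detail = self.noise_logs.get(service, f"{service} logs show no incident-correlated regression.")
            return f"{service}: {detail}"
        return self.logs[service]


env = NoiseAwareQueries(
    logs={"database": "pool acquire timeout"},  # assumed log text
    noise_services=["search"],  # assumed noise-service name
    noise_logs={},
)
print(env.query_logs("database"))
print(env.query_logs("search"))
print(env.noise_queries)
```

Only queries against a noise service move the counter, which is presumably what the grader's `noise_handling_score` consumes; that link is not visible in this part of the diff.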

      def _submit_hypothesis(self, action: UnifiedIncidentAction) -> tuple[float, bool, str]:

          reward = (0.04 * cause_match) + (0.03 * service_match) + (0.03 * action_quality) + (0.02 * calibration)
          return round(reward, 4), True, "Hypothesis recorded. Reward reflects root-cause accuracy, service localization, confidence calibration, and next-action quality."

+     def _recipe(self) -> dict[str, Any]:
+         return self._episode["scenario"].get("remediation_recipe", {})
+
+     def _failure_message(self, key: str, default: str) -> str:
+         return self._episode["scenario"].get("failure_messages", {}).get(key, default)
+
+     def _apply_service_updates(self, updates: dict[str, dict[str, Any]]) -> None:
+         for name, payload in updates.items():
+             self._episode["services"][name] = ServiceHealth(name=name, **payload)
+
+     def _bump_blast_radius(self) -> None:
+         self._episode["blast_radius"] = self._episode.get("blast_radius", 0) + 1
+
      def _rollback_deploy(self, service: str | None) -> tuple[bool, float, str]:
          assert service is not None
+         recipe = self._recipe()
+         rollback_target = recipe.get("rollback_target")
+         if rollback_target is None or service != rollback_target:
+             self._set_failure(
+                 "wrong_remediation_target",
+                 self._failure_message("wrong_rollback_target", "Rolling back a service without a causal link wastes time and risk."),
+             )
              return False, self._unsafe_penalty(), f"Rollback on {service} did not address the incident."
          if self._episode["cause_removed"]:
+             return False, 0.0, f"{rollback_target} deploy is already rolled back."
          self._episode["cause_removed"] = True
          self._episode["containment_applied"] = True
+         self._bump_blast_radius()
+         self._apply_service_updates(self._episode["scenario"].get("post_rollback_services", {}))
+         scenario = self._episode["scenario"]
+         self._episode["user_impact"] = min(self._episode["user_impact"], scenario.get("post_rollback_user_impact", self._episode["user_impact"]))
+         self._episode["slo_burn_rate"] = min(self._episode["slo_burn_rate"], scenario.get("post_rollback_slo_burn", self._episode["slo_burn_rate"]))
+         return True, 0.0, f"Rolled back the {rollback_target} deploy; the underlying cause is removed."

      def _restart_service(self, service: str | None) -> tuple[bool, float, str]:
          assert service is not None
+         recipe = self._recipe()
+         restart_target = recipe.get("restart_target")
+         if restart_target is None or service != restart_target:
+             self._set_failure(
+                 "low_value_restart",
+                 self._failure_message("low_value_restart", f"Restarting {service} is not the safe next remediation step for this incident."),
              )
+             return False, self._unsafe_penalty() / 2, f"Restarting {service} had little or no positive effect."
+         if recipe.get("restart_requires_cause_removed", True) and not self._episode["cause_removed"]:
+             self._set_failure(
+                 "premature_restart",
+                 self._failure_message("premature_restart", f"Restarting {service} before removing the trigger only causes another failure."),
              )
+             return False, self._unsafe_penalty(), f"Restart of {service} failed because the triggering cause is still present."
+         self._bump_blast_radius()
+         self._apply_service_updates(self._episode["scenario"].get("post_restart_services", {}))
+         scenario = self._episode["scenario"]
+         self._episode["user_impact"] = scenario.get("post_restart_user_impact", self._episode["user_impact"])
+         self._episode["slo_burn_rate"] = scenario.get("post_restart_slo_burn", self._episode["slo_burn_rate"])
+         return True, 0.0, f"{service} restarted cleanly after the triggering cause was removed."

      def _isolate_service(self, service: str | None) -> tuple[bool, float, str]:
          assert service is not None
+         recipe = self._recipe()
+         isolate_target = recipe.get("isolate_target")
+         if isolate_target is None or service != isolate_target:
+             self._set_failure(
+                 "wrong_isolation_target",
+                 self._failure_message("wrong_isolation_target", f"Isolating {service} does not contain the dominant failure path."),
+             )
              return False, self._unsafe_penalty() / 2, f"Isolation of {service} did not materially reduce blast radius."
+         if self._episode["isolated_service"] == isolate_target:
+             return False, 0.0, f"{isolate_target} is already isolated."
+         self._episode["isolated_service"] = isolate_target
          self._episode["containment_applied"] = True
+         self._bump_blast_radius()
+         self._apply_service_updates(self._episode["scenario"].get("post_isolate_services", {}))
+         scenario = self._episode["scenario"]
+         self._episode["user_impact"] = scenario.get("post_isolate_user_impact", self._episode["user_impact"])
+         self._episode["slo_burn_rate"] = scenario.get("post_isolate_slo_burn", self._episode["slo_burn_rate"])
+         return True, 0.0, f"{isolate_target} isolated. Blast radius shrank, but full resolution still requires addressing the root cause."
392
 
393
  def _run_check(self, check_name: str | None) -> tuple[str, bool, str]:
394
  assert check_name is not None
395
+ recipe = self._recipe()
396
+ isolated = self._episode["isolated_service"]
397
+ cause_removed = self._episode["cause_removed"]
398
+ services = self._episode["services"]
399
  if check_name == "database_recovery":
400
+ db_healthy = services["database"].status == "healthy"
401
+ incident_driver = recipe.get("incident_driver")
402
+ if incident_driver in {"worker", "database"}:
403
+ passed = db_healthy and cause_removed
404
+ else:
405
+ passed = db_healthy
406
  detail = (
407
+ "Database is healthy and no longer failing."
408
  if passed
409
  else "Database is still unstable or the triggering cause is still present."
410
  )
411
  else:
412
+ gateway_healthy = services["api-gateway"].status == "healthy"
413
+ db_healthy = services["database"].status == "healthy"
414
+ worker_healthy = services["worker"].status == "healthy"
415
  passed = (
416
+ gateway_healthy
417
+ and db_healthy
418
+ and worker_healthy
419
+ and cause_removed
420
+ and isolated is None
421
  )
422
  detail = (
423
  "End-to-end login traffic is healthy."
 
430
 
431
  def _declare_resolved(self) -> tuple[bool, float, float, str]:
432
  checks = self._episode["checks"]
433
+ resolution_check = self._recipe().get("resolution_check", "end_to_end")
434
+ safe_to_resolve = bool(checks.get(resolution_check) and checks[resolution_check].passed)
435
  if not safe_to_resolve:
436
  self._set_failure("premature_resolution", "The incident is not verified as resolved yet.")
437
  return False, self._episode["scenario"]["reward_config"]["premature_resolution_penalty"], 0.0, "Resolution declaration rejected: required checks have not passed."
 
454
  self._episode["why_failed"] = why_failed
455
 
456
  def _advance_world(self) -> None:
457
+ cause_removed = self._episode["cause_removed"]
458
+ isolated = self._episode["isolated_service"]
459
+ if not cause_removed and isolated is None:
460
+ self._apply_service_updates(self._episode["scenario"].get("degraded_services", {}))
461
+ scenario = self._episode["scenario"]
462
+ self._episode["user_impact"] = max(self._episode["user_impact"], scenario.get("degraded_user_impact", self._episode["user_impact"]))
463
+ self._episode["slo_burn_rate"] = max(self._episode["slo_burn_rate"], scenario.get("degraded_slo_burn", self._episode["slo_burn_rate"]))
464
+ if isolated is not None and not cause_removed:
465
  self._episode["containment_applied"] = True
466
  self._episode["workflow_stage"] = self._workflow_stage()
467
 
 
497
  checks = self._episode["checks"]
498
  if checks["database_recovery"].passed or checks["end_to_end"].passed:
499
  return "validation"
500
+ if self._episode["containment_applied"] or self._episode["cause_removed"] or self._episode["isolated_service"] is not None:
501
  return "mitigation"
502
  return "triage"
503
 
 
515
  "database_recovery": checks["database_recovery"].passed,
516
  "end_to_end": checks["end_to_end"].passed,
517
  "incident_resolved": self._episode["incident_resolved"],
518
+ "isolation_applied": self._episode["isolated_service"] is not None,
519
  }
520
 
521
  def _incident_summary(self) -> str:
522
+ description = self._episode["scenario"].get("description")
523
+ if description:
524
+ return description
525
  return (
526
+ "An incident is degrading user traffic. Use evidence-gathering actions to diagnose, "
527
+ "then choose a safe remediation and verify with explicit checks."
528
  )
529
 
530
  def _prompt_text(self, tool_output: str | None) -> str:
 
541
  lines.extend(f"- [{alert.severity.upper()}] {alert.service}: {alert.message}" for alert in self._episode["alerts"])
542
  else:
543
  lines.append("- none")
544
+ noise = self._episode.get("noise_alerts", [])
545
+ if noise:
546
+ lines.extend(["", "NOISE_ALERTS (historically unrelated — resist querying these):"])
547
+ lines.extend(f"- [{alert.severity.upper()}] {alert.service}: {alert.message}" for alert in noise)
548
  lines.extend([
549
  "",
550
  "SERVICES:",
 
593
  "max_ticks": self._episode["max_ticks"],
594
  "workflow_stage": self._episode["workflow_stage"],
595
  "active_alerts": [alert.model_dump() for alert in self._episode["alerts"]],
596
+ "noise_alerts": [alert.model_dump() for alert in self._episode.get("noise_alerts", [])],
597
  "service_health": {name: service.model_dump() for name, service in self._episode["services"].items()},
598
  "discovered_evidence": list(self._episode["discovered_evidence"]),
599
  "recent_deploys": list(self._episode["recent_deploys"]),
 
610
  "score_breakdown": dict(self._episode["score_breakdown"]),
611
  "cumulative_reward": self._episode["cumulative_reward"],
612
  "wasteful_ticks": self._episode["wasteful_ticks"],
613
+ "blast_radius": self._episode.get("blast_radius", 0),
614
+ "noise_queries": self._episode.get("noise_queries", 0),
615
  "last_action_result": self._episode["last_action_result"],
616
  "failure_type": self._episode["failure_type"],
617
  "why_failed": self._episode["why_failed"],
 
626
  difficulty=self._episode["difficulty"],
627
  workflow_stage=self._episode["workflow_stage"],
628
  active_alerts=list(self._episode["alerts"]),
629
+ noise_alerts=list(self._episode.get("noise_alerts", [])),
630
  service_health=dict(self._episode["services"]),
631
  discovered_evidence=list(self._episode["discovered_evidence"]),
632
  recent_deploys=list(self._episode["recent_deploys"]),
 
654
  score_breakdown=dict(self._episode["score_breakdown"]),
655
  reward=round(reward, 4),
656
  done=done,
657
+ blast_radius=int(self._episode.get("blast_radius", 0)),
658
+ noise_queries=int(self._episode.get("noise_queries", 0)),
659
  )
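The check gating that `_run_check` applies in the hunk above can be restated as two pure predicates. This is a standalone sketch for illustration, not the environment's own code: `EpisodeView`, `database_recovery_passed`, and `end_to_end_passed` are hypothetical names, and service status is reduced to a plain string.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class EpisodeView:
    """Hypothetical, flattened view of the episode fields the checks read."""
    statuses: dict[str, str]        # service name -> status string
    cause_removed: bool             # True once the triggering cause is rolled back
    isolated_service: str | None    # service currently isolated, if any


def database_recovery_passed(ep: EpisodeView, incident_driver: str | None) -> bool:
    """The database must be healthy; when the incident is driven by the worker
    or the database itself, the cause must also have been removed."""
    db_healthy = ep.statuses.get("database") == "healthy"
    if incident_driver in {"worker", "database"}:
        return db_healthy and ep.cause_removed
    return db_healthy


def end_to_end_passed(ep: EpisodeView) -> bool:
    """All three critical services healthy, cause removed, nothing isolated."""
    critical = ("api-gateway", "database", "worker")
    all_healthy = all(ep.statuses.get(name) == "healthy" for name in critical)
    return all_healthy and ep.cause_removed and ep.isolated_service is None
```

Under this reading, isolation alone can never satisfy `end_to_end`: the check fails while any service is still isolated, which matches the "full resolution still requires addressing the root cause" message returned by `_isolate_service`.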
unified_incident_env/server/grader.py CHANGED
@@ -24,7 +24,23 @@ def _service_score(status: str) -> float:
24
 
25
 
26
  class UnifiedIncidentGrader:
27
- """Deterministic scorer focused on executed effects, not scripted clues."""
28
 
29
  def compute_breakdown(
30
  self,
@@ -33,33 +49,64 @@ class UnifiedIncidentGrader:
33
  ) -> dict[str, float]:
34
  services = state.get("service_health", {})
35
  weights = scenario["critical_service_weights"]
36
- recovery_score = round(
37
- sum(
38
- weights.get(service, 0.0) * _service_score((services.get(service) or {}).get("status", "crashed"))
39
- for service in weights
40
- ),
41
- 4,
42
  )
 
43
 
44
- containment_score = 0.2 if state.get("containment_applied") else 0.0
45
- if state.get("containment_applied") and (services.get("worker") or {}).get("status") == "healthy":
46
- containment_score = 0.3
47
 
48
  checks = {item.get("name"): bool(item.get("passed")) for item in state.get("checks", [])}
49
  verification_score = 0.0
50
  if checks.get("database_recovery"):
51
- verification_score += 0.15
52
  if checks.get("end_to_end"):
53
- verification_score += 0.2
54
 
55
  user_impact = float(state.get("user_impact", 1.0))
56
- impact_score = round(max(0.0, 0.15 * (1.0 - user_impact)), 4)
57
 
58
  wasteful_ticks = int(state.get("wasteful_ticks", 0))
59
- efficiency_score = round(max(0.0, 0.10 - (0.01 * wasteful_ticks)), 4)
60
 
61
  final_score = _strict_public_score(
62
- recovery_score + containment_score + verification_score + impact_score + efficiency_score
63
  )
64
 
65
  return {
@@ -68,6 +115,8 @@ class UnifiedIncidentGrader:
68
  "verification_score": round(verification_score, 4),
69
  "impact_score": impact_score,
70
  "efficiency_score": efficiency_score,
 
 
71
  "final_score": final_score,
72
  }
73
 
@@ -88,7 +137,7 @@ class UnifiedIncidentGrader:
88
  if state.get("containment_applied")
89
  else "The root cause is still active or only partially contained."
90
  ),
91
- weight=0.30,
92
  ),
93
  GraderCheck(
94
  name="database_recovery",
@@ -98,7 +147,7 @@ class UnifiedIncidentGrader:
98
  if checks.get("database_recovery")
99
  else "The database recovery check has not passed yet."
100
  ),
101
- weight=0.20,
102
  ),
103
  GraderCheck(
104
  name="end_to_end_check",
@@ -112,10 +161,10 @@ class UnifiedIncidentGrader:
112
  ),
113
  GraderCheck(
114
  name="critical_services_recovered",
115
- passed=breakdown["recovery_score"] >= 0.8,
116
  detail=(
117
  "Critical-path services are recovered."
118
- if breakdown["recovery_score"] >= 0.8
119
  else "Critical-path services are still degraded or crashed."
120
  ),
121
  weight=0.20,
@@ -130,6 +179,26 @@ class UnifiedIncidentGrader:
130
  ),
131
  weight=0.10,
132
  ),
133
  ]
134
  return GraderReport(
135
  scenario_id=scenario["id"],
 
24
 
25
 
26
  class UnifiedIncidentGrader:
27
+ """Deterministic scorer focused on executed effects, not scripted clues.
28
+
29
+ Hardened schedule (post Track-A headroom patch):
30
+
31
+ - recovery 0.00 – 0.25
32
+ - containment 0.00 – 0.15
33
+ - verification 0.00 – 0.20
34
+ - impact 0.00 – 0.05
35
+ - efficiency 0.00 – 0.05
36
+ - speed_bonus 0.00 – 0.10 (positive only when faster than optimal)
37
+ - noise_handling 0.00 – 0.05 (penalizes querying noise services)
38
+
39
+ Scripted deterministic baseline (which matches optimal_ticks exactly and
40
+ avoids noise queries) caps at ~0.74. Headroom 0.74 → 0.85 is reachable only
41
+ by an agent that (a) is strictly faster than optimal and (b) touches zero
42
+ noise services. That's the training target.
43
+ """
44
 
45
  def compute_breakdown(
46
  self,
 
49
  ) -> dict[str, float]:
50
  services = state.get("service_health", {})
51
  weights = scenario["critical_service_weights"]
52
+ recovery_raw = sum(
53
+ weights.get(service, 0.0) * _service_score((services.get(service) or {}).get("status", "crashed"))
54
+ for service in weights
 
 
 
55
  )
56
+ recovery_score = round(0.25 * recovery_raw, 4)
57
 
58
+ contained = bool(state.get("containment_applied"))
59
+ rollback_target = scenario.get("remediation_recipe", {}).get("rollback_target")
60
+ rollback_service_healthy = bool(
61
+ rollback_target and (services.get(rollback_target) or {}).get("status") == "healthy"
62
+ )
63
+ if contained and rollback_service_healthy:
64
+ containment_score = 0.15
65
+ elif contained:
66
+ containment_score = 0.10
67
+ else:
68
+ containment_score = 0.0
69
 
70
  checks = {item.get("name"): bool(item.get("passed")) for item in state.get("checks", [])}
71
  verification_score = 0.0
72
  if checks.get("database_recovery"):
73
+ verification_score += 0.08
74
  if checks.get("end_to_end"):
75
+ verification_score += 0.12
76
 
77
  user_impact = float(state.get("user_impact", 1.0))
78
+ impact_score = round(max(0.0, 0.05 * (1.0 - user_impact)), 4)
79
 
80
  wasteful_ticks = int(state.get("wasteful_ticks", 0))
81
+ efficiency_score = round(max(0.0, 0.05 - (0.005 * wasteful_ticks)), 4)
82
+
83
+ # speed_bonus: scales linearly with ticks saved versus optimal_ticks; zero at or above optimal.
84
+ optimal_ticks = int(scenario.get("optimal_ticks", 10))
85
+ current_tick = int(state.get("current_tick", 0))
86
+ incident_resolved = bool(state.get("incident_resolved"))
87
+ if incident_resolved and current_tick > 0 and current_tick < optimal_ticks:
88
+ speed_bonus = round(0.10 * (optimal_ticks - current_tick) / optimal_ticks, 4)
89
+ elif incident_resolved and current_tick == optimal_ticks:
90
+ speed_bonus = 0.0
91
+ else:
92
+ speed_bonus = 0.0
93
+
94
+ # noise_handling: deduct per query against a noise service, up to the cap of 0.05.
95
+ noise_services = set(scenario.get("difficulty_knobs", {}).get("noise_services", []))
96
+ noise_queries = int(state.get("noise_queries", 0))
97
+ if noise_services:
98
+ noise_handling_score = round(max(0.0, 0.05 - 0.015 * noise_queries), 4)
99
+ else:
100
+ noise_handling_score = 0.0
101
 
102
  final_score = _strict_public_score(
103
+ recovery_score
104
+ + containment_score
105
+ + verification_score
106
+ + impact_score
107
+ + efficiency_score
108
+ + speed_bonus
109
+ + noise_handling_score
110
  )
111
 
112
  return {
 
115
  "verification_score": round(verification_score, 4),
116
  "impact_score": impact_score,
117
  "efficiency_score": efficiency_score,
118
+ "speed_bonus": speed_bonus,
119
+ "noise_handling_score": noise_handling_score,
120
  "final_score": final_score,
121
  }
122
 
 
137
  if state.get("containment_applied")
138
  else "The root cause is still active or only partially contained."
139
  ),
140
+ weight=0.20,
141
  ),
142
  GraderCheck(
143
  name="database_recovery",
 
147
  if checks.get("database_recovery")
148
  else "The database recovery check has not passed yet."
149
  ),
150
+ weight=0.15,
151
  ),
152
  GraderCheck(
153
  name="end_to_end_check",
 
161
  ),
162
  GraderCheck(
163
  name="critical_services_recovered",
164
+ passed=breakdown["recovery_score"] >= 0.20,
165
  detail=(
166
  "Critical-path services are recovered."
167
+ if breakdown["recovery_score"] >= 0.20
168
  else "Critical-path services are still degraded or crashed."
169
  ),
170
  weight=0.20,
 
179
  ),
180
  weight=0.10,
181
  ),
182
+ GraderCheck(
183
+ name="speed_bonus_earned",
184
+ passed=breakdown.get("speed_bonus", 0.0) > 0.0,
185
+ detail=(
186
+ "Resolved faster than optimal_ticks."
187
+ if breakdown.get("speed_bonus", 0.0) > 0.0
188
+ else "Did not beat optimal tick budget."
189
+ ),
190
+ weight=0.10,
191
+ ),
192
+ GraderCheck(
193
+ name="noise_handling",
194
+ passed=breakdown.get("noise_handling_score", 0.0) >= 0.035,
195
+ detail=(
196
+ "Minimal or no queries against noise services."
197
+ if breakdown.get("noise_handling_score", 0.0) >= 0.035
198
+ else "Wasted queries on noise services."
199
+ ),
200
+ weight=0.05,
201
+ ),
202
  ]
203
  return GraderReport(
204
  scenario_id=scenario["id"],
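The two new grader dimensions are simple enough to check by hand. The sketch below restates the `speed_bonus` and `noise_handling_score` formulas from the hunk above as standalone functions and sums the per-dimension caps listed in the class docstring. The function signatures and the `CAPS` table are illustrative assumptions, not part of the repository.

```python
def speed_bonus(current_tick: int, optimal_ticks: int, resolved: bool) -> float:
    """Linear bonus for resolving in fewer ticks than optimal_ticks; zero otherwise."""
    if resolved and 0 < current_tick < optimal_ticks:
        return round(0.10 * (optimal_ticks - current_tick) / optimal_ticks, 4)
    return 0.0


def noise_handling_score(noise_queries: int, has_noise_services: bool) -> float:
    """Start at 0.05 and lose 0.015 per query against a noise service, floored at 0."""
    if not has_noise_services:
        return 0.0
    return round(max(0.0, 0.05 - 0.015 * noise_queries), 4)


# Per-dimension caps as listed in the UnifiedIncidentGrader docstring.
CAPS = {
    "recovery": 0.25,
    "containment": 0.15,
    "verification": 0.20,
    "impact": 0.05,
    "efficiency": 0.05,
    "speed_bonus": 0.10,
    "noise_handling": 0.05,
}

# Upper bound over all dimensions, and the bound for a run that earns no speed bonus.
max_total = round(sum(CAPS.values()), 4)
ceiling_without_speed = round(max_total - CAPS["speed_bonus"], 4)
```

With these caps, a run that matches `optimal_ticks` exactly cannot exceed 0.75 before any impact or efficiency losses, which is consistent with the ~0.74 baseline ceiling reported in the commit message. A 5-tick solve against an 8-tick optimum would earn `0.10 * 3 / 8 = 0.0375`. Because the bonus is only defined for `current_tick > 0`, the full 0.10 is never actually reached.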
unified_incident_env/tests/test_environment.py CHANGED
@@ -6,7 +6,7 @@ from fastapi.testclient import TestClient
6
 
7
  from unified_incident_env.models import HypothesisPayload, UnifiedIncidentAction
8
  from unified_incident_env.server import app as app_module
9
- from unified_incident_env.server.challenge import DEFAULT_SCENARIO_ID, list_baselines
10
  from unified_incident_env.server.environment import UnifiedIncidentEnvironment
11
 
12
 
@@ -27,7 +27,7 @@ def test_baseline_resolves_honestly() -> None:
27
  checks = {check.name: check.passed for check in obs.checks}
28
  assert checks["database_recovery"] is True
29
  assert checks["end_to_end"] is True
30
- assert obs.final_score > 0.7
31
 
32
 
33
  def test_query_deploys_reveals_evidence_but_not_positive_reward() -> None:
@@ -114,12 +114,15 @@ def test_routes_expose_new_catalog_and_status(monkeypatch) -> None:
114
  assert tasks.status_code == 200
115
  payload = tasks.json()
116
  assert payload["default_scenario_id"] == DEFAULT_SCENARIO_ID
117
- assert len(payload["scenarios"]) == 1
 
 
118
 
119
  baseline = client.get("/baseline")
120
  assert baseline.status_code == 200
121
  baseline_payload = baseline.json()
122
- assert baseline_payload["baselines"][0]["scenario_id"] == DEFAULT_SCENARIO_ID
 
123
 
124
  health = client.get("/health")
125
  assert health.status_code == 200
@@ -130,3 +133,143 @@ def test_routes_expose_new_catalog_and_status(monkeypatch) -> None:
130
  status_payload = status.json()
131
  assert status_payload["progress"]["scenario_id"] == DEFAULT_SCENARIO_ID
132
  assert status_payload["grader"]["score"] > 0.0
 
6
 
7
  from unified_incident_env.models import HypothesisPayload, UnifiedIncidentAction
8
  from unified_incident_env.server import app as app_module
9
+ from unified_incident_env.server.challenge import DEFAULT_SCENARIO_ID, SCENARIOS, list_baselines, scenario_for_difficulty
10
  from unified_incident_env.server.environment import UnifiedIncidentEnvironment
11
 
12
 
 
27
  checks = {check.name: check.passed for check in obs.checks}
28
  assert checks["database_recovery"] is True
29
  assert checks["end_to_end"] is True
30
+ assert obs.final_score > 0.55
31
 
32
 
33
  def test_query_deploys_reveals_evidence_but_not_positive_reward() -> None:
 
114
  assert tasks.status_code == 200
115
  payload = tasks.json()
116
  assert payload["default_scenario_id"] == DEFAULT_SCENARIO_ID
117
+ scenarios_by_difficulty = {scenario["difficulty"] for scenario in payload["scenarios"]}
118
+ assert {"easy", "medium", "hard"}.issubset(scenarios_by_difficulty)
119
+ assert {"easy", "medium", "hard"}.issubset(set(payload["available_difficulties"]))
120
 
121
  baseline = client.get("/baseline")
122
  assert baseline.status_code == 200
123
  baseline_payload = baseline.json()
124
+ baseline_ids = {item["scenario_id"] for item in baseline_payload["baselines"]}
125
+ assert {"worker_deploy_cascade", "db_config_rollout", "gateway_auth_rollout"}.issubset(baseline_ids)
126
 
127
  health = client.get("/health")
128
  assert health.status_code == 200
 
133
  status_payload = status.json()
134
  assert status_payload["progress"]["scenario_id"] == DEFAULT_SCENARIO_ID
135
  assert status_payload["grader"]["score"] > 0.0
136
+
137
+
138
+ def _run_baseline_for_scenario(scenario_id: str):
139
+ env = UnifiedIncidentEnvironment()
140
+ env.reset(scenario_id=scenario_id)
141
+ last = None
142
+ for step in list_baselines(scenario_id).baselines[0].actions:
143
+ last = env.step(step.action)
144
+ return last
145
+
146
+
147
+ def test_medium_baseline_resolves_honestly() -> None:
148
+ obs = _run_baseline_for_scenario("db_config_rollout")
149
+ assert obs is not None
150
+ assert obs.done is True
151
+ assert obs.incident_resolved is True
152
+ checks = {check.name: check.passed for check in obs.checks}
153
+ assert checks["database_recovery"] is True
154
+ assert checks["end_to_end"] is True
155
+ assert obs.final_score > 0.55
156
+
157
+
158
+ def test_hard_baseline_resolves_honestly() -> None:
159
+ obs = _run_baseline_for_scenario("gateway_auth_rollout")
160
+ assert obs is not None
161
+ assert obs.done is True
162
+ assert obs.incident_resolved is True
163
+ checks = {check.name: check.passed for check in obs.checks}
164
+ assert checks["end_to_end"] is True
165
+ assert obs.final_score > 0.55
166
+
167
+
168
+ def test_medium_wrong_rollback_target_is_penalized() -> None:
169
+ env = UnifiedIncidentEnvironment()
170
+ env.reset(scenario_id="db_config_rollout")
171
+ obs = env.step(UnifiedIncidentAction(action_type="rollback_deploy", service="worker"))
172
+ assert obs.reward < 0.0
173
+ assert obs.failure_type == "wrong_remediation_target"
174
+ assert obs.incident_resolved is False
175
+
176
+
177
+ def test_hard_wrong_rollback_target_is_penalized() -> None:
178
+ env = UnifiedIncidentEnvironment()
179
+ env.reset(scenario_id="gateway_auth_rollout")
180
+ obs = env.step(UnifiedIncidentAction(action_type="rollback_deploy", service="worker"))
181
+ assert obs.reward < 0.0
182
+ assert obs.failure_type == "wrong_remediation_target"
183
+
184
+
185
+ def test_all_scenarios_expose_noise_alerts() -> None:
186
+ env = UnifiedIncidentEnvironment()
187
+ for scenario_id in ("worker_deploy_cascade", "db_config_rollout", "gateway_auth_rollout"):
188
+ obs = env.reset(scenario_id=scenario_id)
189
+ assert len(obs.noise_alerts) > 0, f"{scenario_id} should expose noise_alerts"
190
+ assert all(alert.message for alert in obs.noise_alerts)
191
+
192
+
193
+ def test_blast_radius_increments_on_mitigations() -> None:
194
+ env = UnifiedIncidentEnvironment()
195
+ env.reset(scenario_id="worker_deploy_cascade")
196
+ obs0 = env.step(UnifiedIncidentAction(action_type="query_logs", service="worker"))
197
+ assert obs0.blast_radius == 0
198
+ env.step(UnifiedIncidentAction(action_type="rollback_deploy", service="worker"))
199
+ obs2 = env.step(UnifiedIncidentAction(action_type="restart_service", service="database"))
200
+ assert obs2.blast_radius == 2
201
+
202
+
203
+ def test_baseline_ceiling_is_hardened_below_080() -> None:
204
+ """Scripted-optimal baseline must not score above ~0.80. Headroom left
205
+ for a trained agent that earns speed_bonus by finishing faster than
206
+ optimal_ticks."""
207
+ for scenario_id in ("worker_deploy_cascade", "db_config_rollout", "gateway_auth_rollout"):
208
+ obs = _run_baseline_for_scenario(scenario_id)
209
+ assert obs is not None
210
+ assert obs.final_score <= 0.80, f"{scenario_id} ceiling {obs.final_score} exceeds headroom budget"
211
+ assert obs.final_score >= 0.55, f"{scenario_id} ceiling {obs.final_score} is too low; env is unsolvable"
212
+
213
+
214
+ def test_speed_bonus_rewards_finishing_under_optimal_ticks() -> None:
215
+ """A faster solve that keeps both verification checks should beat the
216
+ baseline ceiling by the speed_bonus margin. This is the training target
217
+ — trained agents that skip verification to chase speed should score
218
+ *lower*, not higher."""
219
+ env = UnifiedIncidentEnvironment()
220
+ env.reset(scenario_id="gateway_auth_rollout")
221
+ # 5-step path: 1 query + 1 rollback + 2 checks + 1 declare. Baseline does 8.
222
+ env.step(UnifiedIncidentAction(action_type="query_deploys", service="api-gateway"))
223
+ env.step(UnifiedIncidentAction(action_type="rollback_deploy", service="api-gateway"))
224
+ env.step(UnifiedIncidentAction(action_type="run_check", check_name="end_to_end"))
225
+ env.step(UnifiedIncidentAction(action_type="run_check", check_name="database_recovery"))
226
+ obs = env.step(UnifiedIncidentAction(action_type="declare_resolved"))
227
+ assert obs.incident_resolved is True
228
+ assert obs.score_breakdown.get("speed_bonus", 0) > 0.0
229
+ assert obs.final_score > 0.74, f"Faster solve with full verification should beat baseline, got {obs.final_score}"
230
+
231
+
232
+ def test_hard_does_not_require_database_recovery_check() -> None:
233
+ env = UnifiedIncidentEnvironment()
234
+ env.reset(scenario_id="gateway_auth_rollout")
235
+ env.step(UnifiedIncidentAction(action_type="rollback_deploy", service="api-gateway"))
236
+ end_to_end = env.step(UnifiedIncidentAction(action_type="run_check", check_name="end_to_end"))
237
+ assert any(check.name == "end_to_end" and check.passed for check in end_to_end.checks)
238
+ resolved = env.step(UnifiedIncidentAction(action_type="declare_resolved"))
239
+ assert resolved.incident_resolved is True
240
+
241
+
242
+ def test_procgen_catalog_registers_variants_for_each_template() -> None:
243
+ procgen_ids = {scenario_id for scenario_id, scenario in SCENARIOS.items() if scenario.get("is_procgen")}
244
+ assert any(scenario_id.startswith("worker_deploy_cascade__p") for scenario_id in procgen_ids)
245
+ assert any(scenario_id.startswith("db_config_rollout__p") for scenario_id in procgen_ids)
246
+ assert any(scenario_id.startswith("gateway_auth_rollout__p") for scenario_id in procgen_ids)
247
+
248
+
249
+ def test_scenario_for_difficulty_seed_is_deterministic() -> None:
250
+ first = scenario_for_difficulty("medium", seed=7)
251
+ second = scenario_for_difficulty("medium", seed=7)
252
+ assert first["id"] == second["id"]
253
+ assert first["difficulty"] == "medium"
254
+
255
+
256
+ def test_procgen_variant_baseline_routes_through_template_builder() -> None:
257
+ scenario_id = next(
258
+ current_id
259
+ for current_id, scenario in SCENARIOS.items()
260
+ if scenario.get("is_procgen") and scenario.get("template_id") == "db_config_rollout"
261
+ )
262
+ obs = _run_baseline_for_scenario(scenario_id)
263
+ assert obs is not None
264
+ assert obs.incident_resolved is True
265
+ assert obs.final_score >= 0.55
266
+
267
+
268
+ def test_noise_service_queries_are_scored_as_noise() -> None:
269
+ env = UnifiedIncidentEnvironment()
270
+ obs = env.reset(scenario_id="gateway_auth_rollout__p01")
271
+ noise_service = obs.noise_alerts[0].service
272
+ noise_obs = env.step(UnifiedIncidentAction(action_type="query_logs", service=noise_service))
273
+ assert noise_obs.noise_queries == 1
274
+ assert noise_service in (noise_obs.tool_output or "")
275
+ assert noise_obs.score_breakdown["noise_handling_score"] < 0.05
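The procgen tests above rely on variants being reproducible from a seed. The actual generator in `challenge.py` is not shown in this diff, so the following is only a sketch of one way seeded variants could be produced. `make_variants`, `NOISE_POOL`, the service names in it, and the jitter range are all hypothetical; only the `__pNN` id suffix and the `is_procgen` / `template_id` keys come from the tests.

```python
import random

# Hypothetical noise-service pool; the real pool lives in the repository.
NOISE_POOL = ["cache", "search", "billing", "notifications"]


def make_variants(template_id: str, noise_pool: list[str], count: int = 4, base_seed: int = 0) -> list[dict]:
    """Build `count` jittered variants of one template, each from its own seeded RNG."""
    variants = []
    for index in range(1, count + 1):
        # A dedicated Random instance per variant keeps each variant reproducible
        # regardless of how many others are generated or in what order.
        rng = random.Random(base_seed * 1000 + index)
        variants.append(
            {
                "id": f"{template_id}__p{index:02d}",
                "template_id": template_id,
                "is_procgen": True,
                "metric_jitter": round(rng.uniform(-0.1, 0.1), 4),
                "noise_services": rng.sample(noise_pool, k=2),
            }
        )
    return variants
```

Calling `make_variants("db_config_rollout", NOISE_POOL)` twice returns identical lists, which is the property `test_scenario_for_difficulty_seed_is_deterministic` asserts for the real catalog.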
uv.lock DELETED
The diff for this file is too large to render. See raw diff