yxc20098 committed on
Commit cb15568 · 1 Parent(s): 4a5b0dd

Add handoff ablation (recover-from-deficit / capitalize-on-advantage)

A `prefix` controller plays the first K turns of an episode, then the
model inherits the live game state and finishes it. This is a pure STATE
handoff ("take over from here"): no transcript is carried over, so it
stays orthogonal to the in-context-learning axis.
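
The switch described above can be pictured with a toy wrapper. This is a minimal sketch, not the `HandoffController` from this commit: `SketchHandoff`, the string-valued actions, and the two toy policies are all hypothetical stand-ins, and the real controller works with the engine's `Command` objects.

```python
from typing import Callable, List

# Hypothetical policy type: takes the current state, returns action names.
Policy = Callable[[dict], List[str]]


class SketchHandoff:
    """Delegate to `prefix` for the first k turns, then to `main`."""

    def __init__(self, prefix: Policy, main: Policy, k: int) -> None:
        self.prefix = prefix
        self.main = main
        self.k = max(0, k)  # a negative k behaves like k=0
        self.turn = 0

    def act(self, state: dict) -> List[str]:
        # Only the live state is passed along; `main` never sees what
        # `prefix` did, which is the "pure state handoff" idea.
        policy = self.prefix if self.turn < self.k else self.main
        self.turn += 1
        return policy(state)


def stall(state: dict) -> List[str]:
    return ["observe"]


def attack(state: dict) -> List[str]:
    return ["attack"]


ctrl = SketchHandoff(stall, attack, k=2)
actions = [ctrl.act({"tick": t}) for t in range(4)]
print(actions)
```

With `k=0` the prefix is never consulted, which corresponds to the `base` cell where the model plays the whole episode.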

Sweeping the prefix quality separates two reasoning capabilities:
- a losing prefix (`stall`) hands the model a deficit: the recovery
  test;
- a winning prefix (a replayed winning trajectory) hands it an
  advantage: the capitalize test.

Every result carries a `passivity` stat: the fraction of the model's
turns spent on `observe`/`stop` only. Under the bad-prefix deficit
that is passivity-under-pressure, a number for the freeze-and-panic
failure mode (when losing, models tend to stop/observe instead of
ordering an active retreat or redirect).
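
As a worked illustration of the stat, the sketch below computes passivity over a list of turns, each turn being a list of tool names. The helper names `is_passive` and `passivity` are illustrative only; the commit itself tracks the ratio incrementally inside the controller and reads tool names off engine commands. Treating an empty turn as passive follows the `_is_passive` helper in this commit.

```python
from __future__ import annotations

# Tools that count as "doing nothing" for the passivity stat.
PASSIVE_TOOLS = {"observe", "stop"}


def is_passive(turn_tools: list[str]) -> bool:
    """True when the turn holds nothing but observe/stop (or is empty)."""
    return all(tool in PASSIVE_TOOLS for tool in turn_tools)


def passivity(turns: list[list[str]]) -> float:
    """Fraction of turns that were passive; 0.0 when there are no turns."""
    if not turns:
        return 0.0
    return sum(is_passive(t) for t in turns) / len(turns)


example = [
    ["observe"],                # passive
    ["move_units", "observe"],  # active: one real order is enough
    ["stop"],                   # passive
    [],                         # passive: no command at all
]
print(passivity(example))  # 0.75
```

A turn counts as active as soon as it contains one non-passive tool, so mixing `observe` with a real order does not raise the score.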

- handoff.py: TrajectoryController (replays a recorded run's per-turn
  tool_calls from messages.json), HandoffController (k-turn prefix →
  main switch + passivity tracking), run_handoff helper.
- run_eval.py: --handoff-sweep expands each pack:level into
  handoff-{base,bad,good} cells (`good` only when a bank is supplied);
  --handoff-k, --handoff-bank.
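
The cell naming can be sketched in isolation. The two helpers below, `expand_handoff_cells` and `split_cell`, are hypothetical names; they only mirror the string handling visible in this commit's `run_eval.py` (the `good` kind is emitted only when a bank is supplied, and a cell is split back with `rsplit(":handoff-", 1)` and `partition(":")`).

```python
from __future__ import annotations


def expand_handoff_cells(pack_id: str, levels: list[str], has_bank: bool) -> list[str]:
    """Expand each pack:level into its handoff cells."""
    kinds = ["base", "bad"] + (["good"] if has_bank else [])
    return [f"{pack_id}:{lv}:handoff-{kind}" for lv in levels for kind in kinds]


def split_cell(cell: str) -> tuple[str, str, str]:
    """Recover (pack_id, level, kind) from a handoff cell name."""
    base, kind = cell.rsplit(":handoff-", 1)
    pack_id, _, level = base.partition(":")
    return pack_id, level, kind


cells = expand_handoff_cells("p", ["easy", "hard"], has_bank=True)
print(cells)
print(split_cell(cells[2]))
```

Without a bank the same call yields only the `base` and `bad` cells per level, which is the case the sweep-wiring test in this commit pins.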

Pinned by tests/test_handoff_ablation.py.

CLAUDE.md CHANGED
@@ -228,6 +228,19 @@ A scenario is defective if any of the following hold:
  the full grid with `run_eval --perception-sweep` (expands every
  `pack:level` into `pack:level:<mode>`); the human Play tab stays
  on the canonical `vision` (fogged) modality.
+ - **Handoff ablation** (`openra_bench/handoff.py`, `run_eval
+ --handoff-sweep`). A `HandoffController` lets a `prefix` controller
+ play the first K turns, then the model inherits the live game state
+ ("take over from here" - a pure STATE handoff, no transcript). A
+ `stall` prefix hands the model a losing position (recovery test); a
+ replayed winning trajectory (`TrajectoryController`, sourced from a
+ `--handoff-bank` of Playback runs) hands it a winning one
+ (capitalize-on-advantage). Sweep cells are
+ `pack:level:handoff-{base,bad,good}`. Every result carries a
+ `passivity` stat - the fraction of the model's turns spent on
+ `observe`/`stop` only - the freeze-and-panic signal. A replayed
+ trajectory MUST come from the same `pack:level:seed` (engine actor
+ ids are seed-deterministic).
  - **`pbox` costs 600** (not the 400 some old specs assumed);
  defense and infantry are SEPARATE production queues so an
  efficient policy queues `build('pbox')` and `build('e1')` in
openra_bench/handoff.py ADDED
@@ -0,0 +1,165 @@
+ """Handoff ablation - hand a model a partially-played game.
+
+ A handoff episode is split: a `prefix` controller plays the first `k`
+ turns, then the model inherits the live game state and finishes it.
+ It is a PURE STATE handoff - the model gets no transcript of the
+ prefix, only the board it produced ("take over from here").
+
+ Sweeping the prefix QUALITY decomposes two capabilities:
+
+ * a **good** prefix (a winning trajectory) → can the model *capitalize
+   on an advantage*? A flat-low outcome curve means it derails even a
+   won position.
+ * a **bad** prefix (a losing trajectory, or `stall`) → can the model
+   *recover from a deficit*? This is the controlled measurement of the
+   freeze-and-panic failure mode: handed a losing board, does the model
+   fight (retreat / redirect) or sit on `observe`/`stop`? The
+   `passivity` stat on the result quantifies exactly that.
+
+ The prefix is a recorded run replayed turn-for-turn. Because engine
+ actor ids are seed-deterministic, a replayed trajectory MUST come from
+ the same `pack:level:seed` as the handoff episode.
+ """
+
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+ from typing import Any
+
+ from .controller import BaseController, as_controller, introspection_source
+
+ # A turn is "passive" when the model issued nothing but these - the
+ # freeze-and-panic tell (low-commitment default instead of an active
+ # retreat / redirect).
+ _PASSIVE_TOOLS = {"observe", "stop"}
+
+
+ def stall_policy(render_state: dict, Command: Any) -> list:
+     """The canonical losing prefix: do nothing, every turn. Synthesises
+     a guaranteed-deficit handoff with no recorded trajectory needed."""
+     return [Command.observe()]
+
+
+ def _load_trajectory(source: Any) -> list[list[dict]]:
+     """Per-turn tool-call lists from a recorded run. `source` may be a
+     ready list, a Playback directory, or a path to its messages.json."""
+     if isinstance(source, list):
+         return source
+     p = Path(source)
+     if p.is_dir():
+         p = p / "messages.json"
+     msgs = json.loads(p.read_text())
+     turns: list[list[dict]] = []
+     for m in msgs:
+         if m.get("role") != "assistant":
+             continue
+         calls: list[dict] = []
+         for tc in m.get("tool_calls") or []:
+             fn = tc.get("function") or {}
+             args = fn.get("arguments")
+             if isinstance(args, str):
+                 try:
+                     args = json.loads(args)
+                 except (ValueError, TypeError):
+                     args = {}
+             calls.append({"name": fn.get("name"), "arguments": args or {}})
+         turns.append(calls)
+     return turns
+
+
+ class TrajectoryController(BaseController):
+     """Replays a recorded run: turn N re-issues the commands the
+     recorded agent issued on its turn N. Past the recording's end it
+     falls back to `observe()`. Used as a deterministic handoff prefix -
+     a `win`-outcome run is a good prefix, a `loss` is a bad one."""
+
+     def __init__(self, source: Any, name: str | None = None) -> None:
+         super().__init__(name or "trajectory")
+         self._turns = _load_trajectory(source)
+         self._i = 0
+
+     def reset(self, ctx: Any) -> None:
+         self._i = 0
+
+     def act(self, observation: dict, Command: Any) -> list:
+         from .agent import _to_commands
+
+         if self._i < len(self._turns):
+             calls = self._turns[self._i]
+             self._i += 1
+             return _to_commands(calls, Command) or [Command.observe()]
+         return [Command.observe()]
+
+
+ def _is_passive(cmds: list, _cmd_name) -> bool:
+     """A turn with no command other than observe/stop (or no command)."""
+     if not cmds:
+         return True
+     return all((_cmd_name(c) or "") in _PASSIVE_TOOLS for c in cmds)
+
+
+ class HandoffController(BaseController):
+     """`prefix` plays turns 0..k-1, then `main` inherits the live state
+     and finishes the episode. Pure state handoff - `main` carries no
+     transcript of the prefix.
+
+     `handoff_stats` accumulates, over the MAIN agent's turns only:
+     `main_turns`, `passive_turns` (observe/stop-only), and `passivity`
+     (their ratio) - the freeze-and-panic signal. When the prefix handed
+     `main` a losing position, `passivity` IS passivity-under-pressure."""
+
+     def __init__(
+         self, prefix: Any, main: Any, k: int, name: str | None = None
+     ) -> None:
+         super().__init__(name or f"handoff-k{int(k)}")
+         self._prefix = as_controller(prefix)
+         self._main = as_controller(main)
+         self._k = max(0, int(k))
+         self._turn = 0
+         # Playback should record the MAIN agent's transcript, not this
+         # wrapper's - expose it as the introspection source.
+         self.source = introspection_source(self._main)
+         self.handoff_stats = self._fresh_stats()
+
+     def _fresh_stats(self) -> dict:
+         return {
+             "k": self._k, "main_turns": 0,
+             "passive_turns": 0, "passivity": 0.0,
+         }
+
+     def reset(self, ctx: Any) -> None:
+         self._turn = 0
+         self._prefix.reset(ctx)
+         self._main.reset(ctx)
+         self.handoff_stats = self._fresh_stats()
+
+     def act(self, observation: dict, Command: Any) -> list:
+         if self._turn < self._k:
+             self._turn += 1
+             return self._prefix.act(observation, Command)
+         cmds = self._main.act(observation, Command)
+         self._turn += 1
+         from .eval_core import _cmd_tool_name
+
+         st = self.handoff_stats
+         st["main_turns"] += 1
+         if _is_passive(cmds, _cmd_tool_name):
+             st["passive_turns"] += 1
+         st["passivity"] = st["passive_turns"] / st["main_turns"]
+         return cmds
+
+
+ def run_handoff(
+     compiled: Any, main: Any, prefix: Any, k: int,
+     seed: int = 0, playback: Any = None,
+ ):
+     """Run a handoff episode: `prefix` plays the first `k` turns, `main`
+     finishes. Returns the `EpisodeResult` with `.handoff_stats` attached
+     (k, main_turns, passive_turns, passivity)."""
+     from .eval_core import run_level
+
+     ctrl = HandoffController(prefix, main, k)
+     res = run_level(compiled, ctrl, seed, playback)
+     res.handoff_stats = dict(ctrl.handoff_stats)
+     return res
openra_bench/run_eval.py CHANGED
@@ -99,6 +99,50 @@ def _agg(scores: list) -> dict:
      }
 
 
+ def _find_win_trajectory(bank: str | Path, cell: str, seed: int) -> str | None:
+     """Path to a winning run's messages.json for this cell+seed, scanned
+     from a `--handoff-bank` directory of Playback runs - the good-prefix
+     source. None when the bank holds no matching win. (Engine actor ids
+     are seed-deterministic, so the trajectory must match pack/level/seed
+     for a faithful replay.)"""
+     base = cell.rsplit(":handoff-", 1)[0]  # "pack:level"
+     pack_id, _, level = base.partition(":")
+     for mf in sorted(Path(bank).rglob("manifest.json")):
+         try:
+             m = json.loads(mf.read_text())
+         except (ValueError, OSError):
+             continue
+         if (
+             str(m.get("pack_id")) == pack_id
+             and str(m.get("level")) == level
+             and int(m.get("seed", -1)) == int(seed)
+             and str(m.get("outcome")) == "win"
+             and (mf.parent / "messages.json").exists()
+         ):
+             return str(mf.parent / "messages.json")
+     return None
+
+
+ def _handoff_wrap(agent, cell: str, seed: int, k: int, bank):
+     """Wrap `agent` in a HandoffController for a `:handoff-<kind>` cell.
+     Returns (controller, note)."""
+     from .handoff import HandoffController, TrajectoryController, stall_policy
+
+     kind = cell.rsplit(":handoff-", 1)[1]
+     if kind == "bad":  # losing prefix - the recovery / freeze test
+         return HandoffController(stall_policy, agent, k), ""
+     if kind == "good":  # winning prefix - capitalize-on-advantage
+         traj = _find_win_trajectory(bank, cell, seed) if bank else None
+         if traj is None:
+             return (
+                 HandoffController(stall_policy, agent, 0),
+                 f"no winning trajectory in bank for seed {seed} - ran as base",
+             )
+         return HandoffController(TrajectoryController(traj), agent, k), ""
+     # base - k=0; the model plays the whole episode (baseline passivity).
+     return HandoffController(stall_policy, agent, 0), ""
+
+
  def evaluate(
      packs: list[Path],
      levels: list[str],
@@ -118,6 +162,9 @@ def evaluate(
      report_path: str | Path | None = None,
      progress=None,
      perception_sweep: bool = False,
+     handoff_sweep: bool = False,
+     handoff_k: int = 3,
+     handoff_bank: str | Path | None = None,
  ) -> dict:
      """Run packs×levels×seeds. If `held_out_seeds` is given, those are
      run too and tagged split='held_out'; the report adds
@@ -129,6 +176,14 @@ def evaluate(
      ablation cells (`pack:level:<mode>` for mode in PERCEPTION_MODES -
      vision/structured × fog/no-fog) instead of the raw 3 levels, so one
      run yields the full channel-cost / fog-cost decomposition.
+
+     `handoff_sweep` expands every pack×level into handoff cells
+     (`pack:level:handoff-{base,bad,good}`): the model plays the whole
+     episode (`base`), or inherits a losing position after a `stall`
+     prefix (`bad` - the recovery / freeze-and-panic test), or a winning
+     position replayed from a `handoff_bank` trajectory (`good` - the
+     capitalize-on-advantage test). `handoff_k` is the prefix length.
+     Each record carries a `passivity` stat (observe/stop-only fraction).
      """
      from .resilience import (
          BudgetExceeded,
@@ -220,6 +275,16 @@ def evaluate(
                      cl.fog_mode = mode
                      cl.config_name = f"{lv}:{mode}"
                      unit_iter.append((cl, f"{pack.meta.id}:{lv}:{mode}"))
+         # Handoff sweep: each level as base / bad / good handoff cells.
+         # `good` needs a winning trajectory from the bank - emitted only
+         # when a bank is supplied; `base`/`bad` always run.
+         elif handoff_sweep:
+             kinds = ["base", "bad"] + (["good"] if handoff_bank else [])
+             unit_iter = [
+                 (compile_level(pack, lv), f"{pack.meta.id}:{lv}:handoff-{kind}")
+                 for lv in levels
+                 for kind in kinds
+             ]
          # Declared configs (pack:config_name, each pins level+fog_mode)
          # supersede the raw 3-level enumeration when present.
          elif pack.configs:
@@ -258,7 +323,19 @@ def evaluate(
                          seed,
                      )
                      pb.run_id, pb.model = run_id, model
-                 res = run_level(compiled, factory(compiled), seed=seed, playback=pb)
+                 ctrl = factory(compiled)
+                 if handoff_sweep and ":handoff-" in cell:
+                     ctrl, _hnote = _handoff_wrap(
+                         ctrl, cell, seed, handoff_k, handoff_bank
+                     )
+                 else:
+                     _hnote = ""
+                 res = run_level(compiled, ctrl, seed=seed, playback=pb)
+                 hstats = getattr(ctrl, "handoff_stats", None)
+                 if hstats is not None:
+                     hstats = dict(hstats)
+                     if _hnote:
+                         hstats["note"] = _hnote
                  sc = score_episode(compiled, res)
                  if pb is not None:
                      (pb.dir / "score.json").write_text(
@@ -292,6 +369,8 @@ def evaluate(
                      "reward_vector": res.reward_vector,
                      "turns": res.turns,
                      "notes": sc.notes,
+                     "passivity": hstats.get("passivity") if hstats else None,
+                     "handoff": hstats,
                      "_sc": sc,
                  }
 
@@ -600,6 +679,15 @@ def main(argv: list[str]) -> int:
                      help="run the 2x2 perception ablation: every "
                           "pack:level expanded into vision/structured x "
                           "fog/no-fog (pack:level:<mode>)")
+     ap.add_argument("--handoff-sweep", action="store_true",
+                     help="run the handoff ablation: each pack:level as "
+                          "handoff-base / handoff-bad (recovery) / handoff-good "
+                          "(capitalize) cells")
+     ap.add_argument("--handoff-k", type=int, default=3,
+                     help="handoff prefix length in turns (default 3)")
+     ap.add_argument("--handoff-bank", default=None,
+                     help="dir of Playback runs - source of winning "
+                          "trajectories for the handoff-good prefix")
      a = ap.parse_args(argv[1:])
 
      cfg = None
@@ -642,6 +730,9 @@ def main(argv: list[str]) -> int:
          dry_run=a.dry_run,
          report_path=a.out,
          perception_sweep=a.perception_sweep,
+         handoff_sweep=a.handoff_sweep,
+         handoff_k=a.handoff_k,
+         handoff_bank=a.handoff_bank,
          progress=lambda d, n, rec, c: print(
              f"[{d}/{n}] {rec['cell']}:{rec['split']}#{rec['seed']} "
              f"{rec['outcome']} comp={rec['composite']} "
tests/test_handoff_ablation.py ADDED
@@ -0,0 +1,152 @@
+ """The handoff ablation - hand a model a partially-played game.
+
+ A `prefix` controller plays the first K turns, then the model inherits
+ the live state. A GOOD prefix (winning trajectory) tests
+ capitalize-on-advantage; a BAD prefix (`stall`) tests recovery - and
+ the `passivity` stat (observe/stop-only turns) quantifies the
+ freeze-and-panic failure mode the recovery cell is built to expose.
+ """
+
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+
+ import pytest
+
+ pytest.importorskip("openra_rl_training", reason="Rust env wheel not installed")
+
+ from openra_bench.handoff import (HandoffController, TrajectoryController,
+                                   _load_trajectory, run_handoff, stall_policy)
+ from openra_bench.scenarios import load_pack
+ from openra_bench.scenarios.loader import compile_level
+
+ PACKS = Path(__file__).parent.parent / "openra_bench" / "scenarios" / "packs"
+ _PACK = "perception-count-the-threat.yaml"
+
+
+ def _compiled(level: str = "easy"):
+     return compile_level(load_pack(PACKS / _PACK), level)
+
+
+ # ── Trajectory loading / replay ──────────────────────────────────────
+
+ def test_load_trajectory_list_passthrough():
+     traj = [[{"name": "observe", "arguments": {}}]]
+     assert _load_trajectory(traj) is traj
+
+
+ def test_load_trajectory_from_messages_json(tmp_path):
+     msgs = [
+         {"role": "system", "content": "x"},
+         {"role": "user", "content": "turn 1"},
+         {"role": "assistant", "content": "", "tool_calls": [
+             {"id": "c0", "type": "function", "function": {
+                 "name": "move_units",
+                 "arguments": {"unit_ids": [1], "target_x": 5, "target_y": 5},
+             }}]},
+         {"role": "tool", "tool_call_id": "c0", "content": "ok"},
+         {"role": "user", "content": "turn 2"},
+         {"role": "assistant", "content": "", "tool_calls": [
+             {"id": "c0", "type": "function",
+              "function": {"name": "observe", "arguments": {}}}]},
+     ]
+     p = tmp_path / "messages.json"
+     p.write_text(json.dumps(msgs))
+     turns = _load_trajectory(p)
+     assert len(turns) == 2
+     assert turns[0][0]["name"] == "move_units"
+     assert turns[1][0]["name"] == "observe"
+
+
+ def test_trajectory_controller_replays_then_falls_back():
+     import openra_train
+
+     C = openra_train.Command
+     tc = TrajectoryController([
+         [{"name": "observe", "arguments": {}}],
+         [{"name": "stop", "arguments": {"unit_ids": [1]}}],
+     ])
+     tc.reset(None)
+     assert "Observe" in repr(tc.act({}, C)[0])
+     assert "Stop" in repr(tc.act({}, C)[0])
+     # past the recording's end → observe
+     assert "Observe" in repr(tc.act({}, C)[0])
+
+
+ # ── Handoff switch + passivity ───────────────────────────────────────
+
+ def test_handoff_switches_prefix_to_main_at_k():
+     pcalls, mcalls = [], []
+
+     def prefix(rs, C):
+         pcalls.append(1)
+         return [C.observe()]
+
+     def main(rs, C):
+         mcalls.append(1)
+         return [C.observe()]
+
+     res = run_handoff(_compiled("easy"), main=main, prefix=prefix, k=3, seed=1)
+     assert len(pcalls) == 3, "prefix must play exactly k turns"
+     assert len(mcalls) == res.turns - 3, "main plays the remainder"
+     assert res.handoff_stats["k"] == 3
+     assert res.handoff_stats["main_turns"] == len(mcalls)
+
+
+ def test_passivity_is_one_when_main_freezes():
+     """A main that only ever observes scores passivity 1.0 - the
+     freeze-and-panic signal; an active policy scores low."""
+     from openra_bench.eval_core import scripted_explore_agent
+
+     frozen = run_handoff(
+         _compiled("medium"), main=stall_policy, prefix=stall_policy,
+         k=2, seed=1,
+     )
+     assert frozen.handoff_stats["passivity"] == 1.0
+
+     active = run_handoff(
+         _compiled("medium"), main=scripted_explore_agent,
+         prefix=stall_policy, k=2, seed=1,
+     )
+     assert active.handoff_stats["passivity"] < 0.5
+
+
+ def test_k0_handoff_is_a_full_episode():
+     from openra_bench.eval_core import scripted_explore_agent
+
+     res = run_handoff(
+         _compiled("easy"), main=scripted_explore_agent,
+         prefix=stall_policy, k=0, seed=1,
+     )
+     assert res.handoff_stats["main_turns"] == res.turns
+
+
+ # ── Sweep wiring ─────────────────────────────────────────────────────
+
+ def test_handoff_sweep_expands_base_and_bad_cells():
+     from openra_bench.run_eval import evaluate
+
+     out = evaluate(
+         [PACKS / _PACK], levels=["easy"], seeds=[1],
+         handoff_sweep=True, dry_run=True,
+     )
+     assert set(out["cells"]) == {
+         "perception-count-the-threat:easy:handoff-base",
+         "perception-count-the-threat:easy:handoff-bad",
+     }
+
+
+ def test_find_win_trajectory_matches_a_banked_win(tmp_path):
+     from openra_bench.run_eval import _find_win_trajectory
+
+     d = tmp_path / "run" / "p__seed1"
+     d.mkdir(parents=True)
+     (d / "manifest.json").write_text(json.dumps(
+         {"pack_id": "p", "level": "easy", "seed": 1, "outcome": "win"}))
+     (d / "messages.json").write_text("[]")
+     assert _find_win_trajectory(
+         tmp_path, "p:easy:handoff-good", 1
+     ) == str(d / "messages.json")
+     # a different seed / a loss is not matched
+     assert _find_win_trajectory(tmp_path, "p:easy:handoff-good", 2) is None