Spaces: qpluslab / OpenRA-Bench
yxc20098 committed on May 22

Commit cb15568
1 Parent(s): 4a5b0dd

Add handoff ablation (recover-from-deficit / capitalize-on-advantage)

A `prefix` controller plays the first K turns of an episode, then the
model inherits the live game state and finishes it — a pure STATE
handoff ("take over from here"), no transcript carried over, so it
stays orthogonal to the in-context-learning axis.

Sweeping the prefix quality separates two reasoning capabilities:
- a losing prefix (`stall`) hands the model a deficit — the recovery
test;
- a winning prefix (a replayed winning trajectory) hands it an
advantage — the capitalize test.

Every result carries a `passivity` stat — the fraction of the model's
turns spent on `observe`/`stop` only. Under the bad-prefix deficit
that is passivity-under-pressure: a number for the freeze-and-panic
failure mode (when losing, models tend to stop/observe instead of
ordering an active retreat or redirect).
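
To make the `passivity` definition concrete, here is a minimal standalone sketch. It assumes each turn is represented as a plain list of tool-name strings, which is a simplification for illustration (the commit itself inspects engine `Command` objects), and `attack_unit` is a hypothetical tool name. An empty turn counts as passive, matching `_is_passive` in the diff.

```python
PASSIVE_TOOLS = {"observe", "stop"}  # same set as _PASSIVE_TOOLS in the commit


def passivity(turns: list[list[str]]) -> float:
    """Fraction of turns whose commands are all observe/stop (or empty)."""
    if not turns:
        return 0.0  # no turns played yet; mirrors the 0.0 initial stat
    # all() over an empty list is True, so a turn with no commands is passive.
    passive = sum(1 for names in turns if all(n in PASSIVE_TOOLS for n in names))
    return passive / len(turns)


# Hypothetical four-turn episode: turns 1 and 3 are passive, 2 and 4 are not.
example = [["observe"], ["move_units", "observe"], ["stop"], ["attack_unit"]]
print(passivity(example))
```

Note that a turn mixing `observe` with an active command is not passive: only turns made up entirely of `observe`/`stop` count.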

- handoff.py: TrajectoryController (replays a recorded run's per-turn
tool_calls from messages.json), HandoffController (k-turn prefix →
main switch + passivity tracking), run_handoff helper.
- run_eval.py: --handoff-sweep expands each pack:level into
handoff-{base,bad,good} cells; --handoff-k, --handoff-bank.

Pinned by tests/test_handoff_ablation.py.
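
The k-turn switch itself can be sketched without the engine. This is an illustrative stand-in, not the `HandoffController` from the commit: the `Policy` type, `make_handoff`, and the two toy policies are hypothetical names, and policies here return tool-name strings instead of `Command` objects.

```python
from typing import Callable

Policy = Callable[[dict], list[str]]  # hypothetical: observation -> tool names


def make_handoff(prefix: Policy, main: Policy, k: int) -> Policy:
    """Return a policy that delegates to `prefix` for k turns, then to `main`."""
    turn = 0

    def act(observation: dict) -> list[str]:
        nonlocal turn
        chosen = prefix if turn < k else main
        turn += 1
        # `main` only ever sees the current observation, never what the
        # prefix did: that is the "pure state handoff" property.
        return chosen(observation)

    return act


calls = {"prefix": 0, "main": 0}


def stall(observation: dict) -> list[str]:
    calls["prefix"] += 1
    return ["observe"]


def explorer(observation: dict) -> list[str]:
    calls["main"] += 1
    return ["move_units"]


policy = make_handoff(stall, explorer, k=3)
outputs = [policy({}) for _ in range(5)]
print(calls)
```

With `k=0` the prefix is never called and `main` plays the whole episode, which is how the `handoff-base` cell is described in the commit.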

Files changed (4)

CLAUDE.md +13 -0
openra_bench/handoff.py +165 -0
openra_bench/run_eval.py +92 -1
tests/test_handoff_ablation.py +152 -0

CLAUDE.md CHANGED

@@ -228,6 +228,19 @@ A scenario is defective if any of the following hold:
   the full grid with `run_eval --perception-sweep` (expands every
   `pack:level` into `pack:level:<mode>`); the human Play tab stays
   on the canonical `vision` (fogged) modality.
+- **Handoff ablation** (`openra_bench/handoff.py`, `run_eval
+  --handoff-sweep`). A `HandoffController` lets a `prefix` controller
+  play the first K turns, then the model inherits the live game state
+  ("take over from here" — a pure STATE handoff, no transcript). A
+  `stall` prefix hands the model a losing position (recovery test); a
+  replayed winning trajectory (`TrajectoryController`, sourced from a
+  `--handoff-bank` of Playback runs) hands it a winning one
+  (capitalize-on-advantage). Sweep cells are
+  `pack:level:handoff-{base,bad,good}`. Every result carries a
+  `passivity` stat — the fraction of the model's turns spent on
+  `observe`/`stop` only — the freeze-and-panic signal. A replayed
+  trajectory MUST come from the same `pack:level:seed` (engine actor
+  ids are seed-deterministic).
 - **`pbox` costs 600** (not the 400 some old specs assumed);
   defense and infantry are SEPARATE production queues so an
   efficient policy queues `build('pbox')` and `build('e1')` in

openra_bench/handoff.py ADDED

	@@ -0,0 +1,165 @@

+"""Handoff ablation — hand a model a partially-played game.
+A handoff episode is split: a `prefix` controller plays the first `k`
+turns, then the model inherits the live game state and finishes it.
+It is a PURE STATE handoff — the model gets no transcript of the
+prefix, only the board it produced ("take over from here").
+Sweeping the prefix QUALITY decomposes two capabilities:
+* a **good** prefix (a winning trajectory) → can the model *capitalize
+  on an advantage*? A flat-low outcome curve means it derails even a
+  won position.
+* a **bad** prefix (a losing trajectory, or `stall`) → can the model
+  *recover from a deficit*? This is the controlled measurement of the
+  freeze-and-panic failure mode: handed a losing board, does the model
+  fight (retreat / redirect) or sit on `observe`/`stop`? The
+  `passivity` stat on the result quantifies exactly that.
+A trajectory prefix is a recorded run replayed turn-for-turn. Because
+engine actor ids are seed-deterministic, a replayed trajectory MUST come
+from the same `pack:level:seed` as the handoff episode.
+"""
+from __future__ import annotations
+import json
+from pathlib import Path
+from typing import Any
+from .controller import BaseController, as_controller, introspection_source
+# A turn is "passive" when the model issued nothing but these — the
+# freeze-and-panic tell (low-commitment default instead of an active
+# retreat / redirect).
+_PASSIVE_TOOLS = {"observe", "stop"}
+def stall_policy(render_state: dict, Command: Any) -> list:
+    """The canonical losing prefix: do nothing, every turn. Synthesises
+    a guaranteed-deficit handoff with no recorded trajectory needed."""
+    return [Command.observe()]
+def _load_trajectory(source: Any) -> list[list[dict]]:
+    """Per-turn tool-call lists from a recorded run. `source` may be a
+    ready list, a Playback directory, or a path to its messages.json."""
+    if isinstance(source, list):
+        return source
+    p = Path(source)
+    if p.is_dir():
+        p = p / "messages.json"
+    msgs = json.loads(p.read_text())
+    turns: list[list[dict]] = []
+    for m in msgs:
+        if m.get("role") != "assistant":
+            continue
+        calls: list[dict] = []
+        for tc in m.get("tool_calls") or []:
+            fn = tc.get("function") or {}
+            args = fn.get("arguments")
+            if isinstance(args, str):
+                try:
+                    args = json.loads(args)
+                except (ValueError, TypeError):
+                    args = {}
+            calls.append({"name": fn.get("name"), "arguments": args or {}})
+        turns.append(calls)
+    return turns
+class TrajectoryController(BaseController):
+    """Replays a recorded run: turn N re-issues the commands the
+    recorded agent issued on its turn N. Past the recording's end it
+    falls back to `observe()`. Used as a deterministic handoff prefix —
+    a `win`-outcome run is a good prefix, a `loss` is a bad one."""
+    def __init__(self, source: Any, name: str | None = None) -> None:
+        super().__init__(name or "trajectory")
+        self._turns = _load_trajectory(source)
+        self._i = 0
+    def reset(self, ctx: Any) -> None:
+        self._i = 0
+    def act(self, observation: dict, Command: Any) -> list:
+        from .agent import _to_commands
+        if self._i < len(self._turns):
+            calls = self._turns[self._i]
+            self._i += 1
+            return _to_commands(calls, Command) or [Command.observe()]
+        return [Command.observe()]
+def _is_passive(cmds: list, _cmd_name) -> bool:
+    """A turn with no command other than observe/stop (or no command)."""
+    if not cmds:
+        return True
+    return all((_cmd_name(c) or "") in _PASSIVE_TOOLS for c in cmds)
+class HandoffController(BaseController):
+    """`prefix` plays turns 0..k-1, then `main` inherits the live state
+    and finishes the episode. Pure state handoff — `main` carries no
+    transcript of the prefix.
+    `handoff_stats` accumulates, over the MAIN agent's turns only:
+    `main_turns`, `passive_turns` (observe/stop-only), and `passivity`
+    (their ratio) — the freeze-and-panic signal. When the prefix handed
+    `main` a losing position, `passivity` IS passivity-under-pressure."""
+    def __init__(
+        self, prefix: Any, main: Any, k: int, name: str | None = None
+    ) -> None:
+        super().__init__(name or f"handoff-k{int(k)}")
+        self._prefix = as_controller(prefix)
+        self._main = as_controller(main)
+        self._k = max(0, int(k))
+        self._turn = 0
+        # Playback should record the MAIN agent's transcript, not this
+        # wrapper's — expose it as the introspection source.
+        self.source = introspection_source(self._main)
+        self.handoff_stats = self._fresh_stats()
+    def _fresh_stats(self) -> dict:
+        return {
+            "k": self._k, "main_turns": 0,
+            "passive_turns": 0, "passivity": 0.0,
+        }
+    def reset(self, ctx: Any) -> None:
+        self._turn = 0
+        self._prefix.reset(ctx)
+        self._main.reset(ctx)
+        self.handoff_stats = self._fresh_stats()
+    def act(self, observation: dict, Command: Any) -> list:
+        if self._turn < self._k:
+            self._turn += 1
+            return self._prefix.act(observation, Command)
+        cmds = self._main.act(observation, Command)
+        self._turn += 1
+        from .eval_core import _cmd_tool_name
+        st = self.handoff_stats
+        st["main_turns"] += 1
+        if _is_passive(cmds, _cmd_tool_name):
+            st["passive_turns"] += 1
+        st["passivity"] = st["passive_turns"] / st["main_turns"]
+        return cmds
+def run_handoff(
+    compiled: Any, main: Any, prefix: Any, k: int,
+    seed: int = 0, playback: Any = None,
+):
+    """Run a handoff episode: `prefix` plays the first `k` turns, `main`
+    finishes. Returns the `EpisodeResult` with `.handoff_stats` attached
+    (k, main_turns, passive_turns, passivity)."""
+    from .eval_core import run_level
+    ctrl = HandoffController(prefix, main, k)
+    res = run_level(compiled, ctrl, seed, playback)
+    res.handoff_stats = dict(ctrl.handoff_stats)
+    return res
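
The shape of the data that `_load_trajectory` consumes and produces may be easier to see on a small example. The sketch below is a condensed standalone re-statement of the parsing loop above, not an import of it; `tool_calls_per_turn` and the sample messages are hypothetical. It exercises two cases the tests do not cover directly: arguments stored as a JSON string, and arguments that fail to parse.

```python
import json


def tool_calls_per_turn(messages: list[dict]) -> list[list[dict]]:
    """One list of {"name", "arguments"} dicts per assistant message."""
    turns = []
    for msg in messages:
        if msg.get("role") != "assistant":
            continue  # system / user / tool messages are not turns
        calls = []
        for tc in msg.get("tool_calls") or []:
            fn = tc.get("function") or {}
            args = fn.get("arguments")
            if isinstance(args, str):
                try:
                    args = json.loads(args)  # string-encoded arguments
                except ValueError:
                    args = {}  # unparseable arguments degrade to empty
            calls.append({"name": fn.get("name"), "arguments": args or {}})
        turns.append(calls)
    return turns


messages = [
    {"role": "user", "content": "turn 1"},
    {"role": "assistant", "tool_calls": [{"function": {
        "name": "move_units",
        "arguments": '{"unit_ids": [1], "target_x": 5, "target_y": 5}',
    }}]},
    {"role": "assistant", "tool_calls": [{"function": {
        "name": "observe", "arguments": "not json",
    }}]},
    {"role": "assistant", "content": "no tool calls this turn"},
]
turns = tool_calls_per_turn(messages)
print(turns)
```

An assistant message with no tool calls still yields a turn (an empty list), so turn indices stay aligned with the recording; in `TrajectoryController.act` such a turn falls back to `observe()`.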

openra_bench/run_eval.py CHANGED

@@ -99,6 +99,50 @@ def _agg(scores: list) -> dict:
     }
 def evaluate(
     packs: list[Path],
     levels: list[str],
@@ -118,6 +162,9 @@ def evaluate(
     report_path: str | Path | None = None,
     progress=None,
     perception_sweep: bool = False,
 ) -> dict:
     """Run packs×levels×seeds. If `held_out_seeds` is given, those are
     run too and tagged split='held_out'; the report adds
@@ -129,6 +176,14 @@ def evaluate(
     ablation cells (`pack:level:<mode>` for mode in PERCEPTION_MODES —
     vision/structured × fog/no-fog) instead of the raw 3 levels, so one
     run yields the full channel-cost / fog-cost decomposition.
     """
     from .resilience import (
         BudgetExceeded,
@@ -220,6 +275,16 @@ def evaluate(
                     cl.fog_mode = mode
                     cl.config_name = f"{lv}:{mode}"
                     unit_iter.append((cl, f"{pack.meta.id}:{lv}:{mode}"))
         # Declared configs (pack:config_name, each pins level+fog_mode)
         # supersede the raw 3-level enumeration when present.
         elif pack.configs:
@@ -258,7 +323,19 @@ def evaluate(
                 seed,
             )
             pb.run_id, pb.model = run_id, model
-        res = run_level(compiled, factory(compiled), seed=seed, playback=pb)
         sc = score_episode(compiled, res)
         if pb is not None:
             (pb.dir / "score.json").write_text(
@@ -292,6 +369,8 @@ def evaluate(
             "reward_vector": res.reward_vector,
             "turns": res.turns,
             "notes": sc.notes,
             "_sc": sc,
         }
@@ -600,6 +679,15 @@ def main(argv: list[str]) -> int:
                     help="run the 2x2 perception ablation: every "
                     "pack:level expanded into vision/structured x "
                     "fog/no-fog (pack:level:<mode>)")
     a = ap.parse_args(argv[1:])
     cfg = None
@@ -642,6 +730,9 @@ def main(argv: list[str]) -> int:
         dry_run=a.dry_run,
         report_path=a.out,
         perception_sweep=a.perception_sweep,
         progress=lambda d, n, rec, c: print(
             f"[{d}/{n}] {rec['cell']}:{rec['split']}#{rec['seed']} "
             f"{rec['outcome']} comp={rec['composite']} "

     }
+def _find_win_trajectory(bank: str | Path, cell: str, seed: int) -> str | None:
+    """Path to a winning run's messages.json for this cell+seed, scanned
+    from a `--handoff-bank` directory of Playback runs — the good-prefix
+    source. None when the bank holds no matching win. (Engine actor ids
+    are seed-deterministic, so the trajectory must match pack/level/seed
+    for a faithful replay.)"""
+    base = cell.rsplit(":handoff-", 1)[0]  # "pack:level"
+    pack_id, _, level = base.partition(":")
+    for mf in sorted(Path(bank).rglob("manifest.json")):
+        try:
+            m = json.loads(mf.read_text())
+        except (ValueError, OSError):
+            continue
+        if (
+            str(m.get("pack_id")) == pack_id
+            and str(m.get("level")) == level
+            and int(m.get("seed", -1)) == int(seed)
+            and str(m.get("outcome")) == "win"
+            and (mf.parent / "messages.json").exists()
+        ):
+            return str(mf.parent / "messages.json")
+    return None
+def _handoff_wrap(agent, cell: str, seed: int, k: int, bank):
+    """Wrap `agent` in a HandoffController for a `:handoff-<kind>` cell.
+    Returns (controller, note)."""
+    from .handoff import HandoffController, TrajectoryController, stall_policy
+    kind = cell.rsplit(":handoff-", 1)[1]
+    if kind == "bad":  # losing prefix — the recovery / freeze test
+        return HandoffController(stall_policy, agent, k), ""
+    if kind == "good":  # winning prefix — capitalize-on-advantage
+        traj = _find_win_trajectory(bank, cell, seed) if bank else None
+        if traj is None:
+            return (
+                HandoffController(stall_policy, agent, 0),
+                f"no winning trajectory in bank for seed {seed} — ran as base",
+            )
+        return HandoffController(TrajectoryController(traj), agent, k), ""
+    # base — k=0; the model plays the whole episode (baseline passivity).
+    return HandoffController(stall_policy, agent, 0), ""
 def evaluate(
     packs: list[Path],
     levels: list[str],
     report_path: str | Path | None = None,
     progress=None,
     perception_sweep: bool = False,
+    handoff_sweep: bool = False,
+    handoff_k: int = 3,
+    handoff_bank: str | Path | None = None,
 ) -> dict:
     """Run packs×levels×seeds. If `held_out_seeds` is given, those are
     run too and tagged split='held_out'; the report adds
     ablation cells (`pack:level:<mode>` for mode in PERCEPTION_MODES —
     vision/structured × fog/no-fog) instead of the raw 3 levels, so one
     run yields the full channel-cost / fog-cost decomposition.
+    `handoff_sweep` expands every pack×level into handoff cells
+    (`pack:level:handoff-{base,bad,good}`): the model plays the whole
+    episode (`base`), or inherits a losing position after a `stall`
+    prefix (`bad` — the recovery / freeze-and-panic test), or a winning
+    position replayed from a `handoff_bank` trajectory (`good` — the
+    capitalize-on-advantage test). `handoff_k` is the prefix length.
+    Each record carries a `passivity` stat (observe/stop-only fraction).
     """
     from .resilience import (
         BudgetExceeded,
                     cl.fog_mode = mode
                     cl.config_name = f"{lv}:{mode}"
                     unit_iter.append((cl, f"{pack.meta.id}:{lv}:{mode}"))
+        # Handoff sweep: each level as base / bad / good handoff cells.
+        # `good` needs a winning trajectory from the bank — emitted only
+        # when a bank is supplied; `base`/`bad` always run.
+        elif handoff_sweep:
+            kinds = ["base", "bad"] + (["good"] if handoff_bank else [])
+            unit_iter = [
+                (compile_level(pack, lv), f"{pack.meta.id}:{lv}:handoff-{kind}")
+                for lv in levels
+                for kind in kinds
+            ]
         # Declared configs (pack:config_name, each pins level+fog_mode)
         # supersede the raw 3-level enumeration when present.
         elif pack.configs:
                 seed,
             )
             pb.run_id, pb.model = run_id, model
+        ctrl = factory(compiled)
+        if handoff_sweep and ":handoff-" in cell:
+            ctrl, _hnote = _handoff_wrap(
+                ctrl, cell, seed, handoff_k, handoff_bank
+            )
+        else:
+            _hnote = ""
+        res = run_level(compiled, ctrl, seed=seed, playback=pb)
+        hstats = getattr(ctrl, "handoff_stats", None)
+        if hstats is not None:
+            hstats = dict(hstats)
+            if _hnote:
+                hstats["note"] = _hnote
         sc = score_episode(compiled, res)
         if pb is not None:
             (pb.dir / "score.json").write_text(
             "reward_vector": res.reward_vector,
             "turns": res.turns,
             "notes": sc.notes,
+            "passivity": hstats.get("passivity") if hstats else None,
+            "handoff": hstats,
             "_sc": sc,
         }
                     help="run the 2x2 perception ablation: every "
                     "pack:level expanded into vision/structured x "
                     "fog/no-fog (pack:level:<mode>)")
+    ap.add_argument("--handoff-sweep", action="store_true",
+                    help="run the handoff ablation: each pack:level as "
+                    "handoff-base / handoff-bad (recovery) / handoff-good "
+                    "(capitalize) cells")
+    ap.add_argument("--handoff-k", type=int, default=3,
+                    help="handoff prefix length in turns (default 3)")
+    ap.add_argument("--handoff-bank", default=None,
+                    help="dir of Playback runs — source of winning "
+                    "trajectories for the handoff-good prefix")
     a = ap.parse_args(argv[1:])
     cfg = None
         dry_run=a.dry_run,
         report_path=a.out,
         perception_sweep=a.perception_sweep,
+        handoff_sweep=a.handoff_sweep,
+        handoff_k=a.handoff_k,
+        handoff_bank=a.handoff_bank,
         progress=lambda d, n, rec, c: print(
             f"[{d}/{n}] {rec['cell']}:{rec['split']}#{rec['seed']} "
             f"{rec['outcome']} comp={rec['composite']} "
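
Once a sweep has run, the per-record `passivity` values can be compared across the `base` / `bad` / `good` cells. The sketch below is a hedged illustration only: it assumes the records are available as a list of dicts carrying the `cell` and `passivity` keys shown in the diff, and the helper names and sample numbers are hypothetical. How records are laid out inside the written report is not shown in this commit.

```python
from collections import defaultdict
from statistics import mean


def split_cell(cell: str) -> tuple[str, str, str]:
    """'pack:level:handoff-kind' -> (pack_id, level, kind)."""
    base, _, kind = cell.rpartition(":handoff-")
    pack_id, _, level = base.partition(":")
    return pack_id, level, kind


def mean_passivity_by_kind(records: list[dict]) -> dict[str, float]:
    """Average passivity per handoff kind, skipping non-handoff records."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for rec in records:
        if ":handoff-" not in rec["cell"] or rec.get("passivity") is None:
            continue  # plain cells carry passivity=None in the diff
        _, _, kind = split_cell(rec["cell"])
        grouped[kind].append(rec["passivity"])
    return {kind: mean(vals) for kind, vals in grouped.items()}


# Hypothetical records; the numbers are made up for illustration.
records = [
    {"cell": "p:easy:handoff-base", "passivity": 0.125},
    {"cell": "p:easy:handoff-bad", "passivity": 0.75},
    {"cell": "p:easy:handoff-bad", "passivity": 0.25},
    {"cell": "p:easy", "passivity": None},
]
print(mean_passivity_by_kind(records))
```

A `bad` mean well above the `base` mean would be the passivity-under-pressure pattern the commit message describes. When reading `good` cells, check the `handoff.note` field first, since a missing bank trajectory makes that cell run as `base`.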

tests/test_handoff_ablation.py ADDED

	@@ -0,0 +1,152 @@

+"""The handoff ablation — hand a model a partially-played game.
+A `prefix` controller plays the first K turns, then the model inherits
+the live state. A GOOD prefix (winning trajectory) tests
+capitalize-on-advantage; a BAD prefix (`stall`) tests recovery — and
+the `passivity` stat (observe/stop-only turns) quantifies the
+freeze-and-panic failure mode the recovery cell is built to expose.
+"""
+from __future__ import annotations
+import json
+from pathlib import Path
+import pytest
+pytest.importorskip("openra_rl_training", reason="Rust env wheel not installed")
+from openra_bench.handoff import (HandoffController, TrajectoryController,
+                                  _load_trajectory, run_handoff, stall_policy)
+from openra_bench.scenarios import load_pack
+from openra_bench.scenarios.loader import compile_level
+PACKS = Path(__file__).parent.parent / "openra_bench" / "scenarios" / "packs"
+_PACK = "perception-count-the-threat.yaml"
+def _compiled(level: str = "easy"):
+    return compile_level(load_pack(PACKS / _PACK), level)
+# ── Trajectory loading / replay ──────────────────────────────────────
+def test_load_trajectory_list_passthrough():
+    traj = [[{"name": "observe", "arguments": {}}]]
+    assert _load_trajectory(traj) is traj
+def test_load_trajectory_from_messages_json(tmp_path):
+    msgs = [
+        {"role": "system", "content": "x"},
+        {"role": "user", "content": "turn 1"},
+        {"role": "assistant", "content": "", "tool_calls": [
+            {"id": "c0", "type": "function", "function": {
+                "name": "move_units",
+                "arguments": {"unit_ids": [1], "target_x": 5, "target_y": 5},
+            }}]},
+        {"role": "tool", "tool_call_id": "c0", "content": "ok"},
+        {"role": "user", "content": "turn 2"},
+        {"role": "assistant", "content": "", "tool_calls": [
+            {"id": "c0", "type": "function",
+             "function": {"name": "observe", "arguments": {}}}]},
+    ]
+    p = tmp_path / "messages.json"
+    p.write_text(json.dumps(msgs))
+    turns = _load_trajectory(p)
+    assert len(turns) == 2
+    assert turns[0][0]["name"] == "move_units"
+    assert turns[1][0]["name"] == "observe"
+def test_trajectory_controller_replays_then_falls_back():
+    import openra_train
+    C = openra_train.Command
+    tc = TrajectoryController([
+        [{"name": "observe", "arguments": {}}],
+        [{"name": "stop", "arguments": {"unit_ids": [1]}}],
+    ])
+    tc.reset(None)
+    assert "Observe" in repr(tc.act({}, C)[0])
+    assert "Stop" in repr(tc.act({}, C)[0])
+    # past the recording's end → observe
+    assert "Observe" in repr(tc.act({}, C)[0])
+# ── Handoff switch + passivity ───────────────────────────────────────
+def test_handoff_switches_prefix_to_main_at_k():
+    pcalls, mcalls = [], []
+    def prefix(rs, C):
+        pcalls.append(1)
+        return [C.observe()]
+    def main(rs, C):
+        mcalls.append(1)
+        return [C.observe()]
+    res = run_handoff(_compiled("easy"), main=main, prefix=prefix, k=3, seed=1)
+    assert len(pcalls) == 3, "prefix must play exactly k turns"
+    assert len(mcalls) == res.turns - 3, "main plays the remainder"
+    assert res.handoff_stats["k"] == 3
+    assert res.handoff_stats["main_turns"] == len(mcalls)
+def test_passivity_is_one_when_main_freezes():
+    """A main that only ever observes scores passivity 1.0 — the
+    freeze-and-panic signal; an active policy scores low."""
+    from openra_bench.eval_core import scripted_explore_agent
+    frozen = run_handoff(
+        _compiled("medium"), main=stall_policy, prefix=stall_policy,
+        k=2, seed=1,
+    )
+    assert frozen.handoff_stats["passivity"] == 1.0
+    active = run_handoff(
+        _compiled("medium"), main=scripted_explore_agent,
+        prefix=stall_policy, k=2, seed=1,
+    )
+    assert active.handoff_stats["passivity"] < 0.5
+def test_k0_handoff_is_a_full_episode():
+    from openra_bench.eval_core import scripted_explore_agent
+    res = run_handoff(
+        _compiled("easy"), main=scripted_explore_agent,
+        prefix=stall_policy, k=0, seed=1,
+    )
+    assert res.handoff_stats["main_turns"] == res.turns
+# ── Sweep wiring ─────────────────────────────────────────────────────
+def test_handoff_sweep_expands_base_and_bad_cells():
+    from openra_bench.run_eval import evaluate
+    out = evaluate(
+        [PACKS / _PACK], levels=["easy"], seeds=[1],
+        handoff_sweep=True, dry_run=True,
+    )
+    assert set(out["cells"]) == {
+        "perception-count-the-threat:easy:handoff-base",
+        "perception-count-the-threat:easy:handoff-bad",
+    }
+def test_find_win_trajectory_matches_a_banked_win(tmp_path):
+    from openra_bench.run_eval import _find_win_trajectory
+    d = tmp_path / "run" / "p__seed1"
+    d.mkdir(parents=True)
+    (d / "manifest.json").write_text(json.dumps(
+        {"pack_id": "p", "level": "easy", "seed": 1, "outcome": "win"}))
+    (d / "messages.json").write_text("[]")
+    assert _find_win_trajectory(
+        tmp_path, "p:easy:handoff-good", 1
+    ) == str(d / "messages.json")
+    # a different seed is not matched (no run is banked for seed 2)
+    assert _find_win_trajectory(tmp_path, "p:easy:handoff-good", 2) is None