
PROTEUS: Handoff

Living handoff record. Updated at each checkpoint close. SSOT trio: docs/superpowers/specs/2026-06-01-proteus-arena-slice-design.md (immutable design) + the per-CP plan in docs/superpowers/plans/ + this file.

Last updated: 2026-06-02. Two parallel slices shipped: the pack_evade scenario (64×64 open-field multi-cell evasion + persona weight-vector handover memory) and web LLM spectating (watch an LLM play the color grid with live reasoning/analysis + pause/step; SpectateSession pinned to SessionRunner(VanillaAgent) by a golden test; providers lazy-imported so offline import-safety holds). Both were built in isolated worktrees and merged clean. Before them: CP7/CP8 (LLM-generated memory pre-roll + persona agentness eval) and web interactive color-grid play; offline, headless. Before that: CP6 (difficulty layouts + category eval + compare).

Restructure (2026-06-02): game/web layout

The proteus/ package was reorganised into a cleaner layout with zero logic change; 253 tests remain green. The top-level sub-packages are now: game/ (engine, scenarios, agents, runtime, metrics, viz), web/ (local server + arena skeleton), shared/ (utilities), cli/ (parser + per-command modules), and the unchanged providers/. Docs: design/SSOT at docs/superpowers/specs/2026-06-02-proteus-game-web-restructure-design.md; implementation plan at docs/superpowers/plans/2026-06-02-proteus-game-web-restructure.md.

| Old path / import | New path / import |
| --- | --- |
| proteus/grid/game.py | proteus/game/engine/grid.py |
| proteus/grid/scenario.py | proteus/game/scenarios/base.py |
| proteus/grid/scenarios/<x>.py | proteus/game/scenarios/<x>.py |
| proteus/grid/difficulty.py | proteus/game/engine/difficulty.py |
| proteus/grid/ascii_view.py | proteus/game/engine/ascii_view.py |
| import proteus.grid (registry side-effect) | import proteus.game.scenarios |
| proteus/arc_grid/ / proteus.arc_grid.* | proteus/game/engine/ / proteus.game.engine.* (vendored engine at proteus/game/engine/arcengine/) |
| proteus/runtime/metrics.py | proteus/game/metrics/metrics.py |
| proteus/runtime/persona.py | proteus/game/metrics/persona.py |
| proteus/runtime/rollout.py | proteus/game/metrics/rollout.py |
| proteus/runtime/aggregate.py | proteus/game/metrics/aggregate.py |
| proteus/runtime/<other>.py (session, _session_core, interactive, spectate, trace, io, memory, memory_gen, memory_policies) | proteus/game/runtime/<x>.py |
| proteus/agents/<x>.py | proteus/game/agents/<x>.py |
| proteus/viz/<x>.py / proteus.viz | proteus/game/viz/<x>.py |
| proteus/web/server.py | proteus/web/local/server.py |
| proteus/web/static/index.html | proteus/web/local/static/index.html |
| import proteus.web.server | import proteus.web.local.server |
| python -m proteus.web | python -m proteus.web.local (old command still works via a shim) |
| proteus/cli.py | proteus/cli/ package (cli/__init__.py, cli/parser.py, cli/commands/{run,play,memory,replay,compare,list_scenarios}.py) |
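
When updating old imports by hand, the module-level rows of the mapping above can be applied mechanically. The sketch below is a hypothetical helper, not something that exists in the repo: the MOVES dict is transcribed from the table, and migrate() is an assumed name that rewrites a dotted module path using the longest matching old prefix.

```python
# Old dotted prefix -> new dotted prefix, transcribed from the table above.
# MOVES and migrate() are illustrative names, not part of proteus.
MOVES = {
    "proteus.grid.game": "proteus.game.engine.grid",
    "proteus.grid.scenario": "proteus.game.scenarios.base",
    "proteus.grid.scenarios": "proteus.game.scenarios",
    "proteus.grid.difficulty": "proteus.game.engine.difficulty",
    "proteus.grid.ascii_view": "proteus.game.engine.ascii_view",
    "proteus.grid": "proteus.game.scenarios",
    "proteus.arc_grid": "proteus.game.engine",
    "proteus.runtime.metrics": "proteus.game.metrics.metrics",
    "proteus.runtime.persona": "proteus.game.metrics.persona",
    "proteus.runtime.rollout": "proteus.game.metrics.rollout",
    "proteus.runtime.aggregate": "proteus.game.metrics.aggregate",
    "proteus.runtime": "proteus.game.runtime",
    "proteus.agents": "proteus.game.agents",
    "proteus.viz": "proteus.game.viz",
    "proteus.web.server": "proteus.web.local.server",
}


def migrate(module: str) -> str:
    """Rewrite an old dotted module path using the longest matching prefix."""
    for old in sorted(MOVES, key=len, reverse=True):
        if module == old or module.startswith(old + "."):
            return MOVES[old] + module[len(old):]
    return module  # already on the new layout (e.g. proteus.providers.*)


print(migrate("proteus.grid.scenarios.predator_evade"))
print(migrate("proteus.runtime.session"))
```

Trying the longest prefix first matters because proteus.runtime.metrics and proteus.runtime map to different packages.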

Done (CP0–CP3)

  • CP0: proteus package skeleton + pyproject.toml (hatchling, pydantic/numpy/pyyaml, pytest config); arc_grid vendored byte-for-byte into proteus/game/engine/arcengine/ (verified diff -r empty; firewall holds: zero squid_game/proteus cross-imports inside it).
  • CP1: grid family ported with squid_game.* → proteus.* renames only: difficulty, scenario (ABC + registry), game (MotiveGridGame), ascii_view, scenarios/predator_evade. Registry self-populates on import proteus.game.scenarios. Acceptance gate locked: deterministic EASY handover (focal (3,3) / predator (5,3)) and the diagnostic invariant optimal "up" ≠ habit "left".
  • CP2: slim Agent ABC + frozen ActResult; extract_action parser; VanillaAgent (act + optional probe, reasoning via parse_thinking_tags); providers base + thinking_utils (pure ports) + new FakeProvider (scripted, offline).
  • CP3: lean pydantic TurnTrace/SessionTrace (outcome is Literal["survived","eliminated"]); compute_metrics; SessionRunner orchestrator (Cut replay → handover → N-turn play → scored SessionTrace). Full offline end-to-end with JSON(L) round-trip.

Done (CP4: real-provider CLI; plan: docs/superpowers/plans/2026-06-01-proteus-arena-cli.md)

  • Provider factory (proteus/providers/factory.py): make_provider("name:model") + available_providers(). Lazy string registry (module_path, class_name): importing proteus.providers or building a FakeProvider never imports an SDK (offline invariant holds). fake:<model> → offline FakeProvider; partition(":") so model ids with : (ollama) survive.
  • Cloud providers ported (openai / anthropic_provider / gemini / ollama_cloud) by import-rename ONLY (squid_game.* → proteus.*); verified by file-text tests (SDKs not installed offline). Class names match the registry. local/mlx/cuda providers not ported (deferred: they need a local server / Apple-silicon / CUDA; off the CP4 "one cheap model" gate).
  • JSONL trace I/O (proteus/game/runtime/io.py): append_trace / read_traces (one SessionTrace per line; creates parent dirs; skips blank lines).
  • CLI (proteus/cli/ package, proteus/__main__.py): run / list-scenarios / replay. run validates --model (unknown → stderr + exit 2) AND --scenario (unknown → stderr + exit 2, before any session runs); replay prints per-turn action vs motive/habit + metrics. Entry point: python -m proteus.
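
The lazy string registry and the partition(":") split described above can be sketched as follows. This is a minimal illustration, not the contents of proteus/providers/factory.py: split_spec and resolve_class are assumed names, and the registry entries point at stdlib classes purely as stand-ins so the sketch runs without any provider SDK.

```python
import importlib

# name -> (module_path, class_name). The targets are stdlib stand-ins
# (assumption); a real registry would name provider modules and classes.
_REGISTRY = {
    "fake": ("collections", "OrderedDict"),
    "ollama": ("fractions", "Fraction"),
}


def split_spec(spec: str) -> tuple[str, str]:
    """Split 'name:model' on the FIRST colon so model ids may contain ':'."""
    name, _, model = spec.partition(":")
    return name, model


def available_providers() -> list[str]:
    return sorted(_REGISTRY)


def resolve_class(spec: str):
    """Import the provider module only when it is actually requested."""
    name, model = split_spec(spec)
    if name not in _REGISTRY:
        raise ValueError(f"unknown provider {name!r}; available: {available_providers()}")
    module_path, class_name = _REGISTRY[name]
    module = importlib.import_module(module_path)  # deferred import happens here
    return getattr(module, class_name), model


print(split_spec("ollama:gpt-oss:120b-cloud"))
```

Because the registry stores strings rather than classes, importing the module that holds it costs nothing; the SDK import is paid only by the caller who asks for that provider.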

Real-model smoke (CP4 acceptance gate: passed)

  • Model: ollama:gpt-oss:120b-cloud (Ollama Cloud, native /api/chat). seed 42 / easy / play-turns 10 / probe on.
  • Outcome: survived | motive_reading_accuracy=70.0% | reactivity_index=75.0% (survival_fraction 100%, first_divergence_turn 3). Real reasoning captured (3.9k–13.5k chars/turn). Confirms the full chain CLI → agent → OllamaCloudProvider → ollama.com and the lazy-imported port.
  • How it was run (repeatable): .venv is kept SDK-free. A throwaway venv at /tmp/proteus-smoke-venv (only pydantic numpy pyyaml httpx; ollama needs just httpx) is run with PYTHONPATH=<repo> /tmp/proteus-smoke-venv/bin/python -m proteus run --model ollama:gpt-oss:120b-cloud .... runs/ and .env are gitignored (trace + key never committed).
  • Key gotcha (carry forward): the .env OLLAMA_API_KEY value has a leading space + trailing extra tokens; the real key is the first whitespace-delimited token (57 chars). Passing the whole value → 401. The smoke extracts it with ... | awk '{print $1}'. Cleaning the .env line (key only) would remove the need for that. OLLAMA_API_KEY2 exists but is not authorized for use.
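
For anyone scripting the smoke in Python rather than shell, the awk '{print $1}' step in the gotcha above has a direct equivalent. A minimal sketch; first_token is a hypothetical helper name and the sample value is made up, not the real key.

```python
def first_token(raw_value: str) -> str:
    """Return the first whitespace-delimited token, like awk '{print $1}'."""
    tokens = raw_value.split()
    if not tokens:
        raise ValueError("key value is empty")
    return tokens[0]


# Made-up value with the same shape as the gotcha: leading space + trailing tokens.
raw = " abc123 trailing junk"
print(first_token(raw))
```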

Done (CP4.5: per-turn LLM-response accounting; no separate plan, executed in-session)

Motivation: the smoke surfaced that token accounting and raw text were dropped (thinking_tokens=0 every turn, no raw_text). Fixed via TDD 3-gate, all ADDITIVE except the probe return-type change:

  • Act path (1c18fa9): ActResult gains input_tokens/output_tokens/thinking_tokens; VanillaAgent.act() propagates them from CompletionResult with thinking_tokens = result.thinking_tokens or <inline-think parser count> (provider count preferred). TurnTrace gains raw_text/input_tokens/output_tokens (its thinking_tokens field is now populated); SessionRunner.run() wires them.
  • Probe path (f95852b): new frozen ProbeResult(answer, reasoning, raw_text, input/output/thinking tokens); Agent.probe returns it (was str); TurnTrace gains probe_reasoning/probe_raw_text/probe_{input,output,thinking}_tokens; SessionRunner wires them (probe_a = probe.answer). Probe reasoning falls back to "" (not the answer: a probe answer is not its own reasoning). A regression test proves probe-side and act-side token fields are not cross-wired (probe=3, act=5).
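
The act-path rule "provider count preferred, inline-think parser count as fallback" can be illustrated with a pair of frozen dataclasses. This is a sketch under assumptions: the two classes below are simplified stand-ins that carry only the fields named above, and to_act_result is a hypothetical helper rather than the body of VanillaAgent.act().

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletionResult:
    """Stand-in for the provider result (real class has more fields)."""
    text: str
    input_tokens: int = 0
    output_tokens: int = 0
    thinking_tokens: int = 0


@dataclass(frozen=True)
class ActResult:
    """Stand-in carrying only the accounting fields added in CP4.5."""
    action: str
    input_tokens: int
    output_tokens: int
    thinking_tokens: int


def to_act_result(action: str, result: CompletionResult, inline_think_count: int) -> ActResult:
    return ActResult(
        action=action,
        input_tokens=result.input_tokens,
        output_tokens=result.output_tokens,
        # A provider-reported count wins; 0 falls through to the parser count.
        thinking_tokens=result.thinking_tokens or inline_think_count,
    )


print(to_act_result("up", CompletionResult("...", 10, 5, 0), inline_think_count=3))
```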

Done (CP5: human play + trace visualization; plan: docs/superpowers/plans/2026-06-01-proteus-arena-viz-humanplay.md)

Executed task-by-task via subagent-driven-development (3-gate: implement → spec review → code-quality review). All offline/headless; no network, no display, no provider SDK added.

  • HumanAgent (proteus/game/agents/human.py, Tasks 1–2): implements the Agent ABC with I/O injection (input_fn/output_fn, resolved from builtins at construction so tests drive it headlessly). act parses canonical actions + WASD shortcuts (w/a/s/d), is case/whitespace-insensitive, re-prompts on invalid input; probe returns a typed ProbeResult; name="human". Routed through the unmodified SessionRunner, so a human trace is schema- AND value-identical to an LLM trace under the same actions (only model differs). Locked by tests/runtime/test_human_comparability.py (Task 3, spec §10).
  • proteus/game/viz/ (Tasks 4–6): reconstruct.py rebuilds pixel frames by deterministically replaying the trace through the real engine (scenario, seed, difficulty + recorded actions), with double self-verify (every Cut frame's ASCII == stored cut_frames; sprite positions before each turn == stored focal_pos/predator_pos) → raises TraceReconstructionError on any divergence. terminal.py renders truecolor (24-bit ANSI) block grids + a side panel (action/motive/habit/reward/tokens + reasoning excerpt, from the CP4.5 enrichment). png.py writes per-frame frame_NNN.png via matplotlib Agg (imported lazily inside write_pngs).
  • CLI (Tasks 8–9): play subcommand (human via stdin; probe OFF by default, opt-in --probe; optional --out) + replay --visual/--png DIR/--fps (text replay stays the default, CP4 non-destructive; viz imported lazily only when a visual flag is set).
  • Offline import invariant made real (Task 7, user-approved root-cause fix): the top-level try: import matplotlib in proteus/game/engine/arcengine/rendering.py was moved lazily inside render_frames, so importing the engine / proteus.game.viz no longer pulls matplotlib in ANY ordering. tests/viz/test_import_safety.py guards it. (This intentionally diverges from the untracked top-level arc_grid/ vendored copy; that copy is not on the import path.)
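
The HumanAgent input handling (injected I/O, WASD shortcuts, case/whitespace-insensitive parsing, re-prompt on invalid input) follows a pattern that can be sketched in a few lines. The function name read_action, the prompt text, and the exact w/a/s/d → direction mapping are assumptions for illustration; only the behaviours listed above come from the record.

```python
from typing import Callable, Sequence

# Assumed shortcut mapping; the record only says w/a/s/d are accepted.
WASD = {"w": "up", "a": "left", "s": "down", "d": "right"}


def read_action(
    valid_actions: Sequence[str],
    input_fn: Callable[[str], str] = input,
    output_fn: Callable[[str], None] = print,
) -> str:
    """Prompt until the (normalised) input is a valid action."""
    while True:
        raw = input_fn("action> ").strip().lower()
        action = WASD.get(raw, raw)  # expand a shortcut, else use the text as-is
        if action in valid_actions:
            return action
        output_fn(f"invalid action {raw!r}; choose one of {list(valid_actions)}")


# Headless use: a scripted input_fn replaces stdin, a list collects the output.
scripted = iter(["jump", "  W "])
messages = []
print(read_action(["up", "down", "left", "right", "stay"], lambda _: next(scripted), messages.append))
```

Injecting input_fn/output_fn is what lets a test drive the agent without a terminal, which is the property the comparability test relies on.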

CP5 acceptance demo (passed)

  • Full suite: .venv/bin/python -m pytest -q → 103 passed, offline, no display.
  • Comparability: human (runs/cp5_human.jsonl) and fake-LLM (runs/cp5_llm.jsonl) at seed 42 / play-turns 6, both committing the same action (CLI fake: emits ACTION: stay, so the human was driven with stay to match) → identical cut_frames, per-turn action, and motive_action answer keys (['up','up']); they differ only in model (human vs demo). NOTE: the plan's demo drove the human with up vs a stay-playing fake LLM, which necessarily diverges (different actions → different world), so the demo was corrected to use a matching action; the comparability invariant itself (identical actions → identical trace) is unchanged and unit-tested.
  • Visual: replay --visual --fps 0 renders the truecolor grid + side panel; replay --png runs/cp5_frames wrote 5 non-zero frame_*.png (cut_length 2 + 1 initial + 2 play turns).

Done (CP6: difficulty layouts + category eval + human-baseline harness; plan: docs/superpowers/plans/2026-06-02-proteus-cp6-difficulty-eval-baseline.md; design: docs/superpowers/specs/2026-06-02-proteus-cp6-difficulty-eval-baseline-design.md)

Executed task-by-task via subagent-driven-development (3-gate: implement → spec review → code-quality review). All offline/headless; no network, no display, no provider SDK added. Full suite: 127 passed (103 CP5 baseline + 24 CP6 additions).

  • Section A: difficulty layouts (Tasks 1–3). Scenario.build_level(rng, difficulty) now takes the band; record_focal_move/safety_distance promoted to concrete defaults on the Scenario ABC (resolves a CP5 deferred item). predator_evade dispatches a hand-authored per-difficulty layout table (_Layout dataclass + _LAYOUTS): EASY 8×8 (byte-for-byte unchanged), MEDIUM 10×10, HARD 12×12 (L-trap), EXPERT 12×12 (forked corridor). build_level sets the instance attr self.grid_size so the camera/within_bounds size per-band. Golden test tests/grid/test_difficulty_layouts.py is the oracle: per band it locks determinism + the handover + the diagnostic invariant optimal_action != habit_action with habit == "left" blocked. Candidate coordinates satisfied the invariant with no tuning (all four bands: optimal=up ≠ habit=left). rules_text de-hard-coded ("8x8 grid" → "a grid") so it no longer lies on the larger bands.
  • Section B: category-specific eval (Tasks 4–6). Per-turn reward ownership moved into the scenario. New abstract Scenario.step_reward(game, action, blocked, focal_before, predator_before); SessionRunner._reward + its 5 constants deleted (delegates now). predator_evade survival reward = BFS-distance gain measured against the predator's pre-move cell (away → positive float(d_after - d_before), toward → negative, wall-hit → -3.0, terminal -50/+50), isolating the agent's own move from the chase. New proteus/game/metrics/rollout.py optimal_rollout(...) (engine-replay, same pattern as game/viz/reconstruct, imports game.scenarios only). metrics.py gains four additive keys: away_move_fraction (survival headline), mean_step_reward, trajectory_agreement (vs the optimal rollout, denominator = all played turns), final_distance_gap. Existing four metrics retained; test_human_comparability's set(h.metrics)==set(v.metrics) still holds.
  • Section C: human-baseline harness (Tasks 7–8). Pure proteus/game/metrics/aggregate.py aggregate_traces(traces) → {(model, difficulty): {"n", "metrics": means}} (no I/O, union-of-keys robust). New CLI proteus compare <trace.jsonl>… [--out summary.json] (rc 2 not-found / rc 1 empty / rc 0, mirrors replay; JSON keys stringified "model|difficulty").
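
The Section B survival reward (BFS-distance gain measured against the predator's pre-move cell) can be sketched on a bare grid. This is an illustration under assumptions, not the predator_evade implementation: the function names and argument lists are hypothetical, the terminal -50/+50 cases are left out, and the grid is modelled as a size plus a set of wall cells.

```python
from collections import deque


def bfs_distance(walls, size, start, goal):
    """Shortest 4-connected path length between two cells, or None if unreachable."""
    if start == goal:
        return 0
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        (x, y), dist = queue.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (x + dx, y + dy)
            if nxt in seen or nxt in walls:
                continue
            if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
                continue
            if nxt == goal:
                return dist + 1
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None


def step_reward(walls, size, focal_before, focal_after, predator_before, blocked):
    """Distance gain vs the predator's PRE-move cell; wall hits cost a flat -3.0."""
    if blocked:
        return -3.0
    d_before = bfs_distance(walls, size, focal_before, predator_before)
    d_after = bfs_distance(walls, size, focal_after, predator_before)
    if d_before is None or d_after is None:
        raise ValueError("predator cell is unreachable in this sketch")
    return float(d_after - d_before)


# EASY handover positions from CP1: focal (3,3), predator (5,3), open 8x8 grid.
print(step_reward(set(), 8, (3, 3), (2, 3), (5, 3), blocked=False))
```

Measuring both distances against the same pre-move predator cell is what isolates the agent's own move: the predator's subsequent step cannot change the sign of the reward.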

CP6 acceptance demo (passed)

  • Full suite: .venv/bin/python -m pytest -q → 127 passed, offline, no display.
  • Per-difficulty smoke: proteus run … --difficulty {easy,medium,hard,expert} --model fake:demo --seed 42 --play-turns 6 --no-probe then proteus compare … → four model=demo difficulty=<band> n=1 groups, each with the 8 metric means (incl. away_move_fraction/trajectory_agreement/final_distance_gap), summary written to runs/cp6_summary.json. Values differ by band (e.g. EXPERT trajectory_agreement=66.67, survival_fraction=50.0 vs others 50.0/33.33), confirming the layouts/metrics are genuinely band-sensitive.
  • Invariant acceptance: all four bands print optimal=up habit=left with distinct grids (8,8)/(10,10)/(12,12)/(12,12); optimal != habit everywhere.
  • Note: the fake:demo provider plays a fixed non-optimal action (so motive_reading_accuracy=0); the smoke validates the deterministic pipeline + metric surfacing, not agent skill.
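
The per-(model, difficulty) groups that proteus compare reports in the smoke above come from a pure aggregation step. The sketch below shows one way such a step could work; it is not aggregate.py itself. Traces are modelled as plain dicts (the real ones are pydantic models), and "union-of-keys robust" is interpreted here as averaging each metric over only the traces that report it, which is an assumption.

```python
from collections import defaultdict


def aggregate(traces):
    """Group by (model, difficulty) and average every metric that appears."""
    groups = defaultdict(list)
    for trace in traces:
        groups[(trace["model"], trace["difficulty"])].append(trace["metrics"])

    summary = {}
    for key, metric_dicts in groups.items():
        names = set().union(*metric_dicts)  # union of metric keys in the group
        means = {}
        for name in sorted(names):
            values = [m[name] for m in metric_dicts if name in m]
            means[name] = sum(values) / len(values)
        summary[key] = {"n": len(metric_dicts), "metrics": means}
    return summary


def stringify_keys(summary):
    """JSON cannot hold tuple keys, so join them as 'model|difficulty'."""
    return {f"{model}|{difficulty}": value for (model, difficulty), value in summary.items()}


traces = [
    {"model": "human", "difficulty": "easy", "metrics": {"a": 100.0, "b": 50.0}},
    {"model": "human", "difficulty": "easy", "metrics": {"a": 50.0}},
    {"model": "demo", "difficulty": "hard", "metrics": {"a": 0.0}},
]
print(stringify_keys(aggregate(traces)))
```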

Done (Web interactive play: color-grid browser arena; plan: docs/superpowers/plans/2026-06-02-proteus-web-interactive-play.md; design: docs/superpowers/specs/2026-06-02-proteus-web-interactive-play-design.md)

Executed task-by-task via subagent-driven-development (implement → spec review → code-quality review per task). The web slice alone was 149 passed (127 CP6 baseline + 22 new web/runtime tests); after merging with the concurrent CP7/CP8 work the combined suite is 193 passed. Merge reconciliation: the SessionRunner → _session_core delegation refactor was kept as the basis and CP7 memory + CP8 persona were threaded through the shared helpers, so both SessionRunner and InteractiveSession (and thus web play) now emit CP8-rich traces from one source of truth.

  • Stdlib-only color-grid web play: launch with python -m proteus.web.local (--host/--port, defaults to 127.0.0.1:8000; python -m proteus.web still works via a shim). A single static index.html (vanilla JS, no build, no CDN) renders the engine's integer grids in color via proteus.game.engine.rendering.COLOR_MAP. Zero new dependencies; the .venv stays SDK-free/offline.
  • Shared session core: extracted SessionRunner's per-session mechanics into proteus/game/runtime/_session_core.py (pure refactor, no behavior change; the existing tests stayed green at the same count). Both SessionRunner and the new InteractiveSession build their traces from these same helpers.
  • InteractiveSession (proteus/game/runtime/interactive.py): a threadless, stepwise driver (state()/step()/finish()) that advances one turn per HTTP request. It is pinned to SessionRunner(HumanAgent) by a golden equivalence test (tests/runtime/test_interactive_equivalence.py) asserting full model_dump() equality for the same action sequence, so the two paths cannot drift.
  • Fairness split (carried from CP5): the live state exposes only the color grid + available actions (no reward/optimal/habit), exactly what the LLM sees; the full per-turn disclosure (your action vs optimal vs habit, reward, 8 metrics) appears in review only once the game is over. Asserted in both the InteractiveSession and server tests.
  • Always-saved trace: every finished game appends a runs/web_<scenario>_<difficulty>.jsonl line (override via the /finish {out} body) that is schema-identical to LLM/CLI traces. Verified end-to-end: a web-produced trace is accepted by both proteus replay and proteus compare (shows up as a model=human baseline group).
  • HTTP API (stdlib http.server): GET / (page), GET /config, POST /session, GET /session/{id}, POST /session/{id}/act, POST /session/{id}/finish; a pure socket-free handle_request(...) router (unit-tested without a socket) + in-memory registry; structured errors (400/404/409/500); offline import-safety guarded (tests/web/test_import_safety.py).
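
A "pure socket-free router plus in-memory registry" can be sketched as a function from (method, path, body) to (status, payload). Everything below is illustrative: route_request, the SESSIONS dict, the payload fields, and the mapping of conditions to 400/404/409 are assumptions, and the real handle_request(...) signature is not shown in this record.

```python
# In-memory registry: session id -> minimal state (stand-in for a real session).
SESSIONS = {}


def route_request(method, path, body=None):
    """Return (status, payload) for a request without touching a socket."""
    parts = [p for p in path.split("/") if p]

    if method == "POST" and parts == ["session"]:
        session_id = str(len(SESSIONS) + 1)
        SESSIONS[session_id] = {"turn": 0, "done": False}
        return 200, {"id": session_id}

    if len(parts) >= 2 and parts[0] == "session":
        session = SESSIONS.get(parts[1])
        if session is None:
            return 404, {"error": "unknown session"}
        if method == "GET" and len(parts) == 2:
            return 200, dict(session)
        if method == "POST" and parts[2:] == ["act"]:
            if session["done"]:
                return 409, {"error": "session already finished"}
            if not body or "action" not in body:
                return 400, {"error": "missing action"}
            session["turn"] += 1  # one turn advanced per request
            return 200, dict(session)
        if method == "POST" and parts[2:] == ["finish"]:
            session["done"] = True
            return 200, dict(session)

    return 404, {"error": "not found"}


status, created = route_request("POST", "/session")
print(status, route_request("POST", f"/session/{created['id']}/act", {"action": "up"}))
```

Keeping the router a plain function is what allows it to be unit-tested without opening a port; the http.server handler then only has to decode the request and encode the returned pair.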

Done (Web LLM spectating: watch an LLM play in color; plan: docs/superpowers/plans/2026-06-02-proteus-web-spectate.md; design: docs/superpowers/specs/2026-06-02-proteus-web-spectate-design.md)

Executed task-by-task via subagent-driven-development (TDD, commit-per-task). Full suite: 202 passed (193 baseline + 9 new spectate tests), offline, no display, no provider SDK added. A person can now watch an LLM play the arena in the browser: the color grid auto-advances turn by turn while a live panel shows the model's reasoning plus the per-turn optimal/habit/reward (rich disclosure, since a spectator is not the scored subject), with pause / step controls.

  • SpectateSession (proteus/game/runtime/spectate.py): a threadless, agent-driven sibling of InteractiveSession. state()/advance()/finish() over the shared _session_core; the server (not a human) supplies each action by calling agent.act(...) once per HTTP request. It reads default_memory via getattr(built, "default_memory", None) (forward-compatible with pack_evade) and passes memory= to build_observation. It is pinned to SessionRunner(VanillaAgent(same FakeProvider)) by a golden equivalence test (tests/runtime/test_spectate_equivalence.py) asserting full model_dump() equality, so the spectate path cannot drift from the runner path.
  • HTTP API: POST /spectate, POST /spectate/{id}/next, POST /spectate/{id}/finish, GET /spectate/{id}; /config now also returns providers + default_model (fake:demo). Providers/agents are lazy-imported inside the create handler (available_providers locally in _config_payload), so import proteus.web.local.server still pulls no provider SDK; the offline import-safety guard is extended and stays green. Structured errors: unknown model → 400, advance-after-done → 409, provider failure → 500 (no stack leak).
  • Frontend (proteus/web/local/static/index.html): a Play/Spectate mode toggle + model input; in spectate mode the human pad/keys go inert, a live analysis panel renders the model's reasoning + a per-turn table (turn/action/optimal/habit/reward/=opt?), and an auto-play loop drives /next with Pause/Resume and Step. On end it shows SURVIVED/ELIMINATED and the saved trace path.
  • Trace: spectated runs save a proteus compare-compatible LLM-baseline trace (runs/web_spectate_*.jsonl by default) accepted by proteus replay / proteus compare as a model=<provider model> group (verified e2e with fake:demo).
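
The golden equivalence idea (a stepwise, one-turn-per-request driver pinned to the batch runner) works because both paths build each turn record through the same helper. The toy below shows only that pattern; make_turn, run_batch, and StepwiseSession are hypothetical stand-ins with a trivial movement model, not the proteus classes.

```python
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0), "stay": (0, 0)}


def make_turn(turn, position, action):
    """Shared per-turn helper: both drivers build their records here."""
    dx, dy = MOVES[action]
    return {"turn": turn, "action": action, "position": (position[0] + dx, position[1] + dy)}


def run_batch(policy, start, turns):
    """Batch driver: plays all turns in one call."""
    position, trace = start, []
    for turn in range(1, turns + 1):
        record = make_turn(turn, position, policy(position))
        trace.append(record)
        position = record["position"]
    return trace


class StepwiseSession:
    """Threadless driver: one advance() per external request."""

    def __init__(self, policy, start):
        self._policy, self._position, self._trace = policy, start, []

    def advance(self):
        record = make_turn(len(self._trace) + 1, self._position, self._policy(self._position))
        self._trace.append(record)
        self._position = record["position"]
        return record

    def finish(self):
        return list(self._trace)


def policy(position):
    return "up" if position[1] > 0 else "stay"


session = StepwiseSession(policy, (3, 2))
for _ in range(4):
    session.advance()
# The golden check: identical inputs must yield identical traces on both paths.
assert session.finish() == run_batch(policy, (3, 2), 4)
```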

How to run

  • Environment: python is not on PATH; a gitignored venv lives at .venv (Python 3.12, pydantic v2 / numpy / pyyaml / pytest + matplotlib; the latter is now required by the viz/png tests). Full suite: .venv/bin/python -m pytest -q → 148 passed, no network, no display. Provider SDKs are NOT installed in .venv (keeps the offline invariant structural); matplotlib runs Agg-only (headless), so it does not break the offline/no-display invariant. Recreate .venv with: uv venv --python 3.12 .venv && uv pip install --python .venv/bin/python "pydantic>=2" "numpy>=1.26" "pyyaml>=6" "pytest>=8" "matplotlib>=3.8"
  • Human play (offline): printf 'up\nup\n...' | .venv/bin/python -m proteus play --scenario predator_evade --seed 42 --play-turns 6 --out runs/h.jsonl (live view is the same ASCII the LLM sees).
  • Visual replay: .venv/bin/python -m proteus replay runs/h.jsonl --visual --fps 0 (truecolor terminal) or --png runs/frames (per-frame PNGs). Plain replay runs/h.jsonl stays text-only.
  • Web interactive play (offline, stdlib-only): .venv/bin/python -m proteus.web.local (--host/--port, defaults 127.0.0.1:8000), then open the printed http://127.0.0.1:8000/ and play the color grid in the browser; the finished game appends a schema-identical trace to runs/web_<scenario>_<difficulty>.jsonl (replayable/comparable with proteus replay/compare). Zero new dependencies.
  • Offline CLI smoke: .venv/bin/python -m proteus run --scenario predator_evade --model fake:demo --seed 42 --play-turns 5 --out runs/x.jsonl && .venv/bin/python -m proteus replay runs/x.jsonl
  • Difficulty bands: add --difficulty {easy,medium,hard,expert} to run/play (EASY is the default, byte-for-byte the CP0–CP5 world). Human-baseline compare: collect matched human (play --out) and LLM (run --out) traces, then .venv/bin/python -m proteus compare h.jsonl llm.jsonl --out summary.json → per-(model, difficulty) metric means + n (the model token is the part AFTER the colon, e.g. fake:demo → demo).
  • Memory pre-roll (CP7): generate a checkpoint, then reuse it. .venv/bin/python -m proteus memory --scenario predator_evade --model fake:demo --seed 42 --memory-turns 8 writes runs/memory/<model>/<stamp>.json; proteus run … --memory latest (or --memory generate to do both in one shot, --memory <path> for a specific file, default none) injects it at the handover. runs/memory/ is gitignored. Korean usage guide: docs/USAGE-ko.md.
  • Real-model smoke: see the throwaway-venv recipe above (needs a provider SDK + key; never in .venv).

Current state

A scored SessionTrace is produced offline (FakeProvider) AND against real cloud models via the CLI; each turn persists full act + probe accounting to runs/*.jsonl. Single scenario only: predator_evade (survival motive). Human play and trace visualization now exist (CP5): proteus play runs a human through the same SessionRunner (comparable trace), and proteus replay --visual/--png renders any trace as truecolor terminal frames or PNGs by deterministically replaying the engine. proteus replay with no flags still prints TEXT (CP4 behavior preserved). proteus/game/engine/arcengine/rendering.py is now actively reused (palette + ANSI + RGB helpers) and its matplotlib import is lazy.

CP6 adds: four genuinely distinct hand-authored difficulty layouts (EASY unchanged), scenario-owned category-specific reward (step_reward; survival = reward for moving away from the predator), an optimal-rollout trajectory metric (game/metrics/rollout.py) + four new metric keys, and a proteus compare human-baseline aggregation CLI. Still a single scenario (predator_evade, survival motive); the step_reward/safety_distance ABC surface is now open for CP7 to plug new motive categories in without touching SessionRunner.

Done (CP7: LLM-generated memory pre-roll; design: docs/superpowers/specs/2026-06-02-proteus-cp7-memory-preroll-design.md; plan: docs/superpowers/plans/2026-06-02-proteus-cp7-memory-preroll.md)

Executed inline (executing-plans, TDD, commit-per-task). Full suite: 152 passed (127 baseline + 25 CP7), offline/headless; no provider SDK added to .venv. The original "CP7 = new motive category" idea was renamed to CP8 when the user pivoted to this memory feature; curiosity/sociality stay deferred.

  • Stage 1: game/runtime/memory.py (pure pydantic + stdlib, no intra-runtime imports, like trace.py): MemoryTurn/MemoryCheckpoint models; save_checkpoint/load_checkpoint (single-file JSON at runs/memory/<safe(model)>/<stamp>.json); latest_for_model (lexical stamp = chronological); render_memory_block (pure observation renderer). Exported from game/runtime/__init__.
  • Stage 2: game/runtime/memory_gen.py (engine-coupled, like game/metrics/rollout.py): generate_memory(scenario, agent, *, difficulty, seed, memory_turns, model_name, clock) runs a full-info self-play episode (max_steps=memory_turns, no scripted Cut, no answer-key leakage), capturing frames/actions/reasoning → MemoryCheckpoint. Deterministic under an injected clock. New scenario-sourced Scenario.memory_brief (concrete attr default ""); predator_evade sets a transparent brief (discloses the BFS chase). This resolves the memory-prompt half of the old "source prompts from Scenario" item; the _PROBE_QUESTION/_ACTION_DIRECTIVE predator-framing is still hard-coded (carried to CP8).
  • Stage 3: hybrid handover (trace.py + session.py): additive SessionTrace.memory_ref: str | None; SessionRunner(..., memory=None, memory_ref=None) prepends the memory block (+ the NOW separator that introduces this run so far) to the turn-1 observation only, before the unchanged scripted Cut. step_reward, answer keys, metrics, and _session_core.py untouched → diagnostic invariant + the 127 baseline tests preserved (pinned by test_session_memory.py: with/without memory the per-turn motive_action/habit_action/is_diagnostic and metrics are identical; only the turn-1 observation grows).
  • Stage 4: CLI (proteus/cli/ package): proteus memory (generate + save a checkpoint) and proteus run --memory MODE where MODE ∈ {none (default), generate, latest, <path>} via a _resolve_memory helper. --memory-turns (default 10) and --memory-root (default runs/memory) on both. Exit codes mirror run (unknown model/scenario → 2; latest with no checkpoint → 2).
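
The "lexical stamp = chronological" property behind latest_for_model holds whenever stamps are fixed-width and most-significant-first, so a plain string sort picks the newest checkpoint. The sketch below assumes such a stamp format and a simple sanitizer; safe_name and this latest_for_model signature are illustrative, not the memory.py code.

```python
from pathlib import Path


def safe_name(model: str) -> str:
    """Hypothetical sanitizer: keep path-friendly characters, replace the rest with '_'."""
    return "".join(c if c.isalnum() or c in "-._" else "_" for c in model)


def latest_for_model(root, model):
    """Newest checkpoint for a model, or None; relies on sortable stamp filenames."""
    model_dir = Path(root) / safe_name(model)
    if not model_dir.is_dir():
        return None
    stamps = sorted(p.name for p in model_dir.glob("*.json"))
    return model_dir / stamps[-1] if stamps else None


print(safe_name("gpt-oss:120b-cloud"))
```

With this sanitizer, gpt-oss:120b-cloud maps to the gpt-oss_120b-cloud directory name seen in the real-model smoke; whether the real helper treats other characters the same way is not confirmed here.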

CP7 acceptance smoke (passed)

  • Offline end-to-end: proteus memory … fake:demo → proteus run … --memory latest → the turn-1 observation carries the MEMORY block + NOW separator + scripted Cut 1/2: in order; memory_ref set.
  • Ollama (networked, manual): ollama:gpt-oss:120b-cloud, seed 42 / easy. proteus memory --memory-turns 6 → the real model self-played 6 turns and survived (vs the fake stay agent which is eliminated in 2), checkpoint written to runs/memory/gpt-oss_120b-cloud/<stamp>.json. Then run --play-turns 8 --memory latest → survived | motive_reading_accuracy=50% | reactivity_index=40%; turn-1 observation = 962 chars containing the 6-turn memory block + the scripted Cut; memory_ref="gpt-oss:120b-cloud@<stamp>". .venv stayed SDK-free (throwaway /tmp/proteus-smoke-venv with httpx; key = first whitespace token of the .env value, CP4 gotcha). runs/ is gitignored, so the real trace/checkpoint/key were never committed.

Done (CP8: predator agentness eval + auto-GIF; Stages 0/1/2/4 implemented this session)

Design: docs/superpowers/specs/2026-06-02-proteus-predator-agentness-eval-design.md (user-authored). Plan: docs/superpowers/plans/2026-06-02-proteus-cp8-agentness-eval.md. TDD-ready Stages 0, 1, 2, 4 are complete (additive, 152 → 175 tests, full suite green). Stages 3 & 5 remain DESIGN-GATED (below).

Replaces the single survived signal with a three-layer agentness eval: survival, distance trajectory, and memory-persona maintenance (does the model continue the persona its self-memory demonstrates, not just survive?). proteus replay/compare surface every new metric automatically.

  • Stage 0: auto-GIF ✅ proteus/game/viz/gif.py:write_gif (Pillow lazy) + run/play auto-render <out>.gif next to the trace (default on, --no-gif). Verified: 5-frame 256×256 animated GIF.
  • Stage 1: metric-only ✅ per-turn post_focal_pos/post_predator_pos/pre_bfs_distance/post_bfs_distance/agent_distance_delta on TurnTrace; scenario max_bfs_distance/agent_distance_delta (predator_evade impls; ABC None defaults); episode metrics time_to_capture/distance_auc/min_distance/near_capture_count; episode turn_order (focal_then_predator)/capture_rule (same_cell)/horizon. away_move_fraction kept (already pre-predator-based via step_reward). Mirrored in _session_core.make_turn_trace (web path).
  • Stage 2: persona reference ✅ game/metrics/persona.py: hidden PersonaWeights + R_w (reward_rw) + reference_actions (argmax set, ties → all) + pressure; built-ins risk_averse/risk_seeking/survival_optimal. Per-turn reference_actions/reference_reward/model_reward/reward_regret/pressure + episode persona_weight_id; metrics action_agreement/reward_regret/pressure_weighted_agreement/persona_drift_turn (present only when a persona ran, so additive). proteus run --persona <id>; weights never serialized / never in the prompt (verified no leak).
  • Stage 4: hidden-weight memory ✅ generate_memory(persona=) → deterministic persona demonstration (reference policy plays, agent may be None); MemoryCheckpoint.persona_weight_id (public id only). proteus memory --persona <id>; run --memory <ckpt> --persona <id> scores whether the model continues the demonstrated persona. Weight-agnostic transparent brief (no leakage).
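
The Stage 2 "argmax set, ties → all" rule and the metrics derived from it can be illustrated on plain dicts. This is a sketch under assumptions: the per-action rewards are given directly rather than computed from hidden persona weights, and the definitions of action_agreement (percent of turns whose action lies in the reference set) and reward_regret (best reward minus chosen reward) are plausible readings, not confirmed formulas from persona.py.

```python
def reference_actions(rewards, tol=1e-9):
    """All actions whose reward ties the maximum (ties keep every tied action)."""
    best = max(rewards.values())
    return {action for action, reward in rewards.items() if best - reward <= tol}


def reward_regret(rewards, action):
    """How much reward the chosen action gives up vs the best available one."""
    return max(rewards.values()) - rewards[action]


def action_agreement(chosen, references):
    """Percent of turns where the chosen action is in that turn's reference set."""
    hits = sum(1 for action, ref in zip(chosen, references) if action in ref)
    return 100.0 * hits / len(chosen)


turn_rewards = {"up": 1.0, "left": 1.0, "stay": -2.0}
print(sorted(reference_actions(turn_rewards)), reward_regret(turn_rewards, "stay"))
```

Returning a set rather than a single action means a model is not penalised for picking either of two equally good moves.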

Still TODO: DESIGN-GATED (each needs superpowers:brainstorming + sub-spec BEFORE coding; do not fabricate from the plan):

  • Stage 3: simultaneous resolver (DESIGN-GATED). plan → resolve engine turn + crossing capture; changes capture goldens → own brainstorm + sub-spec; keep focal_then_predator selectable via a --turn-order flag (default unchanged). Adds per-turn same_cell_capture/crossing_capture/captured flags + refines time_to_capture to the real captured turn. See plan Stage 3 for the 4 open decisions.
  • Stage 5: multi-feature personas (DESIGN-GATED). greed/compliance/cooperation need new scenario features (resources/norms/social); this is where the previously-deferred "new motive category / curiosity" work lands, and it needs its own spec. Predator-only must NOT over-interpret those personas (spec §9). PersonaWeights already stubs resource_reward/norm_cost/social_weight for this.

The VanillaAgent._ACTION_DIRECTIVE / SessionRunner._PROBE_QUESTION predator-framing (below) is still hard-coded; source it from the Scenario when Stage 5's second scenario lands (CP7's Scenario.memory_brief is the pattern). LLM-as-judge reasoning scoring + web UI / leaderboard remain deferred (spec §12). Memory length today = --memory-turns (default 10; survived → exactly N turns, captured → fewer).

Done (pack_evade scenario: 64×64 open-field multi-cell evasion; plan: docs/superpowers/plans/2026-06-02-proteus-pack-evade-scenario.md; design: docs/superpowers/specs/2026-06-02-proteus-pack-evade-scenario-design.md)

Executed task-by-task with strict TDD in an isolated worktree branched from CP7/CP8 master. Full suite: 206 passed (193 baseline + 13 new: 7 scenario behaviour + 2 footprint bounds + 2 render_frame + 2 manual-memory). Offline, headless; zero new dependencies.

  • New pack_evade scenario (proteus/game/scenarios/pack_evade.py, registered alongside the untouched predator_evade): a 64ร—64 open field, no walls; multi-cell sprites (predator 5ร—5, focal 3ร—3); eat = footprint AABB overlap (not adjacency); center-to-center Manhattan geometry โ€” no BFS / no O(Nยฒ) on 64ร—64 (analytic max distance = (64-1)+(64-1) = 126). Pure evasion: habit_action == optimal_action always, so no diagnostic (is_diagnostic is False).
  • Engine stays single-focal. The 3-prey pack hunt (two caught) lives only in the hand-authored handover memory; the live engine gained one additive change: footprint-aware focal bounds in MotiveGridGame.step() (_footprint_in_bounds) so a multi-cell focal keeps its full footprint on-grid. The 1ร—1 path is byte-identical (regression-guarded) โ€” predator_evade unchanged.
  • Scenario.render_frame(game) hook (game/scenarios/base.py default = full ASCII; _session_core.render_ascii delegates). pack_evade overrides it with a compact one-line coordinate observation (no 4096-char map); predator_evade's observation/cut-frames are byte-identical to before.
  • Scenario.default_memory(seed, difficulty) hook (default None) + wiring: BuiltSession.default_memory is computed in build_session; session.py/interactive.py resolve self._memory or built.default_memory, so an explicit memory= still overrides. pack_evade.default_memory generates the handover memory from a hidden persona weight vector (get_persona("risk_averse")) via the existing CP7/CP8 memory_gen.generate_memory(..., persona=...) in its reference-policy mode: the hidden policy actually plays a pack_evade episode and that self-play trajectory is the memory (per the agentness design, docs/agentness_game_design_from_paper.md §3-4). Only the public persona_weight_id is recorded; the raw weights never leak. Two compat shims (_is_free=in-bounds, _bfs_distance=Manhattan, since the field is wall-free) let the shared persona policy run on the open field; generate_memory now renders frames via scenario.render_frame (compact on pack_evade, ASCII-identical on predator_evade). The CP7 LLM self-play memory_gen path is otherwise untouched.
  • Trace schema-/value-compatible with the existing tooling: verified end-to-end with proteus run --scenario pack_evade --model fake:demo → proteus replay (per-turn action vs optimal, habit==optimal, survival metrics) → proteus compare (model=demo difficulty=easy n=1).
  • Real-LLM smoke (passed): ollama:gpt-oss:120b-cloud on pack_evade, covering both the CLI run (4 turns, survived) and the web Spectate path (/spectate→/next×N, live ~3–6.6k-char reasoning per turn, survived). The model reads the compact 64×64 observation and the injected persona memory and produces valid actions. Run with the SDK-free .venv untouched, via the throwaway /tmp/proteus-smoke-venv (httpx only) + PYTHONPATH=<repo> + cleaned OLLAMA_API_KEY (first whitespace token of the .env value; see the CP4 smoke gotcha). --no-gif since the smoke venv has no matplotlib.
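
The geometry rules in the list above (footprint-aware bounds, AABB-overlap capture, center-to-center Manhattan distance) can be illustrated with a minimal sketch. This is not the code in pack_evade.py or MotiveGridGame: the `Footprint` class and the three helper names are hypothetical, and the sketch assumes square sprites with odd side lengths (5 and 3, as above) anchored at their top-left cell.

```python
from dataclasses import dataclass

GRID = 64  # side length of the open field, per the scenario description
MAX_DISTANCE = (GRID - 1) + (GRID - 1)  # analytic bound quoted above: 126


@dataclass(frozen=True)
class Footprint:
    """Hypothetical square sprite: top-left cell (x, y) plus side length."""
    x: int
    y: int
    size: int

    def center(self) -> tuple[int, int]:
        # Exact for odd side lengths (5x5 predator, 3x3 focal).
        return (self.x + self.size // 2, self.y + self.size // 2)


def footprint_in_bounds(fp: Footprint, grid: int = GRID) -> bool:
    """True when every cell of the footprint lies on the grid."""
    return fp.x >= 0 and fp.y >= 0 and fp.x + fp.size <= grid and fp.y + fp.size <= grid


def overlaps(a: Footprint, b: Footprint) -> bool:
    """Axis-aligned bounding-box overlap: the boxes share at least one cell."""
    return (a.x < b.x + b.size and b.x < a.x + a.size
            and a.y < b.y + b.size and b.y < a.y + a.size)


def center_manhattan(a: Footprint, b: Footprint) -> int:
    """Center-to-center Manhattan distance: O(1), no BFS needed on a wall-free field."""
    (ax, ay), (bx, by) = a.center(), b.center()
    return abs(ax - bx) + abs(ay - by)


predator = Footprint(10, 10, 5)
touching = Footprint(15, 10, 3)    # edge-adjacent to the predator, shares no cell
caught = Footprint(14, 14, 3)      # shares the corner cell (14, 14)
```

In this sketch `overlaps(predator, touching)` is false while `overlaps(predator, caught)` is true, which is the "overlap, not adjacency" distinction; a 1×1 footprint reduces `footprint_in_bounds` to the ordinary single-cell bounds check.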

Merge note for master: this branch changed five shared files: proteus/game/scenarios/base.py (two additive ABC defaults), proteus/game/engine/grid.py (footprint bounds), proteus/game/runtime/_session_core.py (render_ascii delegate + BuiltSession.default_memory), proteus/game/runtime/session.py and proteus/game/runtime/interactive.py (effective-memory fallback). All changes are additive/back-compatible; reconcile with any concurrent CP7/CP8 edits to those files at merge time.
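
As a rough illustration of why these changes are back-compatible, the sketch below shows the two patterns named in the merge note: non-abstract defaults added to an ABC, and an "explicit value, else scenario default" memory fallback. It is a toy model, not the real base.py or session.py: the class bodies, the `Memory` type, the `name` method, and the `effective_memory` helper are assumptions; only the hook names `render_frame` and `default_memory(seed, difficulty)` and the `explicit or default` resolution come from the notes above.

```python
from abc import ABC, abstractmethod
from typing import Optional, Sequence

Memory = Sequence[str]  # placeholder; the real memory structure is not shown here


class Scenario(ABC):
    @abstractmethod
    def name(self) -> str:
        """Stand-in for whatever the real ABC requires."""

    # Additive, non-abstract defaults: subclasses written before these
    # hooks existed keep working without any change.
    def render_frame(self, game: object) -> str:
        return f"<full ASCII of {game!r}>"

    def default_memory(self, seed: int, difficulty: str) -> Optional[Memory]:
        return None


class LegacyScenario(Scenario):
    """Overrides nothing new, like predator_evade."""

    def name(self) -> str:
        return "legacy"


class CompactScenario(Scenario):
    """Overrides both hooks, like pack_evade."""

    def name(self) -> str:
        return "compact"

    def render_frame(self, game: object) -> str:
        return "focal=(3,4) predator=(40,41)"

    def default_memory(self, seed: int, difficulty: str) -> Optional[Memory]:
        return [f"seed={seed} difficulty={difficulty}"]


def effective_memory(explicit: Optional[Memory], scenario: Scenario,
                     seed: int, difficulty: str) -> Optional[Memory]:
    # Mirrors `self._memory or built.default_memory`: an explicit memory wins.
    # Note that `or` also treats an empty explicit memory as "not provided".
    return explicit or scenario.default_memory(seed, difficulty)
```

With this shape, a caller that passes `memory=` sees no behaviour change, and a scenario that defines neither hook behaves exactly as before.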

Deferred items (carry forward)

  • local/mlx/cuda providers un-ported (need local server / Apple-silicon / CUDA; out of CP4 scope).
  • VanillaAgent._ACTION_DIRECTIVE and SessionRunner._PROBE_QUESTION hard-code the "predator" framing, which is fine for the single-scenario slice; source these from the Scenario when CP6+ adds scenarios.
  • SessionRunner._provider_model_name reaches into agent._provider by naming convention; consider an explicit model_name on the Agent base when more agent types land. (CP5's HumanAgent has no _provider, so model correctly falls back to agent.name == "human".)
  • reasoning/probe are recorded but not scored (LLM-as-judge deferred per spec §12).
  • Scenario.record_focal_move is not on the ABC. RESOLVED (CP6 Task 1): promoted to a non-abstract no-op default on Scenario (alongside safety_distance), so multi-scenario callers never hit AttributeError.
  • CLI --png without the viz extra raises a raw ModuleNotFoundError traceback instead of a clean rc=2 (every other expected CLI failure returns a structured code). Not reachable today (matplotlib is in .venv and no packaging extras are defined yet). Add a try/except ImportError guard + companion test when packaging/extras land. Same pattern for surfacing TraceReconstructionError from replay on a corrupt trace, and a --fps-is-terminal-only help note.
  • viz.reconstruct terminal flag with empty trace.turns: if a trace had zero played turns but a non-empty outcome, the final (cut) frame would show no terminal annotation. Not reachable today (SessionRunner guards an empty play phase with a RuntimeError); add a guard/comment if the trace format ever permits empty-turns-with-outcome.
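
The `--png` item above asks for an ImportError guard that returns a structured code instead of a traceback. A minimal sketch of that pattern follows; the function name `render_png`, its parameters, and the error wording are hypothetical, and the real CLI wiring is not shown. `ModuleNotFoundError` is a subclass of `ImportError`, so one `except ImportError` covers both.

```python
import importlib
import sys

EXIT_USAGE = 2  # the "clean rc=2" the item above refers to


def render_png(trace_path: str, out_path: str, module_name: str = "matplotlib") -> int:
    """Return 0 on success, or EXIT_USAGE when the optional viz dependency is missing.

    `module_name` is a parameter only so the guard can be exercised in a test
    without uninstalling anything.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        print(
            f"error: --png needs the optional viz dependency ({module_name}); "
            "install it or drop --png",
            file=sys.stderr,
        )
        return EXIT_USAGE
    # Real rendering of trace_path into out_path would happen here.
    return 0
```

A companion test can then pass a module name that cannot exist and assert on the return code, without touching the environment.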

Open questions

  • Package/project README rewrite (replace the deprecated persona-arena depreated/README.MD).
  • Trajectory metric: first_divergence_turn is a coarse proxy. RESOLVED (CP6): added trajectory_agreement (position step-agreement vs the optimal rollout) + final_distance_gap (spec §7 "궤적 일치", i.e. trajectory agreement). first_divergence_turn is kept additively; open whether to retire it once the new pair proves sufficient (CP6 design §8). Also open (CP6 design §5.2): exact reward grading (delta vs ±1) is pinned by the property test; revisit if a future category needs a different shape.
  • If a 3rd LLM-call result type appears (e.g. a reflect call), extract a shared _LLMCallResult base for the common reasoning/raw_text/{input,output,thinking}_tokens fields (ActResult/ProbeResult currently duplicate them, which is acceptable for two).
  • Probe-vs-act-reasoning redundancy (deferred from CP5, spec-change territory): both act and probe capture reasoning per turn; whether the per-turn probe is worth its cost vs. reusing act reasoning is a measurement-design question for a spec revision, out of CP5 scope.
  • Optional live-color human play: CP5 keeps the human's live view as plain ASCII (fairness: the human reads exactly what the LLM reads). A truecolor live mode for human play (distinct from post-hoc replay --visual) is possible but intentionally deferred to preserve baseline fairness.