PROTEUS: Handoff
Living handoff record. Updated at each checkpoint close. SSOT trio:
`docs/superpowers/specs/2026-06-01-proteus-arena-slice-design.md` (immutable design) + the per-CP plan in `docs/superpowers/plans/` + this file.
Last updated: 2026-06-02. Two parallel slices shipped: pack_evade scenario (64×64 open-field
multi-cell evasion + persona weight-vector handover memory) and web LLM spectating (watch an LLM
play the color grid with live reasoning/analysis + pause/step; SpectateSession pinned to
SessionRunner(VanillaAgent) by a golden test; providers lazy-imported so offline import-safety holds).
Both built in isolated worktrees and merged clean. Before them: CP7/CP8 (LLM-generated memory pre-roll +
persona agentness eval) and Web interactive color-grid play; offline, headless. CP6 before that
(difficulty layouts + category eval + compare).
Restructure (2026-06-02): game/web layout
The `proteus/` package was reorganised into a cleaner layout with zero logic change; 253 tests remain green. The top-level sub-packages are now: `game/` (engine, scenarios, agents, runtime, metrics, viz), `web/` (local server + arena skeleton), `shared/` (utilities), `cli/` (parser + per-command modules), and the unchanged `providers/`. Docs: design/SSOT at `docs/superpowers/specs/2026-06-02-proteus-game-web-restructure-design.md`; implementation plan at `docs/superpowers/plans/2026-06-02-proteus-game-web-restructure.md`.
| Old path / import | New path / import |
|---|---|
| `proteus/grid/game.py` | `proteus/game/engine/grid.py` |
| `proteus/grid/scenario.py` | `proteus/game/scenarios/base.py` |
| `proteus/grid/scenarios/<x>.py` | `proteus/game/scenarios/<x>.py` |
| `proteus/grid/difficulty.py` | `proteus/game/engine/difficulty.py` |
| `proteus/grid/ascii_view.py` | `proteus/game/engine/ascii_view.py` |
| `import proteus.grid` (registry side-effect) | `import proteus.game.scenarios` |
| `proteus/arc_grid/` / `proteus.arc_grid.*` | `proteus/game/engine/` / `proteus.game.engine.*` (vendored engine at `proteus/game/engine/arcengine/`) |
| `proteus/runtime/metrics.py` | `proteus/game/metrics/metrics.py` |
| `proteus/runtime/persona.py` | `proteus/game/metrics/persona.py` |
| `proteus/runtime/rollout.py` | `proteus/game/metrics/rollout.py` |
| `proteus/runtime/aggregate.py` | `proteus/game/metrics/aggregate.py` |
| `proteus/runtime/<other>.py` (session, _session_core, interactive, spectate, trace, io, memory, memory_gen, memory_policies) | `proteus/game/runtime/<x>.py` |
| `proteus/agents/<x>.py` | `proteus/game/agents/<x>.py` |
| `proteus/viz/<x>.py` / `proteus.viz` | `proteus/game/viz/<x>.py` |
| `proteus/web/server.py` | `proteus/web/local/server.py` |
| `proteus/web/static/index.html` | `proteus/web/local/static/index.html` |
| `import proteus.web.server` | `import proteus.web.local.server` |
| `python -m proteus.web` | `python -m proteus.web.local` (old command still works via a shim) |
| `proteus/cli.py` | `proteus/cli/` package (`cli/__init__.py`, `cli/parser.py`, `cli/commands/{run,play,memory,replay,compare,list_scenarios}.py`) |
Done (CP0–CP3)
- CP0: `proteus` package skeleton + `pyproject.toml` (hatchling, pydantic/numpy/pyyaml, pytest config); `arc_grid` vendored byte-for-byte into `proteus/game/engine/arcengine/` (verified `diff -r` empty; firewall holds, with zero `squid_game`/`proteus` cross-imports inside it).
- CP1: `grid` family ported with `squid_game.* → proteus.*` renames only: `difficulty`, `scenario` (ABC + registry), `game` (`MotiveGridGame`), `ascii_view`, `scenarios/predator_evade`. Registry self-populates on `import proteus.game.scenarios`. Acceptance gate locked: deterministic EASY handover (focal (3,3) / predator (5,3)) and the diagnostic invariant `optimal "up" ≠ habit "left"`.
- CP2: slim `Agent` ABC + frozen `ActResult`; `extract_action` parser; `VanillaAgent` (act + optional probe, reasoning via `parse_thinking_tags`); providers `base` + `thinking_utils` (pure ports) + new `FakeProvider` (scripted, offline).
- CP3: lean pydantic `TurnTrace`/`SessionTrace` (`outcome` is `Literal["survived","eliminated"]`); `compute_metrics`; `SessionRunner` orchestrator (Cut replay → handover → N-turn play → scored `SessionTrace`). Full offline end-to-end with JSON(L) round-trip.
Done (CP4: real-provider CLI; plan: `docs/superpowers/plans/2026-06-01-proteus-arena-cli.md`)
- Provider factory (`proteus/providers/factory.py`): `make_provider("name:model")` + `available_providers()`. Lazy string registry `(module_path, class_name)`: importing `proteus.providers` or building a `FakeProvider` never imports an SDK (offline invariant holds). `fake:<model>` → offline `FakeProvider`; the spec is split with `partition(":")` so model ids containing `:` (ollama) survive.
- Cloud providers ported (openai / anthropic_provider / gemini / ollama_cloud) by import-rename ONLY (`squid_game.* → proteus.*`); verified by file-text tests (SDKs not installed offline). Class names match the registry. local/mlx/cuda providers not ported (deferred: they need a local server / Apple-silicon / CUDA and are off the CP4 "one cheap model" gate).
- JSONL trace I/O (`proteus/game/runtime/io.py`): `append_trace` / `read_traces` (one `SessionTrace` per line; creates parent dirs; skips blank lines).
- CLI (`proteus/cli/` package, `proteus/__main__.py`): `run` / `list-scenarios` / `replay`. `run` validates `--model` (unknown → stderr + exit 2) AND `--scenario` (unknown → stderr + exit 2, before any session runs); `replay` prints per-turn action vs motive/habit + metrics. Entry point: `python -m proteus`.
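The first-colon split and the lazy registry described for the provider factory can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of `proteus/providers/factory.py`: `split_spec`, `resolve_class`, and the `_REGISTRY` entry are hypothetical names, and a stdlib class stands in for a real provider.

```python
import importlib

# Hypothetical lazy registry: provider name -> (module_path, class_name).
# Only strings are stored, so building this table imports nothing.
# The stdlib entry below is a stand-in used purely for illustration.
_REGISTRY = {
    "demo": ("collections", "OrderedDict"),
}


def split_spec(spec: str) -> tuple[str, str]:
    """Split 'provider:model' on the FIRST colon only."""
    name, sep, model = spec.partition(":")
    if not sep or not name or not model:
        raise ValueError(f"expected 'provider:model', got {spec!r}")
    return name, model


def resolve_class(name: str):
    """Import the provider module only at the moment it is requested."""
    module_path, class_name = _REGISTRY[name]
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```

Because `partition` stops at the first separator, a spec such as `ollama:gpt-oss:120b-cloud` keeps the colon inside the model id.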
Real-model smoke (CP4 acceptance gate: passed)
- Model: `ollama:gpt-oss:120b-cloud` (Ollama Cloud, native `/api/chat`). seed 42 / easy / play-turns 10 / probe on.
- Outcome: `survived | motive_reading_accuracy=70.0% | reactivity_index=75.0%` (survival_fraction 100%, first_divergence_turn 3). Real reasoning captured (3.9k–13.5k chars/turn). Confirms the full chain CLI → agent → OllamaCloudProvider → ollama.com and the lazy-imported port.
- How it was run (repeatable): `.venv` is kept SDK-free. A throwaway venv at `/tmp/proteus-smoke-venv` (only `pydantic numpy pyyaml httpx`; ollama needs just `httpx`) is run with `PYTHONPATH=<repo> /tmp/proteus-smoke-venv/bin/python -m proteus run --model ollama:gpt-oss:120b-cloud ...`. `runs/` and `.env` are gitignored (trace + key never committed).
- Key gotcha (carry forward): the `.env` `OLLAMA_API_KEY` value has a leading space + trailing extra tokens; the real key is the first whitespace-delimited token (57 chars). Passing the whole value → 401. The smoke extracts it with `... | awk '{print $1}'`. Cleaning the `.env` line (key only) would remove the need for that. `OLLAMA_API_KEY2` exists but is not authorized for use.
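For reference, the same first-token extraction that `awk '{print $1}'` performs can be written in Python. This is only a sketch; `first_token` is a hypothetical helper, not something the repository is known to contain.

```python
def first_token(raw_value: str) -> str:
    """Return the first whitespace-delimited token of a raw .env value."""
    tokens = raw_value.split()  # split() with no args ignores leading/trailing whitespace
    if not tokens:
        raise ValueError("value contains no token")
    return tokens[0]
```

Applied to a value like `" abc123 trailing junk"`, it yields `abc123`, which is the part that should be sent as the API key.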
Done (CP4.5: per-turn LLM-response accounting; no separate plan, executed in-session)
Motivation: the smoke surfaced that token accounting and raw text were dropped (thinking_tokens=0
every turn, no raw_text). Fixed via TDD 3-gate, all ADDITIVE except the probe return-type change:
- Act path (`1c18fa9`): `ActResult` gains `input_tokens`/`output_tokens`/`thinking_tokens`; `VanillaAgent.act()` propagates them from `CompletionResult` with `thinking_tokens = result.thinking_tokens or <inline-think parser count>` (provider count preferred). `TurnTrace` gains `raw_text`/`input_tokens`/`output_tokens` (its `thinking_tokens` field is now populated); `SessionRunner.run()` wires them.
- Probe path (`f95852b`): new frozen `ProbeResult` (answer, reasoning, raw_text, input/output/thinking tokens); `Agent.probe` returns it (was `str`); `TurnTrace` gains `probe_reasoning`/`probe_raw_text`/`probe_{input,output,thinking}_tokens`; `SessionRunner` wires them (`probe_a = probe.answer`). Probe `reasoning` falls back to `""` (not the answer, since a probe answer is not its own reasoning). A regression test proves probe-side and act-side token fields are not cross-wired (probe=3, act=5).
Done (CP5: human play + trace visualization; plan: `docs/superpowers/plans/2026-06-01-proteus-arena-viz-humanplay.md`)
Executed task-by-task via subagent-driven-development (3-gate: implement → spec review → code-quality review). All offline/headless; no network, no display, no provider SDK added.
- `HumanAgent` (`proteus/game/agents/human.py`, Tasks 1–2): implements the `Agent` ABC with I/O injection (`input_fn`/`output_fn`, resolved from `builtins` at construction so tests drive it headlessly). `act` parses canonical actions + WASD shortcuts (`w`/`a`/`s`/`d`), is case/whitespace-insensitive, and re-prompts on invalid input; `probe` returns a typed `ProbeResult`; `name="human"`. Routed through the unmodified `SessionRunner`, so a human trace is schema- AND value-identical to an LLM trace under the same actions (only `model` differs). Locked by `tests/runtime/test_human_comparability.py` (Task 3, spec §10).
- `proteus/game/viz/` (Tasks 4–6): `reconstruct.py` rebuilds pixel frames by deterministically replaying the trace through the real engine (`scenario, seed, difficulty` + recorded actions), with double self-verify (every Cut frame's ASCII == stored `cut_frames`; sprite positions before each turn == stored `focal_pos`/`predator_pos`); it raises `TraceReconstructionError` on any divergence. `terminal.py` renders truecolor (24-bit ANSI) block grids + a side panel (action/motive/habit/reward/tokens + reasoning excerpt, from the CP4.5 enrichment). `png.py` writes per-frame `frame_NNN.png` via matplotlib Agg (imported lazily inside `write_pngs`).
- CLI (Tasks 8–9): `play` subcommand (human via stdin; probe OFF by default, opt-in `--probe`; optional `--out`) + `replay --visual` / `--png DIR` / `--fps` (text replay stays the default, CP4 non-destructive; viz imported lazily only when a visual flag is set).
- Offline import invariant made real (Task 7, user-approved root-cause fix): the top-level `try: import matplotlib` in `proteus/game/engine/arcengine/rendering.py` was moved lazily inside `render_frames`, so importing the engine / `proteus.game.viz` no longer pulls matplotlib in ANY ordering. `tests/viz/test_import_safety.py` guards it. (This intentionally diverges from the untracked top-level `arc_grid/` vendored copy; that copy is not on the import path.)
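The `HumanAgent` input handling described above (shortcut mapping, case/whitespace tolerance, re-prompting, injected I/O) can be sketched like this. It is an assumption-laden illustration rather than the code in `proteus/game/agents/human.py`: the full action set, the `w/a/s/d` direction mapping, and the names `parse_action` and `prompt_action` are all hypothetical.

```python
from typing import Callable, Optional

# Assumed action vocabulary and shortcut mapping (not confirmed by the source).
CANONICAL = ("up", "down", "left", "right", "stay")
SHORTCUTS = {"w": "up", "a": "left", "s": "down", "d": "right"}


def parse_action(text: str) -> Optional[str]:
    """Normalise case/whitespace, expand shortcuts, reject anything else."""
    token = text.strip().lower()
    token = SHORTCUTS.get(token, token)
    return token if token in CANONICAL else None


def prompt_action(
    input_fn: Callable[[str], str],
    output_fn: Callable[[str], None],
) -> str:
    """Keep asking until a valid action arrives; I/O is injected for headless tests."""
    while True:
        action = parse_action(input_fn("action> "))
        if action is not None:
            return action
        output_fn("invalid action, try again")
```

Injecting `input_fn` and `output_fn` is what lets a test feed a scripted sequence of answers without touching a real terminal.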
CP5 acceptance demo (passed)
- Full suite: `.venv/bin/python -m pytest -q` → 103 passed, offline, no display.
- Comparability: human (`runs/cp5_human.jsonl`) and fake-LLM (`runs/cp5_llm.jsonl`) at seed 42 / play-turns 6, both committing the same action (CLI `fake:` emits `ACTION: stay`, so the human was driven with `stay` to match) → identical `cut_frames`, per-turn `action`, and `motive_action` answer keys (`['up','up']`); they differ only in `model` (`human` vs `demo`). NOTE: the plan's demo drove the human with `up` vs a `stay`-playing fake LLM, which necessarily diverges (different actions → different world), so the demo was corrected to use a matching action; the comparability invariant itself (identical actions → identical trace) is unchanged and unit-tested.
- Visual: `replay --visual --fps 0` renders the truecolor grid + side panel; `replay --png runs/cp5_frames` wrote 5 non-zero `frame_*.png` (cut_length 2 + 1 initial + 2 play turns).
Done (CP6: difficulty layouts + category eval + human-baseline harness; plan: `docs/superpowers/plans/2026-06-02-proteus-cp6-difficulty-eval-baseline.md`; design: `docs/superpowers/specs/2026-06-02-proteus-cp6-difficulty-eval-baseline-design.md`)
Executed task-by-task via subagent-driven-development (3-gate: implement → spec review → code-quality review). All offline/headless; no network, no display, no provider SDK added. Full suite: 127 passed (103 CP5 baseline + 24 CP6 additions).
- Section A, difficulty layouts (Tasks 1–3): `Scenario.build_level(rng, difficulty)` now takes the band; `record_focal_move`/`safety_distance` promoted to concrete defaults on the `Scenario` ABC (resolves a CP5 deferred item). `predator_evade` dispatches a hand-authored per-difficulty layout table (`_Layout` dataclass + `_LAYOUTS`): EASY 8×8 (byte-for-byte unchanged), MEDIUM 10×10, HARD 12×12 (L-trap), EXPERT 12×12 (forked corridor). `build_level` sets the instance attr `self.grid_size` so the camera/`within_bounds` size per-band. Golden test `tests/grid/test_difficulty_layouts.py` is the oracle: per band it locks determinism + the handover + the diagnostic invariant `optimal_action != habit_action` with `habit == "left"` blocked. Candidate coordinates satisfied the invariant with no tuning (all four bands: optimal=up ≠ habit=left). `rules_text` de-hard-coded ("8x8 grid" → "a grid") so it no longer lies on the larger bands.
- Section B, category-specific eval (Tasks 4–6): per-turn reward ownership moved into the scenario. New abstract `Scenario.step_reward(game, action, blocked, focal_before, predator_before)`; `SessionRunner._reward` + its 5 constants deleted (delegates now). predator_evade survival reward = BFS-distance gain measured against the predator's pre-move cell (away → positive `float(d_after − d_before)`, toward → negative, wall-hit → −3.0, terminal −50/+50), isolating the agent's own move from the chase. New `proteus/game/metrics/rollout.py` `optimal_rollout(...)` (engine-replay, same pattern as `game/viz/reconstruct`, imports `game.scenarios` only). `metrics.py` gains four additive keys: `away_move_fraction` (survival headline), `mean_step_reward`, `trajectory_agreement` (vs the optimal rollout, denominator = all played turns), `final_distance_gap`. Existing four metrics retained; `test_human_comparability`'s `set(h.metrics)==set(v.metrics)` still holds.
- Section C, human-baseline harness (Tasks 7–8): pure `proteus/game/metrics/aggregate.py` `aggregate_traces(traces) → {(model, difficulty): {"n", "metrics": means}}` (no I/O, union-of-keys robust). New CLI `proteus compare <trace.jsonl>… [--out summary.json]` (rc 2 not-found / rc 1 empty / rc 0, mirrors `replay`; JSON keys stringified `"model|difficulty"`).
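The grouping that `aggregate_traces` is described as performing (per `(model, difficulty)` means over the union of metric keys, then `"model|difficulty"` string keys for JSON) can be sketched as below. The sketch assumes plain dicts in place of the real pydantic `SessionTrace` objects, and `aggregate` / `to_json_keys` are hypothetical names.

```python
from collections import defaultdict


def aggregate(traces):
    """Group traces by (model, difficulty) and average each metric.

    Each trace is assumed to be a dict with 'model', 'difficulty' and
    'metrics' keys. A metric missing from some traces is averaged over
    the traces that do carry it (union-of-keys behaviour).
    """
    groups = defaultdict(list)
    for trace in traces:
        groups[(trace["model"], trace["difficulty"])].append(trace["metrics"])

    summary = {}
    for key, metric_dicts in groups.items():
        all_keys = set().union(*metric_dicts)
        means = {}
        for name in sorted(all_keys):
            values = [m[name] for m in metric_dicts if m.get(name) is not None]
            if values:
                means[name] = sum(values) / len(values)
        summary[key] = {"n": len(metric_dicts), "metrics": means}
    return summary


def to_json_keys(summary):
    """Tuple keys are not valid JSON object keys, so join them with '|'."""
    return {f"{model}|{difficulty}": value for (model, difficulty), value in summary.items()}
```

Keeping the aggregation free of I/O means the CLI layer only has to read the JSONL files and serialise the result.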
CP6 acceptance demo (passed)
- Full suite: `.venv/bin/python -m pytest -q` → 127 passed, offline, no display.
- Per-difficulty smoke: `proteus run … --difficulty {easy,medium,hard,expert} --model fake:demo --seed 42 --play-turns 6 --no-probe` then `proteus compare …` → four `model=demo difficulty=<band> n=1` groups, each with the 8 metric means (incl. `away_move_fraction`/`trajectory_agreement`/`final_distance_gap`), summary written to `runs/cp6_summary.json`. Values differ by band (e.g. EXPERT `trajectory_agreement=66.67`, `survival_fraction=50.0` vs others `50.0`/`33.33`), confirming the layouts/metrics are genuinely band-sensitive.
- Invariant acceptance: all four bands print `optimal=up habit=left` with distinct grids `(8,8)/(10,10)/(12,12)/(12,12)`; `optimal != habit` everywhere.
- Note: the `fake:demo` provider plays a fixed non-optimal action (so `motive_reading_accuracy=0`); the smoke validates the deterministic pipeline + metric surfacing, not agent skill.
Done (Web interactive play: color-grid browser arena; plan: `docs/superpowers/plans/2026-06-02-proteus-web-interactive-play.md`; design: `docs/superpowers/specs/2026-06-02-proteus-web-interactive-play-design.md`)
Executed task-by-task via subagent-driven-development (implement → spec review → code-quality review per task). The web slice alone was 149 passed (127 CP6 baseline + 22 new web/runtime tests); after merging with the concurrent CP7/CP8 work the combined suite is 193 passed. Merge reconciliation: the `SessionRunner` → `_session_core` delegation refactor was kept as the basis and CP7 memory + CP8 persona were threaded through the shared helpers, so both `SessionRunner` and `InteractiveSession` (and thus web play) now emit CP8-rich traces from one source of truth.
- Stdlib-only color-grid web play: launch with `python -m proteus.web.local` (`--host`/`--port`, defaults to `127.0.0.1:8000`; `python -m proteus.web` still works via a shim). A single static `index.html` (vanilla JS, no build, no CDN) renders the engine's integer grids in color via `proteus.game.engine.rendering.COLOR_MAP`. Zero new dependencies; the `.venv` stays SDK-free/offline.
- Shared session core: extracted `SessionRunner`'s per-session mechanics into `proteus/game/runtime/_session_core.py` (pure refactor, no behavior change; the existing tests stayed green at the same count). Both `SessionRunner` and the new `InteractiveSession` build their traces from these same helpers.
- `InteractiveSession` (`proteus/game/runtime/interactive.py`): a threadless, stepwise driver (`state()`/`step()`/`finish()`) that advances one turn per HTTP request. It is pinned to `SessionRunner(HumanAgent)` by a golden equivalence test (`tests/runtime/test_interactive_equivalence.py`) asserting full `model_dump()` equality for the same action sequence, so the two paths cannot drift.
- Fairness split (carried from CP5): the live `state` exposes only the color grid + available actions (no reward/optimal/habit), exactly what the LLM sees; the full per-turn disclosure (your action vs optimal vs habit, reward, 8 metrics) appears in `review` only once the game is over. Asserted in both the `InteractiveSession` and server tests.
- Always-saved trace: every finished game appends a `runs/web_<scenario>_<difficulty>.jsonl` line (override via the `/finish` `{out}` body) that is schema-identical to LLM/CLI traces. Verified end-to-end: a web-produced trace is accepted by both `proteus replay` and `proteus compare` (shows up as a `model=human` baseline group).
- HTTP API (stdlib `http.server`): `GET /` (page), `GET /config`, `POST /session`, `GET /session/{id}`, `POST /session/{id}/act`, `POST /session/{id}/finish`; a pure socket-free `handle_request(...)` router (unit-tested without a socket) + in-memory registry; structured errors (400/404/409/500); offline import-safety guarded (`tests/web/test_import_safety.py`).
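The idea of a socket-free router over an in-memory registry can be sketched as a pure function from request data to a `(status, payload)` pair. The signature, the session dict shape, and the choice of 409 for acting on a finished session are assumptions made for illustration; the real `handle_request(...)` in `proteus/web/local/server.py` may differ.

```python
import itertools

_session_ids = itertools.count(1)  # hypothetical id source for the sketch


def handle_request(method, path, body, registry):
    """Route one request without touching a socket; returns (status, payload)."""
    parts = [p for p in path.split("/") if p]

    if method == "POST" and parts == ["session"]:
        session_id = str(next(_session_ids))
        registry[session_id] = {"turn": 0, "done": False}
        return 200, {"id": session_id}

    if len(parts) >= 2 and parts[0] == "session":
        session = registry.get(parts[1])
        if session is None:
            return 404, {"error": "unknown session"}
        if method == "GET" and len(parts) == 2:
            return 200, dict(session)
        if method == "POST" and parts[2:] == ["act"]:
            if session["done"]:
                return 409, {"error": "session already finished"}
            if "action" not in (body or {}):
                return 400, {"error": "missing action"}
            session["turn"] += 1
            return 200, dict(session)
        if method == "POST" and parts[2:] == ["finish"]:
            session["done"] = True
            return 200, dict(session)

    return 404, {"error": "not found"}
```

Since the function takes and returns plain values, a unit test can call it directly, and the `http.server` handler only needs to translate between sockets and this call.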
Done (Web LLM spectating: watch an LLM play in color; plan: `docs/superpowers/plans/2026-06-02-proteus-web-spectate.md`; design: `docs/superpowers/specs/2026-06-02-proteus-web-spectate-design.md`)
Executed task-by-task via subagent-driven-development (TDD, commit-per-task). Full suite: 202 passed (193 baseline + 9 new spectate tests), offline, no display, no provider SDK added. A person can now watch an LLM play the arena in the browser: the color grid auto-advances turn by turn while a live panel shows the model's reasoning plus the per-turn optimal/habit/reward (rich disclosure, since a spectator is not the scored subject), with pause / step controls.
- `SpectateSession` (`proteus/game/runtime/spectate.py`): a threadless, agent-driven sibling of `InteractiveSession`. `state()`/`advance()`/`finish()` over the shared `_session_core`; the server (not a human) supplies each action by calling `agent.act(...)` once per HTTP request. It reads `default_memory` via `getattr(built, "default_memory", None)` (forward-compatible with pack_evade) and passes `memory=` to `build_observation`. It is pinned to `SessionRunner(VanillaAgent(same FakeProvider))` by a golden equivalence test (`tests/runtime/test_spectate_equivalence.py`) asserting full `model_dump()` equality, so the spectate path cannot drift from the runner path.
- HTTP API: `POST /spectate`, `POST /spectate/{id}/next`, `POST /spectate/{id}/finish`, `GET /spectate/{id}`; `/config` now also returns `providers` + `default_model` (`fake:demo`). Providers/agents are lazy-imported inside the create handler (`available_providers` locally in `_config_payload`), so `import proteus.web.local.server` still pulls no provider SDK; the offline import-safety guard is extended and stays green. Structured errors: unknown model → 400, advance-after-done → 409, provider failure → 500 (no stack leak).
- Frontend (`proteus/web/local/static/index.html`): a Play/Spectate mode toggle + model input; in spectate mode the human pad/keys go inert, a live analysis panel renders the model's reasoning + a per-turn table (turn/action/optimal/habit/reward/=opt?), and an auto-play loop drives `/next` with Pause/Resume and Step. On end it shows SURVIVED/ELIMINATED and the saved trace path.
- Trace: spectated runs save a `proteus compare`-compatible LLM-baseline trace (`runs/web_spectate_*.jsonl` by default) accepted by `proteus replay` / `proteus compare` as a `model=<provider model>` group (verified e2e with `fake:demo`).
How to run
- Environment: `python` is not on PATH; a gitignored venv lives at `.venv` (Python 3.12, pydantic v2 / numpy / pyyaml / pytest + matplotlib; the latter is now required by the viz/png tests). Full suite: `.venv/bin/python -m pytest -q` → 148 passed, no network, no display. Provider SDKs are NOT installed in `.venv` (keeps the offline invariant structural); matplotlib runs Agg-only (headless), so it does not break the offline/no-display invariant. Recreate `.venv` with: `uv venv --python 3.12 .venv && uv pip install --python .venv/bin/python "pydantic>=2" "numpy>=1.26" "pyyaml>=6" "pytest>=8" "matplotlib>=3.8"`
- Human play (offline): `printf 'up\nup\n...' | .venv/bin/python -m proteus play --scenario predator_evade --seed 42 --play-turns 6 --out runs/h.jsonl` (live view is the same ASCII the LLM sees).
- Visual replay: `.venv/bin/python -m proteus replay runs/h.jsonl --visual --fps 0` (truecolor terminal) or `--png runs/frames` (per-frame PNGs). Plain `replay runs/h.jsonl` stays text-only.
- Web interactive play (offline, stdlib-only): `.venv/bin/python -m proteus.web.local` (`--host`/`--port`, defaults `127.0.0.1:8000`), then open the printed `http://127.0.0.1:8000/` and play the color grid in the browser; the finished game appends a schema-identical trace to `runs/web_<scenario>_<difficulty>.jsonl` (replayable/comparable with `proteus replay`/`compare`). Zero new dependencies.
- Offline CLI smoke: `.venv/bin/python -m proteus run --scenario predator_evade --model fake:demo --seed 42 --play-turns 5 --out runs/x.jsonl && .venv/bin/python -m proteus replay runs/x.jsonl`
- Difficulty bands: add `--difficulty {easy,medium,hard,expert}` to `run`/`play` (EASY is the default, byte-for-byte the CP0–CP5 world). Human-baseline compare: collect matched human (`play --out`) and LLM (`run --out`) traces, then `.venv/bin/python -m proteus compare h.jsonl llm.jsonl --out summary.json` → per-`(model, difficulty)` metric means + n (the `model` token is the part AFTER the colon, e.g. `fake:demo` → `demo`).
- Memory pre-roll (CP7): generate a checkpoint then reuse it. `.venv/bin/python -m proteus memory --scenario predator_evade --model fake:demo --seed 42 --memory-turns 8` writes `runs/memory/<model>/<stamp>.json`; `proteus run … --memory latest` (or `--memory generate` to do both in one shot, `--memory <path>` for a specific file, default `none`) injects it at the handover. `runs/memory/` is gitignored. Korean usage guide: `docs/USAGE-ko.md`.
- Real-model smoke: see the throwaway-venv recipe above (needs a provider SDK + key; never in `.venv`).
Current state
A scored SessionTrace is produced offline (FakeProvider) AND against real cloud models via the CLI;
each turn persists full act + probe accounting to runs/*.jsonl. Single scenario only: predator_evade
(survival motive). Human play and trace visualization now exist (CP5): proteus play runs a human
through the same SessionRunner (comparable trace), and proteus replay --visual/--png renders any
trace as truecolor terminal frames or PNGs by deterministically replaying the engine. proteus replay
with no flags still prints TEXT (CP4 behavior preserved). proteus/game/engine/arcengine/rendering.py is now actively
reused (palette + ANSI + RGB helpers) and its matplotlib import is lazy.
CP6 adds: four genuinely distinct hand-authored difficulty layouts (EASY unchanged), scenario-owned
category-specific reward (step_reward; survival = reward for moving away from the predator), an
optimal-rollout trajectory metric (game/metrics/rollout.py) + four new metric keys, and a proteus compare
human-baseline aggregation CLI. Still a single scenario (predator_evade, survival motive); the
step_reward/safety_distance ABC surface is now open for CP7 to plug new motive categories in
without touching SessionRunner.
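The survival reward summarised above (BFS-distance gain measured against the predator's pre-move cell, with a flat wall-hit penalty) can be sketched as follows. This is a standalone illustration under assumptions: the function signature is not the real `Scenario.step_reward(game, action, blocked, focal_before, predator_before)`, walls are modelled as a set of `(x, y)` cells, and terminal rewards are left out.

```python
from collections import deque

WALL_HIT_REWARD = -3.0  # wall-hit value quoted in the CP6 notes


def bfs_distance(walls, size, start, goal):
    """Shortest 4-neighbour path length on a size x size grid, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (x, y), dist = queue.popleft()
        for dx, dy in ((0, -1), (0, 1), (-1, 0), (1, 0)):
            nxt = (x + dx, y + dy)
            if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
                continue
            if nxt in walls or nxt in seen:
                continue
            if nxt == goal:
                return dist + 1
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None


def step_reward(walls, size, focal_before, focal_after, predator_before, blocked):
    """Distance gained from the predator's PRE-move cell by the agent's own move."""
    if blocked:
        return WALL_HIT_REWARD
    d_before = bfs_distance(walls, size, focal_before, predator_before)
    d_after = bfs_distance(walls, size, focal_after, predator_before)
    if d_before is None or d_after is None:
        raise ValueError("predator cell is unreachable from the focal cell")
    return float(d_after - d_before)
```

Measuring both distances against the same pre-move predator cell is what isolates the agent's move from the predator's subsequent chase step.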
Done (CP7: LLM-generated memory pre-roll; design: `docs/superpowers/specs/2026-06-02-proteus-cp7-memory-preroll-design.md`; plan: `docs/superpowers/plans/2026-06-02-proteus-cp7-memory-preroll.md`)
Executed inline (executing-plans, TDD, commit-per-task). Full suite: 152 passed (127 baseline + 25
CP7), offline/headless; no provider SDK added to .venv. The original "CP7 = new motive category" idea
was renamed to CP8 when the user pivoted to this memory feature; curiosity/sociality stay deferred.
- Stage 1: `game/runtime/memory.py` (pure pydantic + stdlib, no intra-runtime imports, like `trace.py`): `MemoryTurn`/`MemoryCheckpoint` models; `save_checkpoint`/`load_checkpoint` (single-file JSON at `runs/memory/<safe(model)>/<stamp>.json`); `latest_for_model` (lexical stamp = chronological); `render_memory_block` (pure observation renderer). Exported from `game/runtime/__init__`.
- Stage 2: `game/runtime/memory_gen.py` (engine-coupled, like `game/metrics/rollout.py`): `generate_memory(scenario, agent, *, difficulty, seed, memory_turns, model_name, clock)` runs a full-info self-play episode (`max_steps=memory_turns`, no scripted Cut, no answer-key leakage), capturing frames/actions/reasoning → `MemoryCheckpoint`. Deterministic under an injected `clock`. New scenario-sourced `Scenario.memory_brief` (concrete attr default `""`); `predator_evade` sets a transparent brief (discloses the BFS chase). This resolves the memory-prompt half of the old "source prompts from Scenario" item; the `_PROBE_QUESTION`/`_ACTION_DIRECTIVE` predator-framing is still hard-coded (carried to CP8).
- Stage 3, hybrid handover (`trace.py` + `session.py`): additive `SessionTrace.memory_ref: str | None`; `SessionRunner(..., memory=None, memory_ref=None)` prepends the memory block (plus the `NOW` "this run so far:" separator) to the turn-1 observation only, before the unchanged scripted Cut. `step_reward`, answer keys, metrics, and `_session_core.py` untouched; diagnostic invariant + 127 preserved (pinned by `test_session_memory.py`: with/without memory the per-turn `motive_action`/`habit_action`/`is_diagnostic` and `metrics` are identical; only the turn-1 observation grows).
- Stage 4, CLI (`proteus/cli/` package): `proteus memory` (generate + save a checkpoint) and `proteus run --memory MODE` where `MODE ∈ {none (default), generate, latest, <path>}` via a `_resolve_memory` helper. `--memory-turns` (default 10) and `--memory-root` (default `runs/memory`) on both. Exit codes mirror `run` (unknown model/scenario → 2; `latest` with no checkpoint → 2).
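The "lexical stamp = chronological" lookup can be sketched as below. This is a guess at the shape, not the code in `game/runtime/memory.py`: the sanitising rule in `safe_name` and the zero-padded stamp format are assumptions (the only confirmed data point is that `gpt-oss:120b-cloud` maps to the directory `gpt-oss_120b-cloud`).

```python
import re
from pathlib import Path
from typing import Optional


def safe_name(model: str) -> str:
    """Assumed sanitiser: replace anything outside [A-Za-z0-9._-] with '_'."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", model)


def latest_for_model(root: Path, model: str) -> Optional[Path]:
    """Return the newest checkpoint for a model, or None if there is none.

    Relies on stamps being fixed-width and zero-padded, so that sorting
    file names lexically is the same as sorting them chronologically.
    """
    model_dir = root / safe_name(model)
    if not model_dir.is_dir():
        return None
    checkpoints = sorted(model_dir.glob("*.json"))
    return checkpoints[-1] if checkpoints else None
```

A `None` result corresponds to the `--memory latest` case with no checkpoint, which the CLI reports with exit code 2.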
CP7 acceptance smoke (passed)
- Offline end-to-end: `proteus memory … fake:demo` → `proteus run … --memory latest` → the turn-1 observation carries the `MEMORY` block + `NOW` separator + scripted `Cut 1/2:` in order; `memory_ref` set.
- Ollama (networked, manual): `ollama:gpt-oss:120b-cloud`, seed 42 / easy. `proteus memory --memory-turns 6` → the real model self-played 6 turns and survived (vs the fake `stay` agent which is eliminated in 2), checkpoint written to `runs/memory/gpt-oss_120b-cloud/<stamp>.json`. Then `run --play-turns 8 --memory latest` → survived | motive_reading_accuracy=50% | reactivity_index=40%; turn-1 observation = 962 chars containing the 6-turn memory block + the scripted Cut; `memory_ref="gpt-oss:120b-cloud@<stamp>"`. `.venv` stayed SDK-free (throwaway `/tmp/proteus-smoke-venv` with `httpx`; key = first whitespace token of the `.env` value, CP4 gotcha). `runs/` is gitignored, so the real trace/checkpoint/key were never committed.
Done (CP8: predator agentness eval + auto-GIF; Stages 0/1/2/4 implemented this session)
Design: `docs/superpowers/specs/2026-06-02-proteus-predator-agentness-eval-design.md` (user-authored).
Plan: `docs/superpowers/plans/2026-06-02-proteus-cp8-agentness-eval.md`. TDD-ready Stages 0, 1, 2, 4
are complete (additive, 152 → 175 tests, full suite green). Stages 3 & 5 remain DESIGN-GATED (below).
Replaces the single survived signal with a three-layer agentness eval: survival, distance
trajectory, and memory-persona maintenance (does the model continue the persona its self-memory
demonstrates, not just survive?). `proteus replay`/`compare` surface every new metric automatically.
- Stage 0, auto-GIF: `proteus/game/viz/gif.py` `write_gif` (Pillow lazy) + `run`/`play` auto-render `<out>.gif` next to the trace (default on, `--no-gif`). Verified: 5-frame 256×256 animated GIF.
- Stage 1, metric-only: per-turn `post_focal_pos`/`post_predator_pos`/`pre_bfs_distance`/`post_bfs_distance`/`agent_distance_delta` on `TurnTrace`; scenario `max_bfs_distance`/`agent_distance_delta` (predator_evade impls; ABC `None` defaults); episode metrics `time_to_capture`/`distance_auc`/`min_distance`/`near_capture_count`; episode `turn_order` (`focal_then_predator`) / `capture_rule` (`same_cell`) / `horizon`. `away_move_fraction` kept (already pre-predator-based via `step_reward`). Mirrored in `_session_core.make_turn_trace` (web path).
- Stage 2, persona reference: `game/metrics/persona.py`: hidden `PersonaWeights` + `R_w` (`reward_rw`) + `reference_actions` (argmax set, ties → all) + `pressure`; built-ins `risk_averse`/`risk_seeking`/`survival_optimal`. Per-turn `reference_actions`/`reference_reward`/`model_reward`/`reward_regret`/`pressure` + episode `persona_weight_id`; metrics `action_agreement`/`reward_regret`/`pressure_weighted_agreement`/`persona_drift_turn` (present only when a persona ran, so additive). `proteus run --persona <id>`; weights never serialized / never in the prompt (verified no leak).
- Stage 4, hidden-weight memory: `generate_memory(persona=)` → deterministic persona demonstration (reference policy plays, `agent` may be `None`); `MemoryCheckpoint.persona_weight_id` (public id only). `proteus memory --persona <id>`; `run --memory <ckpt> --persona <id>` scores whether the model continues the demonstrated persona. Weight-agnostic transparent brief (no leakage).
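The "argmax set, ties → all" rule for reference actions, together with a per-turn regret, can be sketched as follows. The definitions here are assumptions for illustration: regret is taken as best reward minus the chosen action's reward, agreement is returned as a 0–1 fraction, and none of this is claimed to match `game/metrics/persona.py`.

```python
def reference_actions(rewards: dict, tol: float = 1e-9) -> set:
    """All actions whose persona reward ties with the maximum."""
    best = max(rewards.values())
    return {action for action, reward in rewards.items() if best - reward <= tol}


def reward_regret(rewards: dict, chosen: str) -> float:
    """How much persona reward the chosen action gave up versus the best one."""
    return max(rewards.values()) - rewards[chosen]


def action_agreement(turns) -> float:
    """Fraction of turns where the chosen action is in the reference set.

    `turns` is a list of (rewards, chosen_action) pairs.
    """
    hits = sum(1 for rewards, chosen in turns if chosen in reference_actions(rewards))
    return hits / len(turns)
```

Returning the whole tie set avoids penalising a model for picking one of several equally good actions.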
Still TODO, DESIGN-GATED (each needs superpowers:brainstorming + sub-spec BEFORE coding; do not
fabricate from the plan):
- Stage 3, simultaneous resolver (DESIGN-GATED): `plan → resolve` engine turn + crossing capture; changes capture goldens → own brainstorm + sub-spec; keep `focal_then_predator` selectable via a `--turn-order` flag (default unchanged). Adds per-turn `same_cell_capture`/`crossing_capture`/`captured` flags + refines `time_to_capture` to the real captured turn. See plan Stage 3 for the 4 open decisions.
- Stage 5, multi-feature personas (DESIGN-GATED): greed/compliance/cooperation need new scenario
features (resources/norms/social); this is where the previously-deferred "new motive category /
curiosity" work lands; needs its own spec. Predator-only must NOT over-interpret those personas (spec §9).
`PersonaWeights` already stubs `resource_reward`/`norm_cost`/`social_weight` for this.
The VanillaAgent._ACTION_DIRECTIVE / SessionRunner._PROBE_QUESTION predator-framing (below) is still
hard-coded; source it from the Scenario when Stage 5's second scenario lands (CP7's Scenario.memory_brief
is the pattern). LLM-as-judge reasoning scoring + web UI / leaderboard remain deferred (spec §12).
Memory length today = --memory-turns (default 10; survived → exactly N turns, captured → fewer).
Done (pack_evade scenario โ 64ร64 open-field multi-cell evasion; plan: docs/superpowers/plans/2026-06-02-proteus-pack-evade-scenario.md; design: docs/superpowers/specs/2026-06-02-proteus-pack-evade-scenario-design.md)
Executed task-by-task with strict TDD in an isolated worktree branched from CP7/CP8 master. Full suite: 206 passed (193 baseline + 13 new: 7 scenario behaviour + 2 footprint bounds + 2 render_frame + 2 manual-memory). Offline, headless; zero new dependencies.
- New
- `pack_evade` scenario (`proteus/game/scenarios/pack_evade.py`, registered alongside the untouched `predator_evade`): a 64×64 open field, no walls; multi-cell sprites (predator 5×5, focal 3×3); eat = footprint AABB overlap (not adjacency); center-to-center Manhattan geometry, so no BFS and no O(N²) work on 64×64 (analytic max distance = `(64-1)+(64-1) = 126`). Pure evasion: `habit_action == optimal_action` always, so no diagnostic (`is_diagnostic` is False).
- Engine stays single-focal. The 3-prey pack hunt (two caught) lives only in the hand-authored handover memory; the live engine gained one additive change: footprint-aware focal bounds in `MotiveGridGame.step()` (`_footprint_in_bounds`) so a multi-cell focal keeps its full footprint on-grid. The 1×1 path is byte-identical (regression-guarded), so `predator_evade` is unchanged.
- `Scenario.render_frame(game)` hook (`game/scenarios/base.py` default = full ASCII; `_session_core.render_ascii` delegates). `pack_evade` overrides it with a compact one-line coordinate observation (no 4096-char map); `predator_evade`'s observation/cut-frames are byte-identical to before.
- `Scenario.default_memory(seed, difficulty)` hook (default `None`) + wiring: `BuiltSession.default_memory` is computed in `build_session`; `session.py`/`interactive.py` resolve `self._memory or built.default_memory`, so an explicit `memory=` still overrides.
- `pack_evade.default_memory` generates the handover memory from a hidden persona weight vector (`get_persona("risk_averse")`) via the existing CP7/CP8 `memory_gen.generate_memory(..., persona=...)` in its reference-policy mode: the hidden policy actually plays a pack_evade episode and that self-play trajectory is the memory (per the agentness design, `docs/agentness_game_design_from_paper.md` §3-4). Only the public `persona_weight_id` is recorded; the raw weights never leak. Two compat shims (`_is_free` = in-bounds, `_bfs_distance` = Manhattan, since the field is wall-free) let the shared persona policy run on the open field; `generate_memory` now renders frames via `scenario.render_frame` (compact on pack_evade, ASCII-identical on predator_evade). The CP7 LLM self-play `memory_gen` path is otherwise untouched.
- Trace schema-/value-compatible with the existing tooling: verified end-to-end with `proteus run --scenario pack_evade --model fake:demo` → `proteus replay` (per-turn action vs optimal, habit==optimal, survival metrics) → `proteus compare` (model=demo difficulty=easy n=1).
- Real-LLM smoke (passed): `ollama:gpt-oss:120b-cloud` on `pack_evade`, covering both the CLI `run` (4 turns, survived) and the web Spectate path (`/spectate` → `/next` ×N, live reasoning of ~3k to 6.6k chars per turn, survived). The model reads the compact 64×64 observation and the injected persona memory and produces valid actions. Run with the SDK-free `.venv` untouched, via the throwaway `/tmp/proteus-smoke-venv` (httpx only) + `PYTHONPATH=<repo>` + cleaned `OLLAMA_API_KEY` (first whitespace token of the `.env` value; see the CP4 smoke gotcha). `--no-gif` since the smoke venv has no matplotlib.
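The footprint geometry described in the list above can be illustrated with a small sketch. This is not the code in `pack_evade.py`: the `Sprite` class, its field names, and the helper names are hypothetical, and odd square footprints centered on a cell are an assumption. Only the sprite sizes, the 64×64 wall-free field, and the 126 bound come from the notes above.

```python
from dataclasses import dataclass

GRID = 64  # side length of the wall-free field
MAX_DISTANCE = (GRID - 1) + (GRID - 1)  # analytic Manhattan maximum: 126


@dataclass(frozen=True)
class Sprite:
    """Hypothetical square sprite described by its center cell and odd side length."""
    cx: int
    cy: int
    size: int

    def bounds(self):
        """Inclusive footprint corners as (x0, y0, x1, y1)."""
        half = self.size // 2
        return (self.cx - half, self.cy - half, self.cx + half, self.cy + half)


def footprints_overlap(a: Sprite, b: Sprite) -> bool:
    """AABB test: True only if the footprints share at least one cell."""
    ax0, ay0, ax1, ay1 = a.bounds()
    bx0, by0, bx1, by1 = b.bounds()
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1


def manhattan(a: Sprite, b: Sprite) -> int:
    """Center-to-center Manhattan distance; O(1), no BFS needed without walls."""
    return abs(a.cx - b.cx) + abs(a.cy - b.cy)


def footprint_in_bounds(s: Sprite, grid: int = GRID) -> bool:
    """True if every cell of the footprint lies on the grid."""
    x0, y0, x1, y1 = s.bounds()
    return 0 <= x0 and 0 <= y0 and x1 < grid and y1 < grid
```

With a 5×5 predator centered at (10, 10), a 3×3 focal centered at (14, 10) is adjacent but not eaten, while one centered at (13, 10) overlaps. For a 1×1 sprite the bounds check reduces to the usual single-cell test, which is consistent with the note that the 1×1 path is unchanged.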
Merge note for master: this branch changed five shared files: proteus/game/scenarios/base.py
(two additive ABC defaults), proteus/game/engine/grid.py (footprint bounds), proteus/game/runtime/_session_core.py
(render_ascii delegate + BuiltSession.default_memory), proteus/game/runtime/session.py and
proteus/game/runtime/interactive.py (effective-memory fallback). All changes are additive/back-compatible;
reconcile with any concurrent CP7/CP8 edits to those files at merge time.
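For reviewers reconciling these files, the additive `default_memory` hook and the effective-memory fallback can be pictured roughly as below. This is a simplified sketch under assumptions, not the project's code: the class bodies, the `str` memory type, the `PackEvadeLike` and `PlainScenario` subclasses, and the `resolve_memory` helper are hypothetical. Only the hook name, its `None` default, `BuiltSession.default_memory`, `build_session`, and the `or` fallback are taken from the notes above.

```python
from abc import ABC
from dataclasses import dataclass
from typing import Optional


class Scenario(ABC):
    """Stand-in for the scenario ABC; only the new hook is shown."""

    def default_memory(self, seed: int, difficulty: str) -> Optional[str]:
        # Non-abstract default: existing scenarios keep working unchanged.
        return None


class PlainScenario(Scenario):
    """Hypothetical scenario that does not override the hook."""


class PackEvadeLike(Scenario):
    """Hypothetical scenario that supplies its own handover memory."""

    def default_memory(self, seed: int, difficulty: str) -> Optional[str]:
        return f"handover memory (seed={seed}, difficulty={difficulty})"


@dataclass
class BuiltSession:
    default_memory: Optional[str]


def build_session(scenario: Scenario, seed: int, difficulty: str) -> BuiltSession:
    # The default is computed once, at build time.
    return BuiltSession(default_memory=scenario.default_memory(seed, difficulty))


def resolve_memory(explicit: Optional[str], built: BuiltSession) -> Optional[str]:
    # Mirrors `self._memory or built.default_memory`: an explicit value wins.
    return explicit or built.default_memory
```

One consequence of using `or` is that a falsy explicit value such as an empty string also falls back to the scenario default; if an empty memory ever needs to mean "no memory", an `is None` check would be the alternative.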
Deferred items (carry forward)
- local/mlx/cuda providers un-ported (need local server / Apple-silicon / CUDA; out of CP4 scope).
- `VanillaAgent._ACTION_DIRECTIVE` and `SessionRunner._PROBE_QUESTION` hard-code the "predator" framing. This is fine for the single-scenario slice; source these from the `Scenario` when CP6+ adds scenarios.
- `SessionRunner._provider_model_name` reaches into `agent._provider` by naming convention; consider an explicit `model_name` on the `Agent` base when more agent types land. (CP5's `HumanAgent` has no `_provider`, so `model` correctly falls back to `agent.name == "human"`.)
- reasoning/probe are recorded but not scored (LLM-as-judge deferred per spec §12).
- `Scenario.record_focal_move` is not on the ABC. RESOLVED (CP6 Task 1): promoted to a non-abstract no-op default on `Scenario` (alongside `safety_distance`), so multi-scenario callers never hit `AttributeError`.
- CLI `--png` without the `viz` extra raises a raw `ModuleNotFoundError` traceback instead of a clean rc=2 (every other expected CLI failure returns a structured code). Not reachable today (matplotlib is in `.venv` and no packaging `extras` are defined yet). Add a `try/except ImportError` guard + companion test when packaging/extras land. Same pattern for surfacing `TraceReconstructionError` from `replay` on a corrupt trace, and a `--fps`-is-terminal-only help note.
- `viz.reconstruct` terminal flag with empty `trace.turns`: if a trace had zero played turns but a non-empty `outcome`, the final (cut) frame would show no terminal annotation. Not reachable today (`SessionRunner` guards an empty play phase with a `RuntimeError`); add a guard/comment if the trace format ever permits empty-turns-with-outcome.
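The deferred `--png` guard could look something like the sketch below. It is an assumption-laden illustration, not the existing CLI code: the function name `export_png`, the `importer` parameter (injected so a companion test can simulate a missing dependency), and the error message are hypothetical. The rc=2 convention and the `ImportError` guard are the parts taken from the item above.

```python
import importlib
import sys
from typing import Callable

EXIT_OK = 0
EXIT_ERROR = 2  # the structured code the notes above call "rc=2"


def export_png(
    out_path: str,
    importer: Callable[[str], object] = importlib.import_module,
) -> int:
    """Hypothetical --png handler that fails cleanly when matplotlib is absent."""
    try:
        # ModuleNotFoundError is a subclass of ImportError, so this catches both.
        importer("matplotlib")
    except ImportError:
        print(
            "error: --png needs matplotlib; install the viz dependencies first",
            file=sys.stderr,
        )
        return EXIT_ERROR
    # Real rendering to out_path would go here.
    return EXIT_OK
```

A companion test can pass an `importer` that raises `ModuleNotFoundError` and assert that the return code is 2, without having to uninstall matplotlib from the test environment.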
Open questions
- Package/project README rewrite (replace the deprecated persona-arena `depreated/README.MD`).
- Trajectory metric: `first_divergence_turn` is a coarse proxy. RESOLVED (CP6): added `trajectory_agreement` (position step-agreement vs the optimal rollout) + `final_distance_gap` (spec §7 "궤적 일치", i.e. trajectory agreement). `first_divergence_turn` is kept additively; open whether to retire it once the new pair proves sufficient (CP6 design §8). Also open (CP6 design §5.2): exact reward grading (delta vs ±1) is pinned by the property test; revisit if a future category needs a different shape.
- If a 3rd LLM-call result type appears (e.g. a reflect call), extract a shared `_LLMCallResult` base for the common `reasoning`/`raw_text`/`{input,output,thinking}_tokens` fields (ActResult/ProbeResult currently duplicate them, which is acceptable for two).
- Probe-vs-act-reasoning redundancy (deferred from CP5, spec-change territory): both `act` and `probe` capture reasoning per turn; whether the per-turn probe is worth its cost vs. reusing act reasoning is a measurement-design question for a spec revision, out of CP5 scope.
- Optional live-color human play: CP5 keeps the human's live view as plain ASCII (fairness: the human reads exactly what the LLM reads). A truecolor live mode for human play (distinct from post-hoc `replay --visual`) is possible but intentionally deferred to preserve baseline fairness.