| # PROTEUS — Handoff | |
| > Living handoff record. Updated at each checkpoint close. SSOT trio: | |
| > `docs/superpowers/specs/2026-06-01-proteus-arena-slice-design.md` (immutable design) + | |
| > the per-CP plan in `docs/superpowers/plans/` + this file. | |
| **Last updated:** 2026-06-02 — two parallel slices shipped: **`pack_evade`** scenario (64×64 open-field | |
| multi-cell evasion + persona weight-vector handover memory) and web **LLM spectating** (watch an LLM | |
| play the color grid with live reasoning/analysis + pause/step; `SpectateSession` pinned to | |
| `SessionRunner(VanillaAgent)` by a golden test; providers lazy-imported so offline import-safety holds). | |
| Both built in isolated worktrees and merged clean. Before them: CP7/CP8 (LLM-generated memory pre-roll + | |
| persona agentness eval) and Web interactive color-grid play; offline, headless. CP6 before that | |
| (difficulty layouts + category eval + `compare`). | |
| ## Restructure (2026-06-02): game/web layout | |
| The `proteus/` package was reorganised into a cleaner layout with **zero logic change** — 253 tests remain green. The top-level sub-packages are now: `game/` (engine, scenarios, agents, runtime, metrics, viz), `web/` (local server + arena skeleton), `shared/` (utilities), `cli/` (parser + per-command modules), and the unchanged `providers/`. Docs: design/SSOT at `docs/superpowers/specs/2026-06-02-proteus-game-web-restructure-design.md`; implementation plan at `docs/superpowers/plans/2026-06-02-proteus-game-web-restructure.md`. | |
| | Old path / import | New path / import | | |
| |---|---| | |
| | `proteus/grid/game.py` | `proteus/game/engine/grid.py` | | |
| | `proteus/grid/scenario.py` | `proteus/game/scenarios/base.py` | | |
| | `proteus/grid/scenarios/<x>.py` | `proteus/game/scenarios/<x>.py` | | |
| | `proteus/grid/difficulty.py` | `proteus/game/engine/difficulty.py` | | |
| | `proteus/grid/ascii_view.py` | `proteus/game/engine/ascii_view.py` | | |
| | `import proteus.grid` (registry side-effect) | `import proteus.game.scenarios` | | |
| | `proteus/arc_grid/` / `proteus.arc_grid.*` | `proteus/game/engine/` / `proteus.game.engine.*` (vendored engine at `proteus/game/engine/arcengine/`) | | |
| | `proteus/runtime/metrics.py` | `proteus/game/metrics/metrics.py` | | |
| | `proteus/runtime/persona.py` | `proteus/game/metrics/persona.py` | | |
| | `proteus/runtime/rollout.py` | `proteus/game/metrics/rollout.py` | | |
| | `proteus/runtime/aggregate.py` | `proteus/game/metrics/aggregate.py` | | |
| | `proteus/runtime/<other>.py` (session, _session_core, interactive, spectate, trace, io, memory, memory_gen, memory_policies) | `proteus/game/runtime/<x>.py` | | |
| | `proteus/agents/<x>.py` | `proteus/game/agents/<x>.py` | | |
| | `proteus/viz/<x>.py` / `proteus.viz` | `proteus/game/viz/<x>.py` | | |
| | `proteus/web/server.py` | `proteus/web/local/server.py` | | |
| | `proteus/web/static/index.html` | `proteus/web/local/static/index.html` | | |
| | `import proteus.web.server` | `import proteus.web.local.server` | | |
| | `python -m proteus.web` | `python -m proteus.web.local` (old command still works via a shim) | | |
| | `proteus/cli.py` | `proteus/cli/` package (`cli/__init__.py`, `cli/parser.py`, `cli/commands/{run,play,memory,replay,compare,list_scenarios}.py`) | | |
| ## Done (CP0–CP3) | |
| - **CP0** — `proteus` package skeleton + `pyproject.toml` (hatchling, pydantic/numpy/pyyaml, | |
| pytest config); `arc_grid` vendored **byte-for-byte** into `proteus/game/engine/arcengine/` (verified | |
| `diff -r` empty; firewall holds — zero `squid_game`/`proteus` cross-imports inside it). | |
| - **CP1** — `grid` family ported with `squid_game.* → proteus.*` renames only: `difficulty`, | |
| `scenario` (ABC + registry), `game` (`MotiveGridGame`), `ascii_view`, `scenarios/predator_evade`. | |
| Registry self-populates on `import proteus.game.scenarios`. Acceptance gate locked: deterministic EASY | |
| handover (`focal (3,3) / predator (5,3)`) and the diagnostic invariant `optimal "up" ≠ habit "left"`. | |
| - **CP2** — slim `Agent` ABC + frozen `ActResult`; `extract_action` parser; `VanillaAgent` | |
| (act + optional probe, reasoning via `parse_thinking_tags`); providers `base` + | |
| `thinking_utils` (pure ports) + new `FakeProvider` (scripted, offline). | |
| - **CP3** — lean pydantic `TurnTrace`/`SessionTrace` (`outcome` is `Literal["survived","eliminated"]`); | |
| `compute_metrics`; `SessionRunner` orchestrator (Cut replay → handover → N-turn play → | |
| scored `SessionTrace`). Full offline end-to-end with JSON(L) round-trip. | |
| ## Done (CP4 — real-provider CLI; plan: `docs/superpowers/plans/2026-06-01-proteus-arena-cli.md`) | |
| - **Provider factory** (`proteus/providers/factory.py`) — `make_provider("name:model")` + | |
| `available_providers()`. **Lazy** string registry `(module_path, class_name)`: importing | |
| `proteus.providers` or building a `FakeProvider` never imports an SDK (offline invariant holds). | |
| `fake:<model>` → offline `FakeProvider`; `partition(":")` so model ids with `:` (ollama) survive. | |
| - **Cloud providers ported** (openai / anthropic_provider / gemini / ollama_cloud) by import-rename | |
| ONLY (`squid_game.* → proteus.*`); verified by file-text tests (SDKs not installed offline). | |
| Class names match the registry. local/mlx/cuda providers **not ported** (deferred — need a local | |
| server / Apple-silicon / CUDA; off the CP4 "one cheap model" gate). | |
| - **JSONL trace I/O** (`proteus/game/runtime/io.py`) — `append_trace` / `read_traces` (one SessionTrace | |
| per line; creates parent dirs; skips blank lines). | |
| - **CLI** (`proteus/cli/` package, `proteus/__main__.py`) — `run` / `list-scenarios` / `replay`. | |
| `run` validates `--model` (unknown → stderr+exit 2) AND `--scenario` (unknown → stderr+exit 2, | |
| before any session runs); `replay` prints per-turn action vs motive/habit + metrics. `python -m proteus`. | |
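The lazy string registry and the `partition(":")` split described in the provider-factory bullet can be sketched roughly as follows. This is a minimal illustration, not the contents of `proteus/providers/factory.py`: the names `_REGISTRY`, `split_spec`, and `resolve_class` are hypothetical, and a standard-library class stands in for an SDK-backed provider so the sketch runs offline.

```python
import importlib

# Hypothetical registry: provider name -> (module path, class name).
# Only strings are stored, so nothing is imported until a provider is requested.
_REGISTRY = {
    "ordered": ("collections", "OrderedDict"),  # stand-in for an SDK-backed class
}


def split_spec(spec: str) -> tuple[str, str]:
    """Split "name:model" on the FIRST colon so model ids may contain colons."""
    name, sep, model = spec.partition(":")
    if not sep or not name or not model:
        raise ValueError(f"expected 'provider:model', got {spec!r}")
    return name, model


def resolve_class(name: str) -> type:
    """Import the provider module only at first use."""
    try:
        module_path, class_name = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown provider {name!r}") from None
    return getattr(importlib.import_module(module_path), class_name)
```

With this shape, `split_spec("ollama:gpt-oss:120b-cloud")` keeps the model id `gpt-oss:120b-cloud` intact, which is the property the bullet calls out for Ollama model names.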
| ### Real-model smoke (CP4 acceptance gate — passed) | |
| - **Model:** `ollama:gpt-oss:120b-cloud` (Ollama Cloud, native `/api/chat`). seed 42 / easy / | |
| play-turns 10 / probe on. | |
| - **Outcome:** `survived | motive_reading_accuracy=70.0% | reactivity_index=75.0%` | |
| (survival_fraction 100%, first_divergence_turn 3). Real reasoning captured (3.9k–13.5k chars/turn). | |
| Confirms the full chain CLI→agent→OllamaCloudProvider→ollama.com and the lazy-imported port. | |
| - **How it was run (repeatable):** `.venv` is kept **SDK-free**. A throwaway venv at | |
| `/tmp/proteus-smoke-venv` (only `pydantic numpy pyyaml httpx` — ollama needs just `httpx`) is run with | |
| `PYTHONPATH=<repo> /tmp/proteus-smoke-venv/bin/python -m proteus run --model ollama:gpt-oss:120b-cloud ...`. | |
| `runs/` and `.env` are gitignored (trace + key never committed). | |
| - **Key gotcha (carry forward):** the `.env` `OLLAMA_API_KEY` value has a leading space + trailing | |
| extra tokens; the *real* key is the **first whitespace-delimited token** (57 chars). Passing the | |
| whole value → 401. The smoke extracts `... | awk '{print $1}'`. Cleaning the `.env` line (key only) | |
| would remove the need for that. `OLLAMA_API_KEY2` exists but is **not authorized** for use. | |
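For callers that read the `.env` value from Python rather than the shell, the same "first whitespace-delimited token" rule could look like the sketch below. The helper name `first_token` is hypothetical; it matches `awk '{print $1}'` for non-empty values and, as an added assumption, rejects empty input instead of returning an empty string.

```python
def first_token(raw_value: str) -> str:
    """Return the first whitespace-delimited token of a raw .env value."""
    tokens = raw_value.split()  # split() with no args ignores leading/trailing whitespace
    if not tokens:
        raise ValueError("value is empty or whitespace-only")
    return tokens[0]
```

Passing only this token to the provider avoids the 401 that the whole raw value triggers.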
| ## Done (CP4.5 — per-turn LLM-response accounting; no separate plan, executed in-session) | |
| Motivation: the smoke surfaced that token accounting and raw text were dropped (`thinking_tokens=0` | |
| every turn, no `raw_text`). Fixed via TDD 3-gate, all ADDITIVE except the probe return-type change: | |
| - **Act path** (`1c18fa9`): `ActResult` gains `input_tokens/output_tokens/thinking_tokens`; | |
| `VanillaAgent.act()` propagates them from `CompletionResult` with | |
| `thinking_tokens = result.thinking_tokens or <inline-think parser count>` (provider count preferred). | |
| `TurnTrace` gains `raw_text/input_tokens/output_tokens` (its `thinking_tokens` field is now populated); | |
| `SessionRunner.run()` wires them. | |
| - **Probe path** (`f95852b`): new frozen `ProbeResult(answer, reasoning, raw_text, input/output/thinking | |
| tokens)`; `Agent.probe` returns it (was `str`); `TurnTrace` gains `probe_reasoning/probe_raw_text/ | |
| probe_{input,output,thinking}_tokens`; `SessionRunner` wires them (`probe_a = probe.answer`). | |
| Probe `reasoning` falls back to `""` (not the answer — a probe answer is not its own reasoning). | |
| A regression test proves probe-side and act-side token fields are **not cross-wired** (probe=3, act=5). | |
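The "provider count preferred, inline parser as fallback" rule from the act path can be illustrated with a small sketch. Everything here is an assumption made for illustration: `CompletionStub` stands in for `CompletionResult`, the `<think>` tag name is assumed, and a whitespace word count replaces whatever tokenization the real parser uses.

```python
import re
from dataclasses import dataclass

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)  # assumed tag name


@dataclass(frozen=True)
class CompletionStub:
    """Hypothetical stand-in for the provider's completion result."""
    text: str
    thinking_tokens: int = 0


def inline_think_count(text: str) -> int:
    """Crude proxy: count whitespace-separated words inside think tags."""
    return sum(len(chunk.split()) for chunk in _THINK_RE.findall(text))


def resolve_thinking_tokens(result: CompletionStub) -> int:
    # A non-zero provider count wins; 0 falls through to the inline parser.
    return result.thinking_tokens or inline_think_count(result.text)
```

One consequence of using `or` is that a provider reporting exactly zero thinking tokens is indistinguishable from one reporting nothing, so the inline count is used in both cases.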
| ## Done (CP5 — human play + trace visualization; plan: `docs/superpowers/plans/2026-06-01-proteus-arena-viz-humanplay.md`) | |
| Executed task-by-task via subagent-driven-development (3-gate: implement → spec review → code-quality review). All offline/headless; no network, no display, no provider SDK added. | |
| - **`HumanAgent`** (`proteus/game/agents/human.py`, Tasks 1–2) — implements the `Agent` ABC with **I/O | |
| injection** (`input_fn`/`output_fn`, resolved from `builtins` at construction so tests drive it | |
| headlessly). `act` parses canonical actions + WASD shortcuts (`w/a/s/d`), is case/whitespace- | |
| insensitive, re-prompts on invalid input; `probe` returns a typed `ProbeResult`; `name="human"`. | |
| Routed through the **unmodified** `SessionRunner`, so a human trace is schema- AND value-identical | |
| to an LLM trace under the same actions (only `model` differs). Locked by | |
| `tests/runtime/test_human_comparability.py` (Task 3, spec §10). | |
| - **`proteus/game/viz/`** (Tasks 4–6) — `reconstruct.py` rebuilds pixel frames by **deterministically | |
| replaying** the trace through the real engine (`scenario, seed, difficulty` + recorded actions), | |
| with **double self-verify** (every Cut frame's ASCII == stored `cut_frames`; sprite positions before | |
| each turn == stored `focal_pos/predator_pos`) → raises `TraceReconstructionError` on any divergence. | |
| `terminal.py` renders truecolor (24-bit ANSI) block grids + a side panel (action/motive/habit/reward/ | |
| tokens + reasoning excerpt, from the CP4.5 enrichment). `png.py` writes per-frame `frame_NNN.png` via | |
| matplotlib **Agg** (imported **lazily** inside `write_pngs`). | |
| - **CLI** (Tasks 8–9) — `play` subcommand (human via stdin; **probe OFF by default**, opt-in `--probe`; | |
| optional `--out`) + `replay --visual/--png DIR/--fps` (text replay stays the **default**, CP4 | |
| non-destructive; viz imported lazily only when a visual flag is set). | |
| - **Offline import invariant made real** (Task 7, user-approved root-cause fix): `proteus/game/engine/arcengine/ | |
| rendering.py`'s top-level `try: import matplotlib` was moved **lazily inside `render_frames`**, so | |
| importing the engine / `proteus.game.viz` no longer pulls matplotlib in ANY ordering. `tests/viz/ | |
| test_import_safety.py` guards it. (This intentionally diverges from the untracked top-level `arc_grid/` | |
| vendored copy — that copy is not on the import path.) | |
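The I/O-injection pattern that lets `HumanAgent` be driven headlessly can be sketched as a standalone function. This is not the class in `proteus/game/agents/human.py`: `read_action` is a hypothetical helper, and both the action set and the WASD mapping (`w`→up, `a`→left, `s`→down, `d`→right) are assumptions.

```python
import builtins
from typing import Callable

ACTIONS = ("up", "down", "left", "right", "stay")  # assumed canonical actions
_WASD = {"w": "up", "a": "left", "s": "down", "d": "right"}  # assumed mapping


def read_action(
    input_fn: Callable[[str], str] = builtins.input,
    output_fn: Callable[[str], None] = builtins.print,
) -> str:
    """Prompt until a valid action arrives; case and whitespace are ignored."""
    while True:
        raw = input_fn("action> ").strip().lower()
        action = _WASD.get(raw, raw)  # expand shortcuts, pass canonical names through
        if action in ACTIONS:
            return action
        output_fn(f"invalid action {raw!r}; use {', '.join(ACTIONS)} or w/a/s/d")
```

A test can then script the session without a terminal, for example `read_action(lambda _prompt: "W", lambda _msg: None)`, which returns `"up"`.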
| ### CP5 acceptance demo (passed) | |
| - **Full suite:** `.venv/bin/python -m pytest -q` → **103 passed**, offline, no display. | |
| - **Comparability:** human (`runs/cp5_human.jsonl`) and fake-LLM (`runs/cp5_llm.jsonl`) at seed 42 / | |
| play-turns 6, **both committing the same action** (CLI `fake:` emits `ACTION: stay`, so the human was | |
| driven with `stay` to match) → **identical** `cut_frames`, per-turn `action`, and `motive_action` | |
| answer keys (`['up','up']`); differ only in `model` (`human` vs `demo`). NOTE: the plan's demo drove | |
| the human with `up` vs a `stay`-playing fake LLM — that necessarily diverges (different actions → | |
| different world), so the demo was corrected to use a matching action; the comparability **invariant** | |
| itself (identical actions → identical trace) is unchanged and unit-tested. | |
| - **Visual:** `replay --visual --fps 0` renders the truecolor grid + side panel; `replay --png | |
| runs/cp5_frames` wrote **5 non-zero** `frame_*.png` (cut_length 2 + 1 initial + 2 play turns). | |
| ## Done (CP6 — difficulty layouts + category eval + human-baseline harness; plan: `docs/superpowers/plans/2026-06-02-proteus-cp6-difficulty-eval-baseline.md`; design: `docs/superpowers/specs/2026-06-02-proteus-cp6-difficulty-eval-baseline-design.md`) | |
| Executed task-by-task via subagent-driven-development (3-gate: implement → spec review → code-quality review). All offline/headless; no network, no display, no provider SDK added. **Full suite: 127 passed** (103 CP5 baseline + 24 CP6 additions). | |
| - **Section A — difficulty layouts** (Tasks 1–3): `Scenario.build_level(rng, difficulty)` now takes the band; `record_focal_move`/`safety_distance` promoted to concrete defaults on the `Scenario` ABC (resolves a CP5 deferred item). `predator_evade` dispatches a **hand-authored per-difficulty layout table** (`_Layout` dataclass + `_LAYOUTS`): EASY 8×8 (byte-for-byte unchanged), MEDIUM 10×10, HARD 12×12 (L-trap), EXPERT 12×12 (forked corridor). `build_level` sets the **instance** attr `self.grid_size` so the camera/`within_bounds` size per-band. Golden test `tests/grid/test_difficulty_layouts.py` is the oracle: per band it locks determinism + the handover + the diagnostic invariant `optimal_action != habit_action` with `habit == "left"` blocked. Candidate coordinates satisfied the invariant with **no tuning** (all four bands: optimal=`up` ≠ habit=`left`). `rules_text` de-hard-coded ("8x8 grid" → "a grid") so it no longer lies on the larger bands. | |
| - **Section B — category-specific eval** (Tasks 4–6): per-turn reward **ownership moved into the scenario**. New abstract `Scenario.step_reward(game, action, blocked, focal_before, predator_before)`; `SessionRunner._reward` + its 5 constants **deleted** (delegates now). predator_evade survival reward = BFS-distance gain measured against the predator's **pre-move** cell (away→positive `float(d_after − d_before)`, toward→negative, wall-hit→`−3.0`, terminal `−50/+50`), isolating the agent's own move from the chase. New `proteus/game/metrics/rollout.py` `optimal_rollout(...)` (engine-replay, same pattern as `game/viz/reconstruct`, imports `game.scenarios` only). `metrics.py` gains four additive keys: **`away_move_fraction`** (survival headline), `mean_step_reward`, `trajectory_agreement` (vs the optimal rollout, denominator = all played turns), `final_distance_gap`. Existing four metrics retained; `test_human_comparability`'s `set(h.metrics)==set(v.metrics)` still holds. | |
| - **Section C — human-baseline harness** (Tasks 7–8): pure `proteus/game/metrics/aggregate.py` `aggregate_traces(traces) → {(model, difficulty): {"n", "metrics": means}}` (no I/O, union-of-keys robust). New CLI `proteus compare <trace.jsonl>… [--out summary.json]` (rc 2 not-found / rc 1 empty / rc 0, mirrors `replay`; JSON keys stringified `"model|difficulty"`). | |
| ### CP6 acceptance demo (passed) | |
| - **Full suite:** `.venv/bin/python -m pytest -q` → **127 passed**, offline, no display. | |
| - **Per-difficulty smoke:** `proteus run … --difficulty {easy,medium,hard,expert} --model fake:demo --seed 42 --play-turns 6 --no-probe` then `proteus compare …` → four `model=demo difficulty=<band> n=1` groups, each with the 8 metric means (incl. `away_move_fraction`/`trajectory_agreement`/`final_distance_gap`), summary written to `runs/cp6_summary.json`. Values differ by band (e.g. EXPERT reports `trajectory_agreement=66.67` and `survival_fraction=50.0`, where the other bands report `50.0` and `33.33` respectively), confirming the layouts/metrics are genuinely band-sensitive. | |
| - **Invariant acceptance:** all four bands print `optimal=up habit=left` with distinct grids `(8,8)/(10,10)/(12,12)/(12,12)` — `optimal != habit` everywhere. | |
| - **Note:** the `fake:demo` provider plays a fixed non-optimal action (so `motive_reading_accuracy=0`); the smoke validates the deterministic pipeline + metric surfacing, not agent skill. | |
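The grouping that `proteus compare` reports (one `(model, difficulty)` group with `n` and metric means) can be approximated with the sketch below. It works on plain dicts rather than `SessionTrace` objects, the function names are hypothetical, and "union-of-keys robust" is interpreted here as averaging each metric over only the traces that report it, which is an assumption.

```python
from collections import defaultdict


def aggregate(traces: list[dict]) -> dict[tuple[str, str], dict]:
    """Group traces by (model, difficulty) and average each metric."""
    groups: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for trace in traces:
        groups[(trace["model"], trace["difficulty"])].append(trace["metrics"])

    summary = {}
    for key, metric_dicts in groups.items():
        names = set().union(*metric_dicts)  # union of keys across the group
        means = {}
        for name in sorted(names):
            values = [m[name] for m in metric_dicts if name in m]
            means[name] = sum(values) / len(values)
        summary[key] = {"n": len(metric_dicts), "metrics": means}
    return summary


def stringify_keys(summary: dict[tuple[str, str], dict]) -> dict[str, dict]:
    """JSON cannot encode tuple keys, so join them as "model|difficulty"."""
    return {f"{model}|{difficulty}": value
            for (model, difficulty), value in summary.items()}
```

Keeping the aggregation free of I/O, as the handoff notes describe, means the CLI layer only has to read JSONL, call the function, and serialize the stringified result.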
| ## Done (Web interactive play — color-grid browser arena; plan: `docs/superpowers/plans/2026-06-02-proteus-web-interactive-play.md`; design: `docs/superpowers/specs/2026-06-02-proteus-web-interactive-play-design.md`) | |
| Executed task-by-task via subagent-driven-development (implement → spec review → code-quality review per task). The web slice alone was **149 passed** (127 CP6 baseline + 22 new web/runtime tests); after merging with the concurrent CP7/CP8 work the **combined suite is 193 passed**. Merge reconciliation: the `SessionRunner`→`_session_core` delegation refactor was kept as the basis and CP7 memory + CP8 persona were threaded through the shared helpers, so both `SessionRunner` and `InteractiveSession` (and thus web play) now emit CP8-rich traces from one source of truth. | |
| - **Stdlib-only color-grid web play** — launch with `python -m proteus.web.local` (`--host`/`--port`, defaults to `127.0.0.1:8000`; `python -m proteus.web` still works via a shim). A single static `index.html` (vanilla JS, no build, no CDN) renders the engine's integer grids in color via `proteus.game.engine.rendering.COLOR_MAP`. **Zero new dependencies** — the `.venv` stays SDK-free/offline. | |
| - **Shared session core** — extracted `SessionRunner`'s per-session mechanics into `proteus/game/runtime/_session_core.py` (pure refactor, no behavior change — the existing tests stayed green at the same count). Both `SessionRunner` and the new `InteractiveSession` build their traces from these same helpers. | |
| - **`InteractiveSession`** (`proteus/game/runtime/interactive.py`) — a threadless, stepwise driver (`state()`/`step()`/`finish()`) that advances one turn per HTTP request. It is pinned to `SessionRunner(HumanAgent)` by a **golden equivalence test** (`tests/runtime/test_interactive_equivalence.py`) asserting full `model_dump()` equality for the same action sequence — the two paths cannot drift. | |
| - **Fairness split** (carried from CP5) — the live `state` exposes only the color grid + available actions (no reward/optimal/habit), exactly what the LLM sees; the full per-turn disclosure (your action vs optimal vs habit, reward, 8 metrics) appears in `review` only once the game is over. Asserted in both the `InteractiveSession` and server tests. | |
| - **Always-saved trace** — every finished game appends a `runs/web_<scenario>_<difficulty>.jsonl` line (override via the `/finish` `{out}` body) that is schema-identical to LLM/CLI traces. Verified end-to-end: a web-produced trace is accepted by both `proteus replay` and `proteus compare` (shows up as a `model=human` baseline group). | |
| - **HTTP API** (stdlib `http.server`) — `GET /` (page), `GET /config`, `POST /session`, `GET /session/{id}`, `POST /session/{id}/act`, `POST /session/{id}/finish`; a pure socket-free `handle_request(...)` router (unit-tested without a socket) + in-memory registry; structured errors (400/404/409/500); offline import-safety guarded (`tests/web/test_import_safety.py`). | |
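A socket-free router of the kind described above can be unit-tested as an ordinary function. The sketch below is a simplified stand-in, not the server's `handle_request(...)`: the signature, the session payload, and the choice of "act after finish" as the 409 case are all assumptions.

```python
import json

_SESSIONS: dict[str, dict] = {}  # in-memory registry, keyed by session id


def handle_request(method: str, path: str, body: bytes = b"") -> tuple[int, dict]:
    """Map (method, path, body) to (status, JSON-able payload) without a socket."""
    parts = [p for p in path.split("/") if p]
    try:
        payload = json.loads(body) if body else {}
    except json.JSONDecodeError:
        return 400, {"error": "invalid JSON body"}

    if method == "POST" and parts == ["session"]:
        session_id = str(len(_SESSIONS) + 1)
        _SESSIONS[session_id] = {"turn": 0, "done": False}
        return 200, {"id": session_id}

    if len(parts) >= 2 and parts[0] == "session":
        session = _SESSIONS.get(parts[1])
        if session is None:
            return 404, {"error": "unknown session"}
        if method == "GET" and len(parts) == 2:
            return 200, dict(session)
        if method == "POST" and parts[2:] == ["act"]:
            if session["done"]:
                return 409, {"error": "session already finished"}
            if "action" not in payload:
                return 400, {"error": "missing 'action'"}
            session["turn"] += 1
            return 200, dict(session)
        if method == "POST" and parts[2:] == ["finish"]:
            session["done"] = True
            return 200, dict(session)

    return 404, {"error": "no such route"}
```

An `http.server` request handler would then only read the request, call this function, and write the status and JSON body, which keeps the routing logic testable offline.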
| ## Done (Web LLM spectating — watch an LLM play in color; plan: `docs/superpowers/plans/2026-06-02-proteus-web-spectate.md`; design: `docs/superpowers/specs/2026-06-02-proteus-web-spectate-design.md`) | |
| Executed task-by-task via subagent-driven-development (TDD, commit-per-task). **Full suite: 202 passed** (193 baseline + 9 new spectate tests), offline, no display, no provider SDK added. A person can now **watch an LLM play the arena in the browser**: the color grid auto-advances turn by turn while a live panel shows the model's reasoning plus the per-turn optimal/habit/reward (rich disclosure — a spectator is not the scored subject), with **pause / step** controls. | |
| - **`SpectateSession`** (`proteus/game/runtime/spectate.py`) — a threadless, **agent-driven** sibling of `InteractiveSession`. `state()`/`advance()`/`finish()` over the shared `_session_core`; the server (not a human) supplies each action by calling `agent.act(...)` once per HTTP request. It reads `default_memory` via `getattr(built, "default_memory", None)` (forward-compatible with pack_evade) and passes `memory=` to `build_observation`. It is pinned to `SessionRunner(VanillaAgent(same FakeProvider))` by a **golden equivalence test** (`tests/runtime/test_spectate_equivalence.py`) asserting full `model_dump()` equality — the spectate path cannot drift from the runner path. | |
| - **HTTP API** — `POST /spectate`, `POST /spectate/{id}/next`, `POST /spectate/{id}/finish`, `GET /spectate/{id}`; `/config` now also returns `providers` + `default_model` (`fake:demo`). Providers/agents are **lazy-imported inside the create handler** (`available_providers` locally in `_config_payload`), so `import proteus.web.local.server` still pulls no provider SDK — the offline import-safety guard is extended and stays green. Structured errors: unknown model → 400, advance-after-done → 409, provider failure → 500 (no stack leak). | |
| - **Frontend** (`proteus/web/local/static/index.html`) — a Play/Spectate mode toggle + model input; in spectate mode the human pad/keys go inert, a live analysis panel renders the model's reasoning + a per-turn table (turn/action/optimal/habit/reward/=opt?), and an auto-play loop drives `/next` with **Pause/Resume** and **Step**. On end it shows SURVIVED/ELIMINATED and the saved trace path. | |
| - **Trace** — spectated runs save a `proteus compare`-compatible LLM-baseline trace (`runs/web_spectate_*.jsonl` by default) accepted by `proteus replay` / `proteus compare` as a `model=<provider model>` group (verified e2e with `fake:demo`). | |
| ## How to run | |
| - Environment: `python` is **not** on PATH; a gitignored venv lives at `.venv` (Python 3.12, | |
| pydantic v2 / numpy / pyyaml / pytest **+ matplotlib** — the latter now required by the viz/png | |
| tests). **Full suite:** `.venv/bin/python -m pytest -q` → **148 passed**, no network, no display. | |
| Provider SDKs are NOT installed in `.venv` (keeps the offline invariant structural); matplotlib runs | |
| **Agg-only** (headless), so it does not break the offline/no-display invariant. Recreate `.venv` with: | |
| `uv venv --python 3.12 .venv && uv pip install --python .venv/bin/python "pydantic>=2" "numpy>=1.26" "pyyaml>=6" "pytest>=8" "matplotlib>=3.8"` | |
| - Human play (offline): `printf 'up\nup\n...' | .venv/bin/python -m proteus play --scenario | |
| predator_evade --seed 42 --play-turns 6 --out runs/h.jsonl` (live view is the same ASCII the LLM sees). | |
| - Visual replay: `.venv/bin/python -m proteus replay runs/h.jsonl --visual --fps 0` (truecolor terminal) | |
| or `--png runs/frames` (per-frame PNGs). Plain `replay runs/h.jsonl` stays text-only. | |
| - Web interactive play (offline, stdlib-only): `.venv/bin/python -m proteus.web.local` (`--host`/`--port`, | |
| defaults `127.0.0.1:8000`), then open the printed `http://127.0.0.1:8000/` — play the color grid in the | |
| browser; the finished game appends a schema-identical trace to `runs/web_<scenario>_<difficulty>.jsonl` | |
| (replayable/comparable with `proteus replay`/`compare`). Zero new dependencies. | |
| - Offline CLI smoke: `.venv/bin/python -m proteus run --scenario predator_evade --model fake:demo | |
| --seed 42 --play-turns 5 --out runs/x.jsonl && .venv/bin/python -m proteus replay runs/x.jsonl` | |
| - Difficulty bands: add `--difficulty {easy,medium,hard,expert}` to `run`/`play` (EASY is the default, | |
| byte-for-byte the CP0–CP5 world). Human-baseline compare: collect matched human (`play --out`) and LLM | |
| (`run --out`) traces, then `.venv/bin/python -m proteus compare h.jsonl llm.jsonl --out summary.json` | |
| → per-`(model, difficulty)` metric means + n (the `model` token is the part AFTER the colon, e.g. | |
| `fake:demo` → `demo`). | |
| - Memory pre-roll (CP7): generate a checkpoint then reuse it — | |
| `.venv/bin/python -m proteus memory --scenario predator_evade --model fake:demo --seed 42 --memory-turns 8` | |
| writes `runs/memory/<model>/<stamp>.json`; `proteus run … --memory latest` (or `--memory generate` to do | |
| both in one shot, `--memory <path>` for a specific file, default `none`) injects it at the handover. | |
| `runs/memory/` is gitignored. Korean usage guide: `docs/USAGE-ko.md`. | |
| - Real-model smoke: see the throwaway-venv recipe above (needs a provider SDK + key; never in `.venv`). | |
| ## Current state | |
| A scored `SessionTrace` is produced offline (FakeProvider) AND against real cloud models via the CLI; | |
| each turn persists full act + probe accounting to `runs/*.jsonl`. Single scenario only: `predator_evade` | |
| (survival motive). **Human play and trace visualization now exist** (CP5): `proteus play` runs a human | |
| through the same `SessionRunner` (comparable trace), and `proteus replay --visual/--png` renders any | |
| trace as truecolor terminal frames or PNGs by deterministically replaying the engine. `proteus replay` | |
| with no flags still prints TEXT (CP4 behavior preserved). `proteus/game/engine/arcengine/rendering.py` is now actively | |
| reused (palette + ANSI + RGB helpers) and its matplotlib import is lazy. | |
| **CP6 adds:** four genuinely distinct hand-authored difficulty layouts (EASY unchanged), scenario-owned | |
| category-specific reward (`step_reward`; survival = reward for moving away from the predator), an | |
| optimal-rollout trajectory metric (`game/metrics/rollout.py`) + four new metric keys, and a `proteus compare` | |
| human-baseline aggregation CLI. Still a single scenario (`predator_evade`, survival motive) — the | |
| `step_reward`/`safety_distance` ABC surface is now **open** for a later checkpoint (originally slated as CP7, since renamed CP8) to plug new motive categories in | |
| without touching `SessionRunner`. | |
| ## Done (CP7 — LLM-generated memory pre-roll; design: `docs/superpowers/specs/2026-06-02-proteus-cp7-memory-preroll-design.md`; plan: `docs/superpowers/plans/2026-06-02-proteus-cp7-memory-preroll.md`) | |
| Executed inline (executing-plans, TDD, commit-per-task). **Full suite: 152 passed** (127 baseline + 25 | |
| CP7), offline/headless; no provider SDK added to `.venv`. The original "CP7 = new motive category" idea | |
| was **renamed to CP8** when the user pivoted to this memory feature; curiosity/sociality stay deferred. | |
| - **Stage 1 — `game/runtime/memory.py`** (pure pydantic + stdlib, no intra-runtime imports, like `trace.py`): | |
| `MemoryTurn`/`MemoryCheckpoint` models; `save_checkpoint`/`load_checkpoint` (single-file JSON at | |
| `runs/memory/<safe(model)>/<stamp>.json`); `latest_for_model` (lexical stamp = chronological); | |
| `render_memory_block` (pure observation renderer). Exported from `game/runtime/__init__`. | |
| - **Stage 2 — `game/runtime/memory_gen.py`** (engine-coupled, like `game/metrics/rollout.py`): `generate_memory(scenario, | |
| agent, *, difficulty, seed, memory_turns, model_name, clock)` runs a **full-info self-play** episode | |
| (`max_steps=memory_turns`, no scripted Cut, no answer-key leakage), capturing frames/actions/reasoning | |
| → `MemoryCheckpoint`. Deterministic under an injected `clock`. New scenario-sourced `Scenario.memory_brief` | |
| (concrete attr default `""`); `predator_evade` sets a transparent brief (discloses the BFS chase). This | |
| resolves the **memory-prompt** half of the old "source prompts from Scenario" item; the | |
| `_PROBE_QUESTION`/`_ACTION_DIRECTIVE` predator-framing is still hard-coded (carried to CP8). | |
| - **Stage 3 — hybrid handover** (`trace.py` + `session.py`): additive `SessionTrace.memory_ref: str | None`; | |
| `SessionRunner(..., memory=None, memory_ref=None)` prepends the memory block (+ `NOW — this run so far:`) | |
| to the **turn-1** observation only, **before** the unchanged scripted Cut. `step_reward`, answer keys, | |
| metrics, and `_session_core.py` untouched → diagnostic invariant + 127 preserved (pinned by | |
| `test_session_memory.py`: with/without memory the per-turn `motive_action`/`habit_action`/`is_diagnostic` | |
| and `metrics` are identical; only the turn-1 observation grows). | |
| - **Stage 4 — CLI** (`proteus/cli/` package): `proteus memory` (generate+save a checkpoint) and `proteus run --memory | |
| MODE` where `MODE ∈ {none(default), generate, latest, <path>}` via a `_resolve_memory` helper. `--memory-turns` | |
| (default 10) and `--memory-root` (default `runs/memory`) on both. Exit codes mirror `run` (unknown | |
| model/scenario → 2; `latest` with no checkpoint → 2). | |
| ### CP7 acceptance smoke (passed) | |
| - **Offline end-to-end:** `proteus memory … fake:demo` → `proteus run … --memory latest` → the turn-1 | |
| observation carries the `MEMORY` block + `NOW` separator + scripted `Cut 1/2:` in order; `memory_ref` set. | |
| - **Ollama (networked, manual):** `ollama:gpt-oss:120b-cloud`, seed 42 / easy. `proteus memory | |
| --memory-turns 6` → the **real model self-played 6 turns and survived** (vs the fake `stay` agent which | |
| is eliminated in 2 turns), checkpoint written to `runs/memory/gpt-oss_120b-cloud/<stamp>.json`. Then `run | |
| --play-turns 8 --memory latest` → **survived | motive_reading_accuracy=50% | reactivity_index=40%**; | |
| turn-1 observation = 962 chars containing the 6-turn memory block + the scripted Cut; | |
| `memory_ref="gpt-oss:120b-cloud@<stamp>"`. `.venv` stayed SDK-free (throwaway `/tmp/proteus-smoke-venv` | |
| with `httpx`; key = first whitespace token of the `.env` value, CP4 gotcha). `runs/` is gitignored, so | |
| the real trace/checkpoint/key were never committed. | |
| ## Done (CP8 — predator agentness eval + auto-GIF; Stages 0/1/2/4 implemented this session) | |
| Design: `docs/superpowers/specs/2026-06-02-proteus-predator-agentness-eval-design.md` (user-authored). | |
| Plan: `docs/superpowers/plans/2026-06-02-proteus-cp8-agentness-eval.md`. **TDD-ready Stages 0, 1, 2, 4 | |
| are complete** (additive, 152→175 tests, full suite green). Stages 3 & 5 remain DESIGN-GATED (below). | |
| Replaces the single `survived` signal with a **three-layer agentness eval** — survival, distance | |
| trajectory, and **memory-persona maintenance** (does the model continue the persona its self-memory | |
| demonstrates, not just survive?). `proteus replay`/`compare` surface every new metric automatically. | |
| - **Stage 0 — auto-GIF** ✅ `proteus/game/viz/gif.py:write_gif` (Pillow lazy) + `run`/`play` auto-render | |
| `<out>.gif` next to the trace (default on, `--no-gif`). Verified: 5-frame 256×256 animated GIF. | |
| - **Stage 1 — metric-only** ✅ per-turn `post_focal_pos`/`post_predator_pos`/`pre_bfs_distance`/ | |
| `post_bfs_distance`/`agent_distance_delta` on `TurnTrace`; scenario `max_bfs_distance`/ | |
| `agent_distance_delta` (predator_evade impls; ABC `None` defaults); episode metrics | |
| `time_to_capture`/`distance_auc`/`min_distance`/`near_capture_count`; episode `turn_order` | |
| (`focal_then_predator`)/`capture_rule` (`same_cell`)/`horizon`. `away_move_fraction` kept (already | |
| pre-predator-based via `step_reward`). Mirrored in `_session_core.make_turn_trace` (web path). | |
| - **Stage 2 — persona reference** ✅ `game/metrics/persona.py`: hidden `PersonaWeights` + `R_w` | |
| (`reward_rw`) + `reference_actions` (argmax set, ties→all) + `pressure`; built-ins `risk_averse`/ | |
| `risk_seeking`/`survival_optimal`. Per-turn `reference_actions`/`reference_reward`/`model_reward`/ | |
| `reward_regret`/`pressure` + episode `persona_weight_id`; metrics `action_agreement`/`reward_regret`/ | |
| `pressure_weighted_agreement`/`persona_drift_turn` (present only when a persona ran — additive). | |
| `proteus run --persona <id>`; **weights never serialized / never in the prompt** (verified no leak). | |
| - **Stage 4 — hidden-weight memory** ✅ `generate_memory(persona=)` → deterministic persona | |
| *demonstration* (reference policy plays, `agent` may be `None`); `MemoryCheckpoint.persona_weight_id` | |
| (public id only). `proteus memory --persona <id>`; `run --memory <ckpt> --persona <id>` scores | |
| whether the model continues the demonstrated persona. Weight-agnostic transparent brief (no leakage). | |
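The Stage 2 reference set ("argmax set, ties→all") can be illustrated as follows. This is a hedged sketch only: it takes per-action rewards as a plain dict instead of computing `R_w` from `PersonaWeights`, the tolerance parameter is an added assumption, and `reward_regret` is shown as "best reward minus chosen reward", which the handoff notes do not spell out.

```python
def reference_actions(rewards: dict[str, float], tol: float = 1e-9) -> set[str]:
    """Return every action whose reward ties the maximum (within tol)."""
    best = max(rewards.values())
    return {action for action, reward in rewards.items() if best - reward <= tol}


def reward_regret(rewards: dict[str, float], chosen: str) -> float:
    """Assumed definition: how much reward the chosen action gave up."""
    return max(rewards.values()) - rewards[chosen]
```

Keeping ties as a set means a model is not penalized for picking either of two equally good moves when agreement is scored against the reference.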
| **Still TODO — DESIGN-GATED (each needs `superpowers:brainstorming` + sub-spec BEFORE coding; do not | |
| fabricate from the plan):** | |
| - **Stage 3 — simultaneous resolver** (DESIGN-GATED): `plan→resolve` engine turn + crossing capture; | |
| changes capture goldens → own brainstorm+sub-spec; keep `focal_then_predator` selectable via a | |
| `--turn-order` flag (default unchanged). Adds per-turn `same_cell_capture`/`crossing_capture`/`captured` | |
| flags + refines `time_to_capture` to the real captured turn. See plan Stage 3 for the 4 open decisions. | |
| - **Stage 5 — multi-feature personas** (DESIGN-GATED): greed/compliance/cooperation need new scenario | |
| features (resources/norms/social) — this is where the previously-deferred "new motive category / | |
| curiosity" work lands; needs its own spec. Predator-only must NOT over-interpret those personas (spec §9). | |
| `PersonaWeights` already stubs `resource_reward`/`norm_cost`/`social_weight` for this. | |
| The `VanillaAgent._ACTION_DIRECTIVE` / `SessionRunner._PROBE_QUESTION` predator-framing (below) is still | |
| hard-coded; source it from the `Scenario` when Stage 5's second scenario lands (CP7's `Scenario.memory_brief` | |
| is the pattern). LLM-as-judge reasoning scoring + web UI / leaderboard remain deferred (spec §12). | |
| Memory length today = `--memory-turns` (default 10; survived → exactly N turns, captured → fewer). | |
| ## Done (pack_evade scenario — 64×64 open-field multi-cell evasion; plan: `docs/superpowers/plans/2026-06-02-proteus-pack-evade-scenario.md`; design: `docs/superpowers/specs/2026-06-02-proteus-pack-evade-scenario-design.md`) | |
| Executed task-by-task with strict TDD in an isolated worktree branched from CP7/CP8 master. | |
| **Full suite: 206 passed** (193 baseline + 13 new: 7 scenario behaviour + 2 footprint bounds + | |
| 2 render_frame + 2 manual-memory). Offline, headless; zero new dependencies. | |
| - **New `pack_evade` scenario** (`proteus/game/scenarios/pack_evade.py`, registered alongside the | |
| untouched `predator_evade`): a 64×64 **open field, no walls**; multi-cell sprites (predator **5×5**, | |
| focal **3×3**); **eat = footprint AABB overlap** (not adjacency); **center-to-center Manhattan** | |
| geometry — **no BFS / no O(N²)** on 64×64 (analytic max distance = `(64-1)+(64-1) = 126`). Pure | |
| evasion: `habit_action == optimal_action` always, so **no diagnostic** (`is_diagnostic` is False). | |
| - **Engine stays single-focal.** The 3-prey pack hunt (two caught) lives **only** in the | |
| hand-authored handover memory; the live engine gained one additive change: **footprint-aware | |
| focal bounds** in `MotiveGridGame.step()` (`_footprint_in_bounds`) so a multi-cell focal keeps its | |
| full footprint on-grid. The **1×1 path is byte-identical** (regression-guarded) — `predator_evade` | |
| unchanged. | |
| - **`Scenario.render_frame(game)` hook** (`game/scenarios/base.py` default = full ASCII; `_session_core.render_ascii` | |
| delegates). `pack_evade` overrides it with a **compact one-line coordinate observation** (no | |
| 4096-char map); `predator_evade`'s observation/cut-frames are **byte-identical** to before. | |
| - **`Scenario.default_memory(seed, difficulty)` hook** (default `None`) + wiring: `BuiltSession.default_memory` | |
| is computed in `build_session`; `session.py`/`interactive.py` resolve `self._memory or built.default_memory`, | |
| so an explicit `memory=` still overrides. `pack_evade.default_memory` generates the handover memory | |
| from a **hidden persona weight vector** (`get_persona("risk_averse")`) via the existing CP7/CP8 | |
| `memory_gen.generate_memory(..., persona=...)` in its **reference-policy** mode — the hidden policy | |
| actually plays a pack_evade episode and that self-play trajectory is the memory (per the agentness | |
| design, `docs/agentness_game_design_from_paper.md` §3-4). Only the public `persona_weight_id` is | |
| recorded; the raw weights never leak. Two compat shims (`_is_free`=in-bounds, `_bfs_distance`=Manhattan, | |
| since the field is wall-free) let the shared persona policy run on the open field; `generate_memory` | |
| now renders frames via `scenario.render_frame` (compact on pack_evade, ASCII-identical on | |
| predator_evade). The CP7 LLM self-play `memory_gen` path is otherwise **untouched**. | |
| - **Trace schema-/value-compatible** with the existing tooling: verified end-to-end with | |
| `proteus run --scenario pack_evade --model fake:demo` → `proteus replay` (per-turn action vs | |
| optimal, habit==optimal, survival metrics) → `proteus compare` (`model=demo difficulty=easy n=1`). | |
| - **Real-LLM smoke (passed):** `ollama:gpt-oss:120b-cloud` on `pack_evade` — both the CLI `run` | |
| (4 turns, survived) and the web **Spectate** path (`/spectate`→`/next`×N, live ~3–6.6k-char | |
| reasoning per turn, survived). The model reads the compact 64×64 observation and the injected | |
| persona memory and produces valid actions. The smoke ran from the throwaway | |
| `/tmp/proteus-smoke-venv` (httpx only), leaving the SDK-free `.venv` untouched, with `PYTHONPATH=<repo>` | |
| and a cleaned `OLLAMA_API_KEY` (first whitespace token of the `.env` value; see the CP4 smoke gotcha). | |
| `--no-gif` was passed because the smoke venv has no matplotlib. | |
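The `pack_evade` geometry described above (footprint AABB overlap for capture, center-to-center Manhattan distance, footprint-aware bounds) can be sketched as follows. This is an illustration under assumptions, not the contents of `pack_evade.py` or `MotiveGridGame`: the `Sprite` class and function names are hypothetical, and sprites are assumed to be odd-sized squares centered on an integer cell.

```python
from dataclasses import dataclass
from typing import Tuple

GRID = 64  # open field, no walls


@dataclass(frozen=True)
class Sprite:
    """Hypothetical square sprite: center cell plus odd edge length (3 or 5)."""
    cx: int
    cy: int
    size: int

    def bounds(self) -> Tuple[int, int, int, int]:
        """Inclusive (x0, y0, x1, y1) footprint around the center."""
        half = self.size // 2
        return (self.cx - half, self.cy - half, self.cx + half, self.cy + half)


def footprint_in_bounds(s: Sprite, grid: int = GRID) -> bool:
    """True when the whole footprint, not just the center, is on the grid."""
    x0, y0, x1, y1 = s.bounds()
    return x0 >= 0 and y0 >= 0 and x1 < grid and y1 < grid


def overlaps(a: Sprite, b: Sprite) -> bool:
    """Axis-aligned bounding-box overlap: sharing at least one cell counts."""
    ax0, ay0, ax1, ay1 = a.bounds()
    bx0, by0, bx1, by1 = b.bounds()
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1


def manhattan(a: Sprite, b: Sprite) -> int:
    """Center-to-center Manhattan distance; O(1), no BFS needed without walls."""
    return abs(a.cx - b.cx) + abs(a.cy - b.cy)


if __name__ == "__main__":
    predator = Sprite(10, 10, 5)  # covers x 8..12
    focal = Sprite(13, 10, 3)     # covers x 12..14, shares column 12
    print(overlaps(predator, focal), manhattan(predator, focal))
    print((GRID - 1) + (GRID - 1))  # analytic max distance used for normalisation
```

Because the field has no walls, the Manhattan distance equals the shortest-path distance, which is why a BFS over the 64×64 grid is unnecessary.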
| **Merge note for master:** this branch changed five shared files — `proteus/game/scenarios/base.py` | |
| (two additive ABC defaults), `proteus/game/engine/grid.py` (footprint bounds), `proteus/game/runtime/_session_core.py` | |
| (`render_ascii` delegate + `BuiltSession.default_memory`), `proteus/game/runtime/session.py` and | |
| `proteus/game/runtime/interactive.py` (effective-memory fallback). All changes are additive/back-compatible; | |
| reconcile with any concurrent CP7/CP8 edits to those files at merge time. | |
| ## Deferred items (carry forward) | |
| - **local/mlx/cuda providers** un-ported (need local server / Apple-silicon / CUDA; out of CP4 scope). | |
| - `VanillaAgent._ACTION_DIRECTIVE` and `SessionRunner._PROBE_QUESTION` hard-code the "predator" | |
| framing — fine for the single-scenario slice; source these from the `Scenario` when CP6+ adds scenarios. | |
| - `SessionRunner._provider_model_name` reaches into `agent._provider` by naming convention; consider an | |
| explicit `model_name` on the `Agent` base when more agent types land. (CP5's `HumanAgent` has no | |
| `_provider`, so `model` correctly falls back to `agent.name == "human"`.) | |
| - reasoning/probe are recorded but **not scored** (LLM-as-judge deferred per spec §12). | |
| - ~~**`Scenario.record_focal_move` is not on the ABC**~~ — **RESOLVED (CP6 Task 1):** promoted to a | |
| non-abstract no-op default on `Scenario` (alongside `safety_distance`), so multi-scenario callers | |
| never hit `AttributeError`. | |
| - **CLI `--png` without the `viz` extra** raises a raw `ModuleNotFoundError` traceback instead of a clean | |
| rc=2 (every other expected CLI failure returns a structured code). Not reachable today (matplotlib is in | |
| `.venv` and no packaging `extras` are defined yet). Add a `try/except ImportError` guard + companion | |
| test when packaging/extras land. Apply the same pattern to surface `TraceReconstructionError` from `replay` | |
| on a corrupt trace, and add a help note that `--fps` is terminal-only. | |
| - **`viz.reconstruct` terminal flag with empty `trace.turns`**: if a trace had zero played turns but a | |
| non-empty `outcome`, the final (cut) frame would show no terminal annotation. Not reachable today | |
| (`SessionRunner` guards an empty play phase with a `RuntimeError`); add a guard/comment if the trace | |
| format ever permits empty-turns-with-outcome. | |
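The `try/except ImportError` guard proposed for `--png` could look roughly like the sketch below. The function name, its parameter, and the error message are hypothetical; only the idea (catch the missing optional dependency and return a structured rc=2 instead of a traceback) comes from the item above.

```python
import importlib
import sys


def require_optional(module_name: str, feature: str) -> int:
    """Return 0 if the optional module imports, else print a hint and return 2.

    Hypothetical helper: a real CLI command would call this before rendering
    and propagate the non-zero code as its exit status.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        print(
            f"error: {feature} needs the optional '{module_name}' package, "
            "which is not installed",
            file=sys.stderr,
        )
        return 2
    return 0


if __name__ == "__main__":
    # A stdlib module always imports; a made-up name exercises the error path.
    print(require_optional("json", "--png"))
    print(require_optional("surely_not_an_installed_module_xyz", "--png"))
```

A companion test would assert the return code for a module name known to be absent, which avoids having to uninstall matplotlib in the test environment.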
| ## Open questions | |
| - Package/project README rewrite (replace the deprecated persona-arena `depreated/README.MD`). | |
| - ~~Trajectory metric: `first_divergence_turn` is a coarse proxy~~ — **RESOLVED (CP6):** added | |
| `trajectory_agreement` (position step-agreement vs the optimal rollout) + `final_distance_gap` (spec | |
| §7 "trajectory agreement"). `first_divergence_turn` is **kept additively**; it is still open whether to retire it once the new | |
| pair proves sufficient (CP6 design §8). Also open (CP6 design §5.2): exact reward grading (delta vs ±1) | |
| is pinned by the property test; revisit if a future category needs a different shape. | |
| - If a 3rd LLM-call result type appears (e.g. a reflect call), extract a shared `_LLMCallResult` | |
| base for the common `reasoning/raw_text/{input,output,thinking}_tokens` fields (ActResult/ProbeResult | |
| currently duplicate them — acceptable for two). | |
| - **Probe-vs-act-reasoning redundancy** (deferred from CP5, spec-change territory): both `act` and | |
| `probe` capture reasoning per turn; whether the per-turn probe is worth its cost vs. reusing act | |
| reasoning is a measurement-design question for a spec revision, out of CP5 scope. | |
| - **Optional live-color human play**: CP5 keeps the human's live view as plain ASCII (fairness — the | |
| human reads exactly what the LLM reads). A truecolor live mode for human play (distinct from post-hoc | |
| `replay --visual`) is possible but intentionally deferred to preserve baseline fairness. | |