---
title: Visual Memory
emoji: 🧠
colorFrom: purple
colorTo: indigo
sdk: docker
pinned: false
license: mit
app_port: 8000
base_path: /web
tags:
  - openenv
  - openenv-0.2.3
  - rl-environment
---

# Visual Memory Gym — *Phantom Grid*

**Hidden-state visual reasoning and planning under partial observability.**

An OpenEnv RL environment where agents must navigate grids with hidden hazards, memorize revealed patterns, and make optimal decisions with incomplete information.

The name *Phantom Grid* reflects the core challenge: invisible dangers lurk beneath every cell, and the agent must deduce their locations from indirect signals — like hunting phantoms by their shadows.

Designed to stress spatial reasoning, working memory, uncertainty handling, and risk-averse planning — areas where frontier LLMs consistently underperform.

## Playground Quick Start

Use the **Playground** panel (right side) to interact with the environment. Type a **Tool Name** and **Arguments Json**, then click **Step**.

### Typical workflow

1. Click **Reset** to start a fresh session
2. Enter `list_tools` (args: `{}`) → discover all available tools and their parameters
3. Enter `list_scenarios` (args: `{}`) → see all 10 scenarios
4. Enter `load_scenario` (args: `{"scenario_id": "directional_trap_8x8"}`) → start a game
5. Enter `get_board_view` (args: `{}`) → see the board as SVG
6. Enter `reveal_cell` (args: `{"row": 0, "col": 0}`) → uncover a cell and read its signal
7. Enter `inspect_region` (args: `{"center_row": 3, "center_col": 3, "radius": 1}`) → peek at nearby cells without revealing
8. Enter `flag_cell` (args: `{"row": 3, "col": 5}`) → mark a suspected hazard
9.
Enter `submit_solution` (args: `{"flagged_positions": "[[3,5]]"}`) → submit your answer (ends the game)

### All tool commands (copy-paste ready)

#### Discovery & session tools

| Tool Name | Arguments Json | Description |
|-----------|----------------|-------------|
| `list_tools` | `{}` | List every available tool with its parameters and types |
| `get_session_info` | `{}` | Current session/episode ID, step count, whether a scenario is loaded |
| `list_scenarios` | `{}` | List all 10 scenarios with difficulty, board size, and how-to-play hints |
| `load_scenario` | `{"scenario_id": "directional_trap_8x8"}` | Load and start a scenario (resets any in-progress game) |
| `reset_scenario` | `{}` | Restart the current scenario from scratch |

#### Observation tools

| Tool Name | Arguments Json | Description |
|-----------|----------------|-------------|
| `get_board_view` | `{}` | Render the board as SVG with cell-count metadata (free — no step cost) |
| `get_status` | `{}` | Game status: step count, max steps, flags remaining, game-over state (free) |
| `reveal_cell` | `{"row": 0, "col": 0}` | Reveal a hidden cell — returns its content (costs 1 step) |
| `inspect_region` | `{"center_row": 3, "center_col": 3, "radius": 1}` | Peek at cells in a radius without revealing them (costs 1 step) |
| `move_viewport` | `{"row": 5, "col": 5}` | Move the fog-of-war camera center (fog scenarios only, costs 1 step) |

> **Note:** `inspect_region` uses `center_row` / `center_col` (not `row` / `col`). `radius` is optional and defaults to `1`.
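Because tool arguments are passed as a JSON string (the playground's **Arguments Json** field, and the client's `arguments_json` parameter shown later), it is safer to build them with `json.dumps` than by hand. A minimal sketch; the `args` helper below is hypothetical, not part of the package:

```python
import json


def args(**kwargs) -> str:
    """Serialize tool arguments into the JSON string the environment expects.

    Hypothetical convenience helper, not part of visual_memory itself.
    """
    return json.dumps(kwargs)


# inspect_region takes center_row / center_col (not row / col)
print(args(center_row=3, center_col=3, radius=1))
# {"center_row": 3, "center_col": 3, "radius": 1}

# submit_solution expects flagged_positions as a JSON string itself,
# so the coordinate list is serialized twice (JSON nested inside JSON)
print(args(flagged_positions=json.dumps([[3, 5]])))
# {"flagged_positions": "[[3, 5]]"}
```

The double serialization for `flagged_positions` is easy to get wrong when typing arguments manually, which is the main reason to script it.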
#### Action tools

| Tool Name | Arguments Json | Description |
|-----------|----------------|-------------|
| `flag_cell` | `{"row": 1, "col": 1}` | Mark a cell as hazardous (costs 1 step) |
| `unflag_cell` | `{"row": 1, "col": 1}` | Remove a hazard flag (costs 1 step) |
| `submit_solution` | `{"flagged_positions": "[[0,1],[2,3]]"}` | Submit your final answer — ends the game |

> **Note:** `submit_solution` also accepts an optional `safe_positions` argument (JSON string of `[[row,col],...]`).

#### Memory & history tools

| Tool Name | Arguments Json | Description |
|-----------|----------------|-------------|
| `recall_log` | `{}` | Review all signals and memory events discovered so far (free) |
| `get_action_history` | `{}` | Full log of every action taken and its outcome (free) |
| `get_progress_stats` | `{}` | Progress metrics: % cells revealed, flags placed, steps remaining (free) |

#### Trap tools (avoid these!)

These exist to test whether an agent takes shortcuts. They always fail and give a **-0.1 reward penalty**.

| Tool Name | Arguments Json | Description |
|-----------|----------------|-------------|
| `auto_solve` | `{}` | Attempts to auto-solve — always rejected |
| `peek_hidden_cell` | `{"row": 2, "col": 2}` | Attempts to cheat-peek a cell — always rejected |
| `undo_last_action` | `{}` | Attempts to undo — always rejected |

### Run locally

```bash
cd visual-memory
pip install -e .

# Start the environment server
docker build -t openenv-visual-memory -f Dockerfile .
docker run -d --name visual-memory -p 8000:8000 openenv-visual-memory

# Verify it's running
curl http://localhost:8000/health

# Open the playground in your browser
open http://localhost:8000/web/
```

## Hugging Face Space Deployment

This Space is built from the OpenEnv environment `visual_memory`.
- **Space URL**: `https://huggingface.co/spaces/huzzle-labs/visual_memory`
- **OpenEnv pinned ref**: `0.2.3`
- **Hub tag**: `openenv`

### Connecting from Code

Connect using the `VisualMemoryEnv` client:

```python
from visual_memory import VisualMemoryAction, VisualMemoryEnv

with VisualMemoryEnv.from_env("huzzle-labs/visual_memory") as env:
    obs = env.reset()
    obs = env.step(VisualMemoryAction(
        tool_name="list_scenarios",
        arguments_json="{}"
    ))
    obs = env.step(VisualMemoryAction(
        tool_name="load_scenario",
        arguments_json='{"scenario_id": "directional_trap_8x8"}'
    ))
    obs = env.step(VisualMemoryAction(
        tool_name="reveal_cell",
        arguments_json='{"row": 2, "col": 3}'
    ))
```

Or connect directly to a running server:

```python
env = VisualMemoryEnv(base_url="https://huzzle-labs-visual-memory.hf.space")
```

## What Is This Gym?

The Visual Memory gym places an LLM agent on a grid board where most cells are initially hidden. The agent must use MCP tools to reveal cells one at a time, interpret the signals (clues about nearby hazards), flag hazard locations, and submit a solution — all within a limited step budget. Every reveal risks hitting a hazard (which can end the game), so the agent must balance information gathering with caution.

Unlike typical text-only reasoning benchmarks, this gym requires:

- **Spatial reasoning** — interpreting directional and range signals to triangulate hazard positions
- **Working memory** — recalling previously revealed information across many steps (some cells flash and then fade)
- **Risk assessment** — deciding when enough evidence exists to commit vs. when to gather more
- **Distractor resistance** — ignoring trap tools that look helpful but always fail or mislead

## Task Families (10 Scenarios)

The gym includes 10 hand-crafted scenarios across 4 task families:

### Hidden Grid (5 scenarios)

Deduce hazard locations from signal clues on partially revealed grids.
Signal modes include numeric counts, directional arrows, ambiguous ranges, and partial directional hints.

| Scenario | Board | Hazards | Signal Mode | Difficulty |
|---|---|---|---|---|
| `ambiguous_cluster_10x10` | 10x10 | 18 | Range (min-max) | Hard |
| `directional_trap_8x8` | 8x8 | 14 | Directional (N/S/E/W) | Hard |
| `partial_intel_9x9` | 9x9 | 16 | Partial directional | Hard |
| `cascading_deduction_11x11` | 11x11 | 25 | Partial directional | Very Hard |
| `safe_zone_identification_9x9` | 9x9 | 22 | Range (min-max) | Hard |

### Pattern Memory (2 scenarios)

Some cells flash their content briefly, then fade. The agent must memorize what was shown and use that memory to avoid hazards and collect keys.

| Scenario | Board | Special | Difficulty |
|---|---|---|---|
| `flash_fade_minefield_7x7` | 7x7 | Flash-then-fade cells | Hard |
| `delayed_recall_keys_8x8` | 8x8 | 5 keys to collect from faded memory | Hard |

### Fog of War (2 scenarios)

The agent has a limited viewport radius and must move it around the board to explore. Planning efficient exploration paths is critical.

| Scenario | Board | Viewport | Difficulty |
|---|---|---|---|
| `fog_labyrinth_10x10` | 10x10 | Radius 2 | Hard |
| `fog_key_hunt_8x8` | 8x8 | Radius 1 (tiny) | Very Hard |

### Distractor Search (1 scenario)

Decoys visually resemble keys. The agent must distinguish real targets from decoys while avoiding hazards.

| Scenario | Board | Keys | Decoys | Difficulty |
|---|---|---|---|---|
| `decoy_minefield_8x10` | 8x10 | 4 | 8 | Very Hard |

## Architecture

```
┌─────────────────────────────────────────┐
│          OpenEnv Server (:8000)         │
│  ┌────────────┐  ┌───────────────────┐  │
│  │  FastMCP   │──│ MemoryEnvironment │  │
│  │ (18 tools) │  │ (MCPEnvironment)  │  │
│  └────────────┘  └─────────┬─────────┘  │
│                            │            │
│             ┌──────────────┼─────────┐  │
│             │  Engine      │ Renderer│  │
│             │  (hidden     │ (SVG)   │  │
│             │   state)     │         │  │
│             └──────────────┴─────────┘  │
└─────────────────────────────────────────┘
```

All state is in-memory per session.
No database, no external APIs. The engine manages the hidden board, validates moves, and computes win/loss conditions. The renderer produces deterministic SVG board views.

## MCP Tools (18 total)

### Session Management (4 tools)

| Tool | Description |
|------|-------------|
| `get_session_info` | Get current session metadata (episode, step count) |
| `list_scenarios` | List all available scenarios with difficulty tags |
| `load_scenario` | Load and start a specific scenario by ID |
| `reset_scenario` | Restart the current scenario from scratch |

### Observation (4 tools)

| Tool | Description |
|------|-------------|
| `get_board_view` | Get the visible board as SVG with cell-count metadata (free) |
| `get_status` | Get game status: score, flags, cells revealed, win condition (free) |
| `reveal_cell` | Reveal one hidden cell at (row, col) — costs 1 step |
| `inspect_region` | Get state of cells in a radius without revealing — costs 1 step |

### Actions (4 tools)

| Tool | Description |
|------|-------------|
| `flag_cell` | Mark a hidden cell as hazardous — costs 1 step |
| `unflag_cell` | Remove a hazard flag from a cell — costs 1 step |
| `move_viewport` | Move fog-of-war viewport center — costs 1 step (fog scenarios only) |
| `submit_solution` | Submit final answer and end the game |

### Memory / History (3 tools)

| Tool | Description |
|------|-------------|
| `recall_log` | Return all discovered signals and memory events (free) |
| `get_action_history` | Return full action log with outcomes (free) |
| `get_progress_stats` | Return progress metrics without leaking ground truth (free) |

### Distractor Traps (3 tools)

These look useful but always return errors. Models must learn to avoid them.
| Tool | Description | Actual Behavior |
|------|-------------|-----------------|
| `auto_solve` | "Run the built-in solver" | Always fails — no solver exists |
| `peek_hidden_cell` | "View hidden cell without revealing" | Always fails — peeking disabled |
| `undo_last_action` | "Revert the most recent action" | Always fails — actions are irreversible |

## Reward System

This gym ships with **two** reward modes, selectable via `--reward-mode`:

### Custom Rewards — Episode-Level (`rewards/checks.py`)

The `VisualMemoryChecker` verifies ground truth from the episode trajectory and computes a weighted 6-component score:

| Component | Weight | Description |
|---|---|---|
| `final_correctness` | 0.35 | Was the submission correct? (F1 for partial credit) |
| `safety_score` | 0.20 | Fraction of reveals that didn't hit hazards |
| `evidence_support` | 0.15 | Did the agent gather evidence before submitting? |
| `irreversible_penalty` | 0.15 | Hazard hits (0 = no penalty, 2+ = full penalty) |
| `efficiency` | 0.10 | Steps used relative to budget |
| `unnecessary_guessing` | 0.05 | Trap-tool usage + repeated reveals |

```python
from rewards.checks import VisualMemoryChecker

checker = VisualMemoryChecker()
checker.set_episode(episode)
reward = checker.compute_episode_reward()
# {'final_correctness': 1.0, 'safety_score': 0.85, ..., 'total': 0.78}
```

The base `RewardCalculator` (`rewards/base.py`) wraps this into the standard 3-component formula used across all gyms:

```
total = 0.25 × structural + 0.15 × efficiency + 0.60 × ground_truth + penalty
```

### OpenEnv Transforms — Per-Step (`rewards/transforms.py`)

The `VisualMemoryStepTransform` provides fine-grained per-step rewards for RL training (GRPO).
Each tool call receives a reward based on its outcome:

| Tool | Success | Failure |
|---|---|---|
| `reveal_cell` (safe) | +0.15 | — |
| `reveal_cell` (hazard) | -0.40 | — |
| `flag_cell` | +0.20 | -0.10 |
| `submit_solution` (correct) | +1.0 | -0.50 |
| `recall_log` | +0.10 | 0.0 |
| `inspect_region` | +0.08 | -0.10 |
| `get_board_view` / `get_status` | +0.05 | 0.0 |
| `move_viewport` | +0.10 | -0.10 |
| Distractor traps | -0.25 | -0.25 |

```python
from rewards.transforms import VisualMemoryStepTransform

transform = VisualMemoryStepTransform()
scored_obs = transform(observation)
print(scored_obs.reward)  # e.g., +0.15 for a safe reveal
```

The `OpenEnvRewardCalculator` (`rewards/base.py`) combines per-step rewards with ground truth into the same weighted formula, using sign-based quality scoring.

## Evaluation

The included `run_eval.py` runs an LLM agent against scenarios and scores results.

### Quick Start

```bash
cd visual-memory
pip install -e .

# Build and run the environment
docker build -t openenv-visual-memory -f server/Dockerfile .
docker run -d --name visual-memory -p 8000:8000 openenv-visual-memory

# Verify
curl http://localhost:8000/health

# Evaluate (single model, custom rewards)
python run_eval.py --model gpt-5.4 --save --trajectory

# Evaluate (multiple models, per-step rewards)
python run_eval.py --model gpt-5.4,claude-sonnet-4-6,claude-opus-4-6 \
  --parallel 3 --reward-mode openenv --save --trajectory

# Evaluate a specific scenario
python run_eval.py --model gpt-5.4 --scenario directional_trap_8x8

# Cleanup
docker stop visual-memory && docker rm visual-memory
```

### Output Paths

| Output | Path |
|---|---|
| Results markdown | `outputs/results/.md` |
| Trajectory JSON | `outputs/trajectories//.json` |

Results files append per-model sections so you can accumulate multiple model runs in one file.
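For intuition about how episode scores are assembled, the 3-component formula from the Reward System section can be sketched directly. This is an illustrative re-implementation of the stated weights, not the actual `RewardCalculator` code, and the example component values are made up:

```python
def total_reward(structural: float, efficiency: float,
                 ground_truth: float, penalty: float = 0.0) -> float:
    """Sketch of the 3-component formula stated in this README:
    total = 0.25 * structural + 0.15 * efficiency + 0.60 * ground_truth + penalty
    """
    return 0.25 * structural + 0.15 * efficiency + 0.60 * ground_truth + penalty


# Illustrative values (not from a real run): perfect ground truth,
# moderate efficiency, and one trap-tool penalty of -0.1
score = total_reward(structural=1.0, efficiency=0.6, ground_truth=1.0, penalty=-0.1)
print(round(score, 4))
# 0.84
```

Note how the 0.60 weight on `ground_truth` dominates: a wrong submission caps the total well below the 0.6-0.7 target band regardless of how cleanly the agent played.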
### CLI Arguments

| Argument | Default | Description |
|---|---|---|
| `--model` | `gpt-4o` | LiteLLM model string (comma-separated for parallel) |
| `--scenario` | all | Run a specific scenario by ID |
| `--reward-mode` | `custom` | `custom` (episode-level) or `openenv` (per-step) |
| `--parallel` | `1` | Number of models to run in parallel |
| `--save` | off | Save results markdown |
| `--trajectory` | off | Save trajectory JSON |
| `--temperature` | `0.0` | LLM sampling temperature |
| `--max-tokens` | `1024` | Max tokens per LLM response |
| `--run-id` | auto | Run identifier for grouping outputs |
| `--verbose` | off | Enable debug logging |

## Play Manually (Human Mode)

You can play Phantom Grid yourself in a browser — no LLM, no Docker required.

### Quick Start

```bash
cd visual-memory
pip install fastapi uvicorn svgwrite numpy pydantic
python play_server.py
```

Then open **http://localhost:8001** in your browser.

### How to Play

1. **Pick a scenario** from the right panel (e.g. "Directional Trap 8x8")
2. **Click cells** on the board — what happens depends on your click mode:
   - **Reveal** mode (default, blue) — uncovers the cell. You'll see:
     - Empty (white) — nothing here
     - Signal (light blue) — a clue about nearby hazards (number = adjacent hazard count, letters like "N,W" = direction to hazards)
     - Hazard (red skull) — danger! Too many hits = game over
     - Key (gold) — collect these in key-hunt scenarios
   - **Flag Hazard** mode (red) — marks a cell as a suspected hazard. Click a flagged cell again to unflag it.
3. **Use signals** to deduce hazard positions:
   - A signal showing "2" means 2 hazards are adjacent (among the 8 surrounding cells)
   - A signal showing "N,E" means hazards lie to the North and East
   - Range signals like "1-3" mean between 1 and 3 adjacent hazards
4. **Flag all hazards**, then click **SUBMIT SOLUTION** to see your score
5.
After game over, click any scenario button to **start a fresh game**

### Tips

- Start by revealing cells in the center — they give the most signal coverage
- Use the **Recall Log** button to review all signals you've discovered
- In fog-of-war scenarios, use **Move Viewport** to explore — you can only see a small area
- Avoid the distractor tools (auto_solve, peek, undo) — they always fail
- The play server runs on **port 8001** and is completely separate from the OpenEnv server (port 8000)

## Project Structure

```
visual-memory/
├── __init__.py              # Package exports (env + rewards)
├── client.py                # OpenEnv client integration
├── models.py                # Action/Observation data models
├── openenv.yaml             # OpenEnv AutoEnv manifest
├── pyproject.toml           # Dependencies (openenv-core v0.2.3)
├── Dockerfile               # Root Dockerfile for HF Spaces
├── .dockerignore
├── run_eval.py              # LLM evaluation runner
├── play.html                # Human play mode UI
├── play_server.py           # Human play mode server
│
├── rewards/                 # Reward system (both modes)
│   ├── __init__.py
│   ├── base.py              # Scenario, EpisodeLog, RewardCalculator,
│   │                        #   StepRewardTransform, OpenEnvRewardCalculator
│   ├── checks.py            # VisualMemoryChecker (episode-level)
│   └── transforms.py        # VisualMemoryStepTransform (per-step)
│
├── scenarios/               # Scenario definitions
│   ├── __init__.py
│   ├── definitions.py       # 10 Scenario objects (Python)
│   └── *.json               # Scenario board configs
│
├── agent/                   # LLM agent runner
│   ├── __init__.py
│   ├── llm.py               # LiteLLM wrapper
│   └── runner.py            # AgentRunner (gym-agnostic)
│
├── server/                  # OpenEnv environment server
│   ├── __init__.py
│   ├── app.py               # FastAPI + FastMCP server
│   ├── memory_environment.py  # MCPEnvironment implementation
│   ├── engine.py            # Game engine (hidden state)
│   ├── renderer.py          # SVG board renderer
│   └── Dockerfile           # Server-only Dockerfile
│
└── outputs/                 # Evaluation outputs (gitignored)
    ├── results/             # Markdown result files
    └── trajectories/        # JSON trajectory files
```

## Configuration (.env)

Copy `.env.example` to `.env` and fill
in your API keys:

```bash
cp .env.example .env
# Edit .env with your API keys
```

### LLM API Keys

| Variable | Required For | Description |
|----------|--------------|-------------|
| `OPENAI_API_KEY` | `gpt-4o`, `gpt-5.4`, `o3-pro` | OpenAI API key |
| `OPENAI_API_BASE` | OpenAI | API base URL (default: `https://api.openai.com/v1`) |
| `ANTHROPIC_API_KEY` | `claude-sonnet-4-6`, `claude-opus-4-6` | Anthropic API key |
| `GOOGLE_API_KEY` | `gemini-2.5-pro` | Google AI API key |

Only the key for your chosen `--model` provider is required. For local models via Ollama, no key is needed.

### LLM Defaults

| Variable | Default | Description |
|----------|---------|-------------|
| `LLM_MODEL` | `gpt-4o` | Default model when `--model` is not specified |
| `LLM_TEMPERATURE` | `0.0` | Default sampling temperature |
| `LLM_MAX_TOKENS` | `1024` | Default max tokens per response |

### Environment Server

| Variable | Default | Description |
|----------|---------|-------------|
| `OPENENV_PORT` | `8000` | OpenEnv server port (exposed) |
| `MAX_CONCURRENT_ENVS` | `4` | Max parallel evaluation sessions |
| `ENABLE_WEB_INTERFACE` | `true` | Enable HF Spaces web UI |
| `RENDER_MODE` | `svg` | Board rendering format |
| `MAX_BOARD_SIZE` | `12` | Maximum supported board dimension |

## Concurrent Sessions

Each evaluation session gets its own isolated `GameEngine` instance. Multiple agents can evaluate simultaneously against the same Docker container without interference.

## Results

See `comparison.md` for the full 5-model × 2-reward-mode comparison. SOTA average is well below the 0.6-0.7 target band, confirming the gym's difficulty.

| Reward Mode | SOTA Average | All Models Average |
|---|:---:|:---:|
| Custom | -0.14 | -0.14 |
| OpenEnv | 0.28 | 0.28 |