---
title: Visual Memory
emoji: 🧠
colorFrom: purple
colorTo: indigo
sdk: docker
pinned: false
license: mit
app_port: 8000
base_path: /web
tags:
  - openenv
  - openenv-0.2.3
  - rl-environment
---
# Visual Memory Gym: *Phantom Grid*

**Hidden-state visual reasoning and planning under partial observability.**

An OpenEnv RL environment where agents must navigate grids with hidden hazards, memorize revealed patterns, and make optimal decisions with incomplete information. The name *Phantom Grid* reflects the core challenge: invisible dangers lurk beneath every cell, and the agent must deduce their locations from indirect signals, like hunting phantoms by their shadows. The gym is designed to stress spatial reasoning, working memory, uncertainty handling, and risk-averse planning, areas where frontier LLMs consistently underperform.
## Playground Quick Start

Use the **Playground** panel (right side) to interact with the environment. Type a **Tool Name** and **Arguments Json**, then click **Step**.

### Typical workflow

1. Click **Reset** to start a fresh session
2. Enter `list_tools` (args: `{}`) to discover all available tools and their parameters
3. Enter `list_scenarios` (args: `{}`) to see all 10 scenarios
4. Enter `load_scenario` (args: `{"scenario_id": "directional_trap_8x8"}`) to start a game
5. Enter `get_board_view` (args: `{}`) to see the board as SVG
6. Enter `reveal_cell` (args: `{"row": 0, "col": 0}`) to uncover a cell and read its signal
7. Enter `inspect_region` (args: `{"center_row": 3, "center_col": 3, "radius": 1}`) to peek at nearby cells without revealing them
8. Enter `flag_cell` (args: `{"row": 3, "col": 5}`) to mark a suspected hazard
9. Enter `submit_solution` (args: `{"flagged_positions": "[[3,5]]"}`) to submit your answer (this ends the game)
### All tool commands (copy-paste ready)

#### Discovery & session tools

| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `list_tools` | `{}` | List every available tool with its parameters and types |
| `get_session_info` | `{}` | Current session/episode ID, step count, and whether a scenario is loaded |
| `list_scenarios` | `{}` | List all 10 scenarios with difficulty, board size, and how-to-play hints |
| `load_scenario` | `{"scenario_id": "directional_trap_8x8"}` | Load and start a scenario (resets any in-progress game) |
| `reset_scenario` | `{}` | Restart the current scenario from scratch |
#### Observation tools

| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `get_board_view` | `{}` | Render the board as SVG with cell-count metadata (free; no step cost) |
| `get_status` | `{}` | Game status: step count, max steps, flags remaining, game-over state (free) |
| `reveal_cell` | `{"row": 0, "col": 0}` | Reveal a hidden cell and return its content (costs 1 step) |
| `inspect_region` | `{"center_row": 3, "center_col": 3, "radius": 1}` | Peek at cells in a radius without revealing them (costs 1 step) |
| `move_viewport` | `{"row": 5, "col": 5}` | Move the fog-of-war camera center (fog scenarios only; costs 1 step) |

> **Note:** `inspect_region` uses `center_row` / `center_col` (not `row` / `col`). `radius` is optional and defaults to `1`.
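For intuition, the region an `inspect_region` call covers can be sketched as a square (Chebyshev) neighborhood clamped to the board edges. This helper is illustrative only; the default board size and the exact clamping behavior are assumptions, not the server's implementation:

```python
def inspect_coverage(center_row, center_col, radius=1, board_size=8):
    """Cells a radius-r inspect_region call would cover, assuming a
    square (Chebyshev) neighborhood clamped to the board edges."""
    cells = []
    for r in range(max(0, center_row - radius),
                   min(board_size, center_row + radius + 1)):
        for c in range(max(0, center_col - radius),
                       min(board_size, center_col + radius + 1)):
            cells.append((r, c))
    return cells
```

With the default radius of 1, an interior center covers a 3x3 block of 9 cells, while a corner center covers only 4.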
#### Action tools

| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `flag_cell` | `{"row": 1, "col": 1}` | Mark a cell as hazardous (costs 1 step) |
| `unflag_cell` | `{"row": 1, "col": 1}` | Remove a hazard flag (costs 1 step) |
| `submit_solution` | `{"flagged_positions": "[[0,1],[2,3]]"}` | Submit your final answer and end the game |

> **Note:** `submit_solution` also accepts an optional `safe_positions` argument (a JSON string of `[[row,col],...]`).
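Because `flagged_positions` and `safe_positions` are JSON *strings* rather than plain arrays, it is easy to get the quoting wrong. A small sketch of building valid arguments (the `build_submission` helper is hypothetical, not part of the package):

```python
import json

def build_submission(flags, safe=None):
    """Build the Arguments Json for submit_solution. Note the
    double encoding: the coordinate lists are themselves JSON
    strings inside the arguments object."""
    args = {"flagged_positions": json.dumps(sorted(flags))}
    if safe:
        args["safe_positions"] = json.dumps(sorted(safe))
    return json.dumps(args)
```

For example, `build_submission([[3, 5], [0, 1]])` yields an arguments object whose `flagged_positions` value is the string `"[[0, 1], [3, 5]]"`.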
#### Memory & history tools

| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `recall_log` | `{}` | Review all signals and memory events discovered so far (free) |
| `get_action_history` | `{}` | Full log of every action taken and its outcome (free) |
| `get_progress_stats` | `{}` | Progress metrics: % of cells revealed, flags placed, steps remaining (free) |
#### Trap tools (avoid these!)

These exist to test whether an agent takes shortcuts. They always fail and incur a **-0.1 reward penalty**.

| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `auto_solve` | `{}` | Attempts to auto-solve; always rejected |
| `peek_hidden_cell` | `{"row": 2, "col": 2}` | Attempts to cheat-peek a cell; always rejected |
| `undo_last_action` | `{}` | Attempts to undo; always rejected |
### Run locally

```bash
cd visual-memory
pip install -e .

# Start the environment server
docker build -t openenv-visual-memory -f Dockerfile .
docker run -d --name visual-memory -p 8000:8000 openenv-visual-memory

# Verify it's running
curl http://localhost:8000/health

# Open the playground in your browser
open http://localhost:8000/web/
```
## Hugging Face Space Deployment

This Space is built from the OpenEnv environment `visual_memory`.

- **Space URL**: `https://huggingface.co/spaces/huzzle-labs/visual_memory`
- **OpenEnv pinned ref**: `0.2.3`
- **Hub tag**: `openenv`

### Connecting from Code

Connect using the `VisualMemoryEnv` client:
```python
from visual_memory import VisualMemoryAction, VisualMemoryEnv

with VisualMemoryEnv.from_env("huzzle-labs/visual_memory") as env:
    obs = env.reset()
    obs = env.step(VisualMemoryAction(
        tool_name="list_scenarios",
        arguments_json="{}"
    ))
    obs = env.step(VisualMemoryAction(
        tool_name="load_scenario",
        arguments_json='{"scenario_id": "directional_trap_8x8"}'
    ))
    obs = env.step(VisualMemoryAction(
        tool_name="reveal_cell",
        arguments_json='{"row": 2, "col": 3}'
    ))
```
Or connect directly to a running server:

```python
env = VisualMemoryEnv(base_url="https://huzzle-labs-visual-memory.hf.space")
```
## What Is This Gym?

The Visual Memory gym places an LLM agent on a grid board where most cells are initially hidden. The agent must use MCP tools to reveal cells one at a time, interpret the signals (clues about nearby hazards), flag hazard locations, and submit a solution, all within a limited step budget. Every reveal risks hitting a hazard (which can end the game), so the agent must balance information gathering against caution.

Unlike typical text-only reasoning benchmarks, this gym requires:

- **Spatial reasoning**: interpreting directional and range signals to triangulate hazard positions
- **Working memory**: recalling previously revealed information across many steps (some cells flash and then fade)
- **Risk assessment**: deciding when enough evidence exists to commit versus when to gather more
- **Distractor resistance**: ignoring trap tools that look helpful but always fail or mislead
## Task Families (10 Scenarios)

The gym includes 10 hand-crafted scenarios across 4 task families:

### Hidden Grid (5 scenarios)

Deduce hazard locations from signal clues on partially revealed grids. Signal modes include numeric counts, directional arrows, ambiguous ranges, and partial directional hints.

| Scenario | Board | Hazards | Signal Mode | Difficulty |
|---|---|---|---|---|
| `ambiguous_cluster_10x10` | 10x10 | 18 | Range (min-max) | Hard |
| `directional_trap_8x8` | 8x8 | 14 | Directional (N/S/E/W) | Hard |
| `partial_intel_9x9` | 9x9 | 16 | Partial directional | Hard |
| `cascading_deduction_11x11` | 11x11 | 25 | Partial directional | Very Hard |
| `safe_zone_identification_9x9` | 9x9 | 22 | Range (min-max) | Hard |
### Pattern Memory (2 scenarios)

Some cells flash their content briefly, then fade. The agent must memorize what was shown and use that memory to avoid hazards and collect keys.

| Scenario | Board | Special | Difficulty |
|---|---|---|---|
| `flash_fade_minefield_7x7` | 7x7 | Flash-then-fade cells | Hard |
| `delayed_recall_keys_8x8` | 8x8 | 5 keys to collect from faded memory | Hard |

### Fog of War (2 scenarios)

The agent has a limited viewport radius and must move it around the board to explore. Planning efficient exploration paths is critical.

| Scenario | Board | Viewport | Difficulty |
|---|---|---|---|
| `fog_labyrinth_10x10` | 10x10 | Radius 2 | Hard |
| `fog_key_hunt_8x8` | 8x8 | Radius 1 (tiny) | Very Hard |

### Distractor Search (1 scenario)

Decoys visually resemble keys. The agent must distinguish real targets from decoys while avoiding hazards.

| Scenario | Board | Keys | Decoys | Difficulty |
|---|---|---|---|---|
| `decoy_minefield_8x10` | 8x10 | 4 | 8 | Very Hard |
## Architecture

```
┌─────────────────────────────────────────┐
│         OpenEnv Server (:8000)          │
│  ┌────────────┐   ┌───────────────────┐ │
│  │  FastMCP   ├───┤ MemoryEnvironment │ │
│  │ (18 tools) │   │ (MCPEnvironment)  │ │
│  └────────────┘   └─────────┬─────────┘ │
│                             │           │
│          ┌──────────────────▼───────┐   │
│          │  Engine     │ Renderer   │   │
│          │  (hidden    │ (SVG)      │   │
│          │   state)    │            │   │
│          └─────────────┴────────────┘   │
└─────────────────────────────────────────┘
```

All state is in-memory, per session. No database, no external APIs. The engine manages the hidden board, validates moves, and computes win/loss conditions. The renderer produces deterministic SVG board views.
## MCP Tools (18 total)

### Session Management (4 tools)

| Tool | Description |
|------|-------------|
| `get_session_info` | Get current session metadata (episode, step count) |
| `list_scenarios` | List all available scenarios with difficulty tags |
| `load_scenario` | Load and start a specific scenario by ID |
| `reset_scenario` | Restart the current scenario from scratch |

### Observation (4 tools)

| Tool | Description |
|------|-------------|
| `get_board_view` | Get the visible board as SVG with cell-count metadata (free) |
| `get_status` | Get game status: score, flags, cells revealed, win condition (free) |
| `reveal_cell` | Reveal one hidden cell at (row, col); costs 1 step |
| `inspect_region` | Get the state of cells in a radius without revealing them; costs 1 step |

### Actions (4 tools)

| Tool | Description |
|------|-------------|
| `flag_cell` | Mark a hidden cell as hazardous; costs 1 step |
| `unflag_cell` | Remove a hazard flag from a cell; costs 1 step |
| `move_viewport` | Move the fog-of-war viewport center; costs 1 step (fog scenarios only) |
| `submit_solution` | Submit the final answer and end the game |

### Memory / History (3 tools)

| Tool | Description |
|------|-------------|
| `recall_log` | Return all discovered signals and memory events (free) |
| `get_action_history` | Return the full action log with outcomes (free) |
| `get_progress_stats` | Return progress metrics without leaking ground truth (free) |

### Distractor Traps (3 tools)

These look useful but always return errors. Models must learn to avoid them.

| Tool | Description | Actual Behavior |
|------|-------------|-----------------|
| `auto_solve` | "Run the built-in solver" | Always fails; no solver exists |
| `peek_hidden_cell` | "View a hidden cell without revealing it" | Always fails; peeking is disabled |
| `undo_last_action` | "Revert the most recent action" | Always fails; actions are irreversible |
## Reward System

This gym ships with **two** reward modes, selectable via `--reward-mode`:

### Custom Rewards: Episode-Level (`rewards/checks.py`)

The `VisualMemoryChecker` verifies ground truth from the episode trajectory and computes a weighted 6-component score:

| Component | Weight | Description |
|---|---|---|
| `final_correctness` | 0.35 | Was the submission correct? (F1 for partial credit) |
| `safety_score` | 0.20 | Fraction of reveals that did not hit hazards |
| `evidence_support` | 0.15 | Did the agent gather evidence before submitting? |
| `irreversible_penalty` | 0.15 | Hazard hits (0 = no penalty, 2+ = full penalty) |
| `efficiency` | 0.10 | Steps used relative to the budget |
| `unnecessary_guessing` | 0.05 | Trap tool usage and repeated reveals |
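For illustration, the weighting can be reproduced as a plain weighted sum, with an F1 sketch for the partial-credit `final_correctness` component. This is a simplification: it treats every component as a score in [0, 1] and ignores whatever sign handling the real checker applies to the penalty-style components:

```python
# Weights transcribed from the table above.
WEIGHTS = {
    "final_correctness": 0.35, "safety_score": 0.20,
    "evidence_support": 0.15, "irreversible_penalty": 0.15,
    "efficiency": 0.10, "unnecessary_guessing": 0.05,
}

def f1(flagged, true_hazards):
    """F1 between flagged cells and true hazards (partial credit)."""
    flagged, true_hazards = set(flagged), set(true_hazards)
    tp = len(flagged & true_hazards)
    if tp == 0:
        return 0.0
    precision = tp / len(flagged)
    recall = tp / len(true_hazards)
    return 2 * precision * recall / (precision + recall)

def weighted_total(components):
    """Weighted sum over all six components (each assumed in [0, 1])."""
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)
```

Flagging both true hazards plus one false positive, for instance, gives precision 2/3 and recall 1, so `final_correctness` would be 0.8 rather than 0.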
```python
from rewards.checks import VisualMemoryChecker

checker = VisualMemoryChecker()
checker.set_episode(episode)
reward = checker.compute_episode_reward()
# {'final_correctness': 1.0, 'safety_score': 0.85, ..., 'total': 0.78}
```
The base `RewardCalculator` (`rewards/base.py`) wraps this into the standard 3-component formula used across all gyms:

```
total = 0.25 × structural + 0.15 × efficiency + 0.60 × ground_truth + penalty
```
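As a sanity check, the formula is easy to reproduce directly (a sketch, not the `RewardCalculator` implementation):

```python
def combined_reward(structural, efficiency, ground_truth, penalty=0.0):
    """The standard 3-component formula, with penalty added on top."""
    return 0.25 * structural + 0.15 * efficiency + 0.60 * ground_truth + penalty
```

Perfect scores on all three components give a total of 1.0 before any penalty; a -0.1 trap penalty then subtracts directly from the total.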
### OpenEnv Transforms: Per-Step (`rewards/transforms.py`)

The `VisualMemoryStepTransform` provides fine-grained per-step rewards for RL training (GRPO). Each tool call receives a reward based on its outcome:

| Tool | Success | Failure |
|---|---|---|
| `reveal_cell` (safe) | +0.15 | n/a |
| `reveal_cell` (hazard) | -0.40 | n/a |
| `flag_cell` | +0.20 | -0.10 |
| `submit_solution` (correct) | +1.0 | -0.50 |
| `recall_log` | +0.10 | 0.0 |
| `inspect_region` | +0.08 | -0.10 |
| `get_board_view` / `get_status` | +0.05 | 0.0 |
| `move_viewport` | +0.10 | -0.10 |
| Distractor traps | -0.25 | -0.25 |
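The table can be encoded as a lookup, which is handy when analyzing trajectories offline. The `(tool, outcome)` keys and the `step_reward` helper are assumptions for illustration; the real transform operates on observation objects rather than string pairs:

```python
# (tool, outcome) -> reward, transcribed from the table above.
STEP_REWARDS = {
    ("reveal_cell", "safe"): 0.15,
    ("reveal_cell", "hazard"): -0.40,
    ("flag_cell", "success"): 0.20,
    ("flag_cell", "failure"): -0.10,
    ("submit_solution", "correct"): 1.0,
    ("submit_solution", "incorrect"): -0.50,
    ("recall_log", "success"): 0.10,
    ("inspect_region", "success"): 0.08,
    ("inspect_region", "failure"): -0.10,
    ("get_board_view", "success"): 0.05,
    ("get_status", "success"): 0.05,
    ("move_viewport", "success"): 0.10,
    ("move_viewport", "failure"): -0.10,
}
TRAP_TOOLS = {"auto_solve", "peek_hidden_cell", "undo_last_action"}

def step_reward(tool, outcome):
    """Look up the per-step reward; traps are penalized regardless
    of outcome, and unlisted outcomes default to 0.0."""
    if tool in TRAP_TOOLS:
        return -0.25
    return STEP_REWARDS.get((tool, outcome), 0.0)
```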
```python
from rewards.transforms import VisualMemoryStepTransform

transform = VisualMemoryStepTransform()
scored_obs = transform(observation)
print(scored_obs.reward)  # e.g., +0.15 for a safe reveal
```

The `OpenEnvRewardCalculator` (`rewards/base.py`) combines per-step rewards with ground truth into the same weighted formula, using sign-based quality scoring.
## Evaluation

The included `run_eval.py` runs an LLM agent against scenarios and scores the results.

### Quick Start

```bash
cd visual-memory
pip install -e .

# Build and run the environment
docker build -t openenv-visual-memory -f server/Dockerfile .
docker run -d --name visual-memory -p 8000:8000 openenv-visual-memory

# Verify
curl http://localhost:8000/health

# Evaluate (single model, custom rewards)
python run_eval.py --model gpt-5.4 --save --trajectory

# Evaluate (multiple models, per-step rewards)
python run_eval.py --model gpt-5.4,claude-sonnet-4-6,claude-opus-4-6 \
  --parallel 3 --reward-mode openenv --save --trajectory

# Evaluate a specific scenario
python run_eval.py --model gpt-5.4 --scenario directional_trap_8x8

# Cleanup
docker stop visual-memory && docker rm visual-memory
```
### Output Paths

| Output | Path |
|---|---|
| Results markdown | `outputs/results/<run_id>.md` |
| Trajectory JSON | `outputs/trajectories/<run_id>/<model>.json` |

Results files append per-model sections, so you can accumulate multiple model runs in one file.
### CLI Arguments

| Argument | Default | Description |
|---|---|---|
| `--model` | `gpt-4o` | LiteLLM model string (comma-separated for parallel runs) |
| `--scenario` | all | Run a specific scenario by ID |
| `--reward-mode` | `custom` | `custom` (episode-level) or `openenv` (per-step) |
| `--parallel` | `1` | Number of models to run in parallel |
| `--save` | off | Save results markdown |
| `--trajectory` | off | Save trajectory JSON |
| `--temperature` | `0.0` | LLM sampling temperature |
| `--max-tokens` | `1024` | Max tokens per LLM response |
| `--run-id` | auto | Run identifier for grouping outputs |
| `--verbose` | off | Enable debug logging |
## Play Manually (Human Mode)

You can play Phantom Grid yourself in a browser; no LLM or Docker required.

### Quick Start

```bash
cd visual-memory
pip install fastapi uvicorn svgwrite numpy pydantic
python play_server.py
```

Then open **http://localhost:8001** in your browser.
### How to Play

1. **Pick a scenario** from the right panel (e.g. "Directional Trap 8x8")
2. **Click cells** on the board. What happens depends on your click mode:
   - **Reveal** mode (default, blue) uncovers the cell. You'll see:
     - Empty (white): nothing here
     - Signal (light blue): a clue about nearby hazards (a number is the adjacent hazard count; letters like "N,W" point toward hazards)
     - Hazard (red skull): danger! Too many hits means game over
     - Key (gold): collect these in key-hunt scenarios
   - **Flag Hazard** mode (red) marks a cell as a suspected hazard. Click a flagged cell again to unflag it.
3. **Use signals** to deduce hazard positions:
   - A signal showing "2" means 2 hazards are adjacent (among the 8 surrounding cells)
   - A signal showing "N,E" means hazards lie to the North and East
   - Range signals like "1-3" mean between 1 and 3 adjacent hazards
4. **Flag all hazards**, then click **SUBMIT SOLUTION** to see your score
5. After game over, click any scenario button to **start a fresh game**
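The deduction rules in step 3 can be expressed as a consistency check over a candidate hazard set. The string formats below (`"2"`, `"N,E"`, `"1-3"`) are an illustrative assumption about how signals are written, using the 8-cell adjacency described above:

```python
def consistent(signal, cell, hazards):
    """Check whether a candidate hazard set agrees with the signal
    observed at `cell`. Handles numeric, range, and directional
    signals as described in the play guide (parsing is a sketch)."""
    r, c = cell
    adjacent = {(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)}
    nearby = adjacent & set(hazards)
    count = len(nearby)
    if "-" in signal:                      # range signal, e.g. "1-3"
        lo, hi = map(int, signal.split("-"))
        return lo <= count <= hi
    if signal.isdigit():                   # numeric count, e.g. "2"
        return count == int(signal)
    dirs = set(signal.split(","))          # directional, e.g. "N,E"
    seen = set()
    for hr, hc in nearby:
        if hr < r: seen.add("N")
        if hr > r: seen.add("S")
        if hc < c: seen.add("W")
        if hc > c: seen.add("E")
    return dirs <= seen
```

An agent (or a human with a notebook) can prune hypotheses by keeping only hazard sets that stay consistent with every signal revealed so far.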
### Tips

- Start by revealing cells in the center; they give the most signal coverage
- Use the **Recall Log** button to review all signals you've discovered
- In fog-of-war scenarios, use **Move Viewport** to explore; you can only see a small area
- Avoid the distractor tools (auto_solve, peek, undo); they always fail
- The play server runs on **port 8001** and is completely separate from the OpenEnv server (port 8000)
## Project Structure

```
visual-memory/
├── __init__.py            # Package exports (env + rewards)
├── client.py              # OpenEnv client integration
├── models.py              # Action/Observation data models
├── openenv.yaml           # OpenEnv AutoEnv manifest
├── pyproject.toml         # Dependencies (openenv-core v0.2.3)
├── Dockerfile             # Root Dockerfile for HF Spaces
├── .dockerignore
├── run_eval.py            # LLM evaluation runner
├── play.html              # Human play mode UI
├── play_server.py         # Human play mode server
│
├── rewards/               # Reward system (both modes)
│   ├── __init__.py
│   ├── base.py            # Scenario, EpisodeLog, RewardCalculator,
│   │                      #   StepRewardTransform, OpenEnvRewardCalculator
│   ├── checks.py          # VisualMemoryChecker (episode-level)
│   └── transforms.py      # VisualMemoryStepTransform (per-step)
│
├── scenarios/             # Scenario definitions
│   ├── __init__.py
│   ├── definitions.py     # 10 Scenario objects (Python)
│   └── *.json             # Scenario board configs
│
├── agent/                 # LLM agent runner
│   ├── __init__.py
│   ├── llm.py             # LiteLLM wrapper
│   └── runner.py          # AgentRunner (gym-agnostic)
│
├── server/                # OpenEnv environment server
│   ├── __init__.py
│   ├── app.py             # FastAPI + FastMCP server
│   ├── memory_environment.py  # MCPEnvironment implementation
│   ├── engine.py          # Game engine (hidden state)
│   ├── renderer.py        # SVG board renderer
│   └── Dockerfile         # Server-only Dockerfile
│
└── outputs/               # Evaluation outputs (gitignored)
    ├── results/           # Markdown result files
    └── trajectories/      # JSON trajectory files
```
## Configuration (.env)

Copy `.env.example` to `.env` and fill in your API keys:

```bash
cp .env.example .env
# Edit .env with your API keys
```

### LLM API Keys

| Variable | Required For | Description |
|----------|--------------|-------------|
| `OPENAI_API_KEY` | `gpt-4o`, `gpt-5.4`, `o3-pro` | OpenAI API key |
| `OPENAI_API_BASE` | OpenAI | API base URL (default: `https://api.openai.com/v1`) |
| `ANTHROPIC_API_KEY` | `claude-sonnet-4-6`, `claude-opus-4-6` | Anthropic API key |
| `GOOGLE_API_KEY` | `gemini-2.5-pro` | Google AI API key |

Only the key for your chosen `--model` provider is required. For local models via Ollama, no key is needed.
### LLM Defaults

| Variable | Default | Description |
|----------|---------|-------------|
| `LLM_MODEL` | `gpt-4o` | Default model when `--model` is not specified |
| `LLM_TEMPERATURE` | `0.0` | Default sampling temperature |
| `LLM_MAX_TOKENS` | `1024` | Default max tokens per response |

### Environment Server

| Variable | Default | Description |
|----------|---------|-------------|
| `OPENENV_PORT` | `8000` | OpenEnv server port (exposed) |
| `MAX_CONCURRENT_ENVS` | `4` | Max parallel evaluation sessions |
| `ENABLE_WEB_INTERFACE` | `true` | Enable the HF Spaces web UI |
| `RENDER_MODE` | `svg` | Board rendering format |
| `MAX_BOARD_SIZE` | `12` | Maximum supported board dimension |
## Concurrent Sessions

Each evaluation session gets its own isolated `GameEngine` instance, so multiple agents can evaluate simultaneously against the same Docker container without interference.
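A minimal sketch of that isolation pattern (the `GameEngine` stub and the registry below are illustrative, not the server's actual code):

```python
class GameEngine:
    """Stand-in for server/engine.py: holds one session's state."""
    def __init__(self):
        self.steps = 0

# One engine per session id; sessions never share state.
_engines: dict[str, GameEngine] = {}

def engine_for(session_id: str) -> GameEngine:
    """Return the session's engine, creating it on first use."""
    return _engines.setdefault(session_id, GameEngine())
```

Because each session id maps to its own engine object, concurrent evaluations mutate disjoint state.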
## Results

See `comparison.md` for the full 5-model x 2-reward-mode comparison. The SOTA average is well below the 0.6-0.7 target band, confirming the gym's difficulty.

| Reward Mode | SOTA Average | All Models Average |
|---|:---:|:---:|
| Custom | -0.14 | -0.14 |
| OpenEnv | 0.28 | 0.28 |