---
title: Visual Memory
emoji: 🧠
colorFrom: purple
colorTo: indigo
sdk: docker
pinned: false
license: mit
app_port: 8000
base_path: /web
tags:
  - openenv
  - openenv-0.2.3
  - rl-environment
---

# Visual Memory Gym — Phantom Grid

Hidden-state visual reasoning and planning under partial observability.

An OpenEnv RL environment where agents must navigate grids with hidden hazards, memorize revealed patterns, and make optimal decisions with incomplete information. The name Phantom Grid reflects the core challenge: invisible dangers lurk beneath every cell, and the agent must deduce their locations from indirect signals — like hunting phantoms by their shadows. Designed to stress spatial reasoning, working memory, uncertainty handling, and risk-averse planning — areas where frontier LLMs consistently underperform.

## Playground Quick Start

Use the Playground panel (right side) to interact with the environment. Type a **Tool Name** and **Arguments Json**, then click **Step**.

### Typical workflow

1. Click **Reset** to start a fresh session
2. Enter `list_tools` (args: `{}`) → discover all available tools and their parameters
3. Enter `list_scenarios` (args: `{}`) → see all 10 scenarios
4. Enter `load_scenario` (args: `{"scenario_id": "directional_trap_8x8"}`) → start a game
5. Enter `get_board_view` (args: `{}`) → see the board as SVG
6. Enter `reveal_cell` (args: `{"row": 0, "col": 0}`) → uncover a cell and read its signal
7. Enter `inspect_region` (args: `{"center_row": 3, "center_col": 3, "radius": 1}`) → peek at nearby cells without revealing them
8. Enter `flag_cell` (args: `{"row": 3, "col": 5}`) → mark a suspected hazard
9. Enter `submit_solution` (args: `{"flagged_positions": "[[3,5]]"}`) → submit your answer (ends the game)

## All tool commands (copy-paste ready)

### Discovery & session tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `list_tools` | `{}` | List every available tool with its parameters and types |
| `get_session_info` | `{}` | Current session/episode ID, step count, and whether a scenario is loaded |
| `list_scenarios` | `{}` | List all 10 scenarios with difficulty, board size, and how-to-play hints |
| `load_scenario` | `{"scenario_id": "directional_trap_8x8"}` | Load and start a scenario (resets any in-progress game) |
| `reset_scenario` | `{}` | Restart the current scenario from scratch |

### Observation tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `get_board_view` | `{}` | Render the board as SVG with cell-count metadata (free — no step cost) |
| `get_status` | `{}` | Game status: step count, max steps, flags remaining, game-over state (free) |
| `reveal_cell` | `{"row": 0, "col": 0}` | Reveal a hidden cell and return its content (costs 1 step) |
| `inspect_region` | `{"center_row": 3, "center_col": 3, "radius": 1}` | Peek at cells in a radius without revealing them (costs 1 step) |
| `move_viewport` | `{"row": 5, "col": 5}` | Move the fog-of-war camera center (fog scenarios only; costs 1 step) |

**Note:** `inspect_region` uses `center_row` / `center_col` (not `row` / `col`). `radius` is optional and defaults to 1.
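Because the mismatched parameter names are an easy mistake, a small helper like the following (hypothetical, not part of the package) can build the Arguments Json string with the correct keys:

```python
import json

def inspect_region_args(center_row: int, center_col: int, radius: int = 1) -> str:
    """Build the Arguments Json string for inspect_region.

    Note the parameter names: center_row / center_col, not row / col.
    """
    return json.dumps({"center_row": center_row,
                       "center_col": center_col,
                       "radius": radius})

print(inspect_region_args(3, 3))
# {"center_row": 3, "center_col": 3, "radius": 1}
```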

### Action tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `flag_cell` | `{"row": 1, "col": 1}` | Mark a cell as hazardous (costs 1 step) |
| `unflag_cell` | `{"row": 1, "col": 1}` | Remove a hazard flag (costs 1 step) |
| `submit_solution` | `{"flagged_positions": "[[0,1],[2,3]]"}` | Submit your final answer — ends the game |

**Note:** `submit_solution` also accepts an optional `safe_positions` argument (a JSON string of `[[row,col],...]`).
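The double encoding is easy to miss: `flagged_positions` is itself a JSON string nested inside the outer arguments object. A hypothetical helper makes this explicit:

```python
import json

def submit_solution_args(flagged, safe=None) -> str:
    """Build Arguments Json for submit_solution.

    flagged_positions (and the optional safe_positions) must be
    JSON-encoded strings of [[row, col], ...], not raw lists.
    """
    args = {"flagged_positions": json.dumps(flagged)}
    if safe is not None:
        args["safe_positions"] = json.dumps(safe)
    return json.dumps(args)

print(submit_solution_args([[0, 1], [2, 3]]))
```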

### Memory & history tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `recall_log` | `{}` | Review all signals and memory events discovered so far (free) |
| `get_action_history` | `{}` | Full log of every action taken and its outcome (free) |
| `get_progress_stats` | `{}` | Progress metrics: % cells revealed, flags placed, steps remaining (free) |

### Trap tools (avoid these!)

These exist to test whether an agent takes shortcuts. They always fail and incur a -0.1 reward penalty.

| Tool Name | Arguments Json | Description |
|---|---|---|
| `auto_solve` | `{}` | Attempts to auto-solve — always rejected |
| `peek_hidden_cell` | `{"row": 2, "col": 2}` | Attempts to cheat-peek a cell — always rejected |
| `undo_last_action` | `{}` | Attempts to undo — always rejected |

## Run locally

```bash
cd visual-memory
pip install -e .

# Start the environment server
docker build -t openenv-visual-memory -f Dockerfile .
docker run -d --name visual-memory -p 8000:8000 openenv-visual-memory

# Verify it's running
curl http://localhost:8000/health

# Open the playground in your browser
open http://localhost:8000/web/
```

## Hugging Face Space Deployment

This Space is built from the OpenEnv environment `visual_memory`.

- Space URL: https://huggingface.co/spaces/huzzle-labs/visual_memory
- OpenEnv pinned ref: `0.2.3`
- Hub tag: `openenv`

## Connecting from Code

Connect using the `VisualMemoryEnv` client (the awaited `step` calls need an async context, so the example wraps them in a coroutine):

```python
import asyncio

from visual_memory import VisualMemoryAction, VisualMemoryEnv

async def main():
    with VisualMemoryEnv.from_env("huzzle-labs/visual_memory") as env:
        obs = env.reset()
        obs = await env.step(VisualMemoryAction(
            tool_name="list_scenarios",
            arguments_json="{}"
        ))
        obs = await env.step(VisualMemoryAction(
            tool_name="load_scenario",
            arguments_json='{"scenario_id": "directional_trap_8x8"}'
        ))
        obs = await env.step(VisualMemoryAction(
            tool_name="reveal_cell",
            arguments_json='{"row": 2, "col": 3}'
        ))

asyncio.run(main())
```

Or connect directly to a running server:

```python
env = VisualMemoryEnv(base_url="https://huzzle-labs-visual-memory.hf.space")
```

## What Is This Gym?

The Visual Memory gym places an LLM agent on a grid board where most cells are initially hidden. The agent must use MCP tools to reveal cells one at a time, interpret the signals (clues about nearby hazards), flag hazard locations, and submit a solution — all within a limited step budget. Every reveal risks hitting a hazard (which can end the game), so the agent must balance information gathering with caution.
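The explore-then-commit loop this demands can be sketched as follows. This is illustrative only: `step` stands in for real tool calls, and the three-signal confidence threshold is an arbitrary placeholder, not the gym's logic.

```python
def run_episode(step, budget: int):
    """Gather evidence until confident or nearly out of steps, then submit.

    `step(tool_name)` stands in for a real env.step tool call and returns
    the revealed content (a signal string, or "hazard").
    """
    evidence = []
    for _ in range(budget - 1):          # reserve one step for the submission
        signal = step("reveal_cell")     # each reveal costs a step and carries risk
        if signal == "hazard":
            break                        # stop probing blindly after a hit
        evidence.append(signal)
        if len(evidence) >= 3:           # placeholder confidence threshold
            break
    return step("submit_solution"), evidence
```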

Unlike typical text-only reasoning benchmarks, this gym requires:

- **Spatial reasoning** — interpreting directional and range signals to triangulate hazard positions
- **Working memory** — recalling previously revealed information across many steps (some cells flash and then fade)
- **Risk assessment** — deciding when enough evidence exists to commit vs. when to gather more
- **Distractor resistance** — ignoring trap tools that look helpful but always fail or mislead

## Task Families (10 Scenarios)

The gym includes 10 hand-crafted scenarios across 4 task families:

### Hidden Grid (5 scenarios)

Deduce hazard locations from signal clues on partially revealed grids. Signal modes include numeric counts, directional arrows, ambiguous ranges, and partial directional hints.

| Scenario | Board | Hazards | Signal Mode | Difficulty |
|---|---|---|---|---|
| `ambiguous_cluster_10x10` | 10x10 | 18 | Range (min-max) | Hard |
| `directional_trap_8x8` | 8x8 | 14 | Directional (N/S/E/W) | Hard |
| `partial_intel_9x9` | 9x9 | 16 | Partial directional | Hard |
| `cascading_deduction_11x11` | 11x11 | 25 | Partial directional | Very Hard |
| `safe_zone_identification_9x9` | 9x9 | 22 | Range (min-max) | Hard |

### Pattern Memory (2 scenarios)

Some cells flash their content briefly, then fade. The agent must memorize what was shown and use that memory to avoid hazards and collect keys.

| Scenario | Board | Special | Difficulty |
|---|---|---|---|
| `flash_fade_minefield_7x7` | 7x7 | Flash-then-fade cells | Hard |
| `delayed_recall_keys_8x8` | 8x8 | 5 keys to collect from faded memory | Hard |

### Fog of War (2 scenarios)

The agent has a limited viewport radius and must move it around the board to explore. Planning efficient exploration paths is critical.

| Scenario | Board | Viewport | Difficulty |
|---|---|---|---|
| `fog_labyrinth_10x10` | 10x10 | Radius 2 | Hard |
| `fog_key_hunt_8x8` | 8x8 | Radius 1 (tiny) | Very Hard |

### Distractor Search (1 scenario)

Decoys visually resemble keys. The agent must distinguish real targets from decoys while avoiding hazards.

| Scenario | Board | Keys | Decoys | Difficulty |
|---|---|---|---|---|
| `decoy_minefield_8x10` | 8x10 | 4 | 8 | Very Hard |

## Architecture

```
┌─────────────────────────────────────────┐
│           OpenEnv Server (:8000)        │
│  ┌────────────┐  ┌───────────────────┐  │
│  │  FastMCP   │──│ MemoryEnvironment │  │
│  │  (18 tools)│  │  (MCPEnvironment) │  │
│  └────────────┘  └────────┬──────────┘  │
│                           │             │
│              ┌────────────┼──────────┐  │
│              │  Engine    │ Renderer │  │
│              │ (hidden    │  (SVG)   │  │
│              │  state)    │          │  │
│              └────────────┴──────────┘  │
└─────────────────────────────────────────┘
```

All state is in-memory per session. No database, no external APIs. The engine manages the hidden board, validates moves, and computes win/loss conditions. The renderer produces deterministic SVG board views.
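As an illustration of the engine's responsibilities (not the actual `engine.py` implementation), a minimal version would validate bounds and the step budget before touching hidden state:

```python
class MiniEngine:
    """Toy version of the hidden-state game engine (illustrative only)."""

    def __init__(self, size, hazards, max_steps):
        self.size = size
        self.hazards = set(hazards)   # ground truth, never exposed directly
        self.max_steps = max_steps
        self.steps = 0
        self.revealed = set()

    def reveal(self, row, col):
        if not (0 <= row < self.size and 0 <= col < self.size):
            return "error: out of bounds"          # validate before mutating
        if self.steps >= self.max_steps:
            return "error: step budget exhausted"
        self.steps += 1
        self.revealed.add((row, col))
        return "hazard" if (row, col) in self.hazards else "safe"
```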

## MCP Tools (18 total)

### Session Management (4 tools)

| Tool | Description |
|---|---|
| `get_session_info` | Get current session metadata (episode, step count) |
| `list_scenarios` | List all available scenarios with difficulty tags |
| `load_scenario` | Load and start a specific scenario by ID |
| `reset_scenario` | Restart the current scenario from scratch |

### Observation (4 tools)

| Tool | Description |
|---|---|
| `get_board_view` | Get the visible board as SVG with cell-count metadata (free) |
| `get_status` | Get game status: score, flags, cells revealed, win condition (free) |
| `reveal_cell` | Reveal one hidden cell at (row, col) — costs 1 step |
| `inspect_region` | Get the state of cells in a radius without revealing them — costs 1 step |

### Actions (4 tools)

| Tool | Description |
|---|---|
| `flag_cell` | Mark a hidden cell as hazardous — costs 1 step |
| `unflag_cell` | Remove a hazard flag from a cell — costs 1 step |
| `move_viewport` | Move the fog-of-war viewport center — costs 1 step (fog scenarios only) |
| `submit_solution` | Submit the final answer and end the game |

### Memory / History (3 tools)

| Tool | Description |
|---|---|
| `recall_log` | Return all discovered signals and memory events (free) |
| `get_action_history` | Return the full action log with outcomes (free) |
| `get_progress_stats` | Return progress metrics without leaking ground truth (free) |

### Distractor Traps (3 tools)

These look useful but always return errors. Models must learn to avoid them.

| Tool | Description | Actual Behavior |
|---|---|---|
| `auto_solve` | "Run the built-in solver" | Always fails — no solver exists |
| `peek_hidden_cell` | "View hidden cell without revealing" | Always fails — peeking is disabled |
| `undo_last_action` | "Revert the most recent action" | Always fails — actions are irreversible |

## Reward System

This gym ships with two reward modes, selectable via `--reward-mode`:

### Custom Rewards — Episode-Level (`rewards/checks.py`)

The `VisualMemoryChecker` verifies ground truth from the episode trajectory and computes a weighted 6-component score:

| Component | Weight | Description |
|---|---|---|
| `final_correctness` | 0.35 | Was the submission correct? (F1 for partial credit) |
| `safety_score` | 0.20 | Fraction of reveals that didn't hit hazards |
| `evidence_support` | 0.15 | Did the agent gather evidence before submitting? |
| `irreversible_penalty` | 0.15 | Hazard hits (0 = no penalty, 2+ = full penalty) |
| `efficiency` | 0.10 | Steps used relative to budget |
| `unnecessary_guessing` | 0.05 | Trap tool usage + repeated reveals |

```python
from rewards.checks import VisualMemoryChecker

checker = VisualMemoryChecker()
checker.set_episode(episode)
reward = checker.compute_episode_reward()
# {'final_correctness': 1.0, 'safety_score': 0.85, ..., 'total': 0.78}
```

The base `RewardCalculator` (`rewards/base.py`) wraps this into the standard 3-component formula used across all gyms:

```
total = 0.25 × structural + 0.15 × efficiency + 0.60 × ground_truth + penalty
```
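In code, the formula is just a weighted sum. This is a sketch of the quoted formula only; the real `RewardCalculator` may clamp or post-process the result:

```python
def total_reward(structural, efficiency, ground_truth, penalty=0.0):
    """Weighted 3-component reward formula (plus penalty term)."""
    return 0.25 * structural + 0.15 * efficiency + 0.60 * ground_truth + penalty

print(total_reward(1.0, 1.0, 1.0))   # perfect episode, no penalty
```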

### OpenEnv Transforms — Per-Step (`rewards/transforms.py`)

The `VisualMemoryStepTransform` provides fine-grained per-step rewards for RL training (e.g. GRPO). Each tool call receives a reward based on its outcome:

| Tool | Success | Failure |
|---|---|---|
| `reveal_cell` (safe) | +0.15 | — |
| `reveal_cell` (hazard) | -0.40 | — |
| `flag_cell` | +0.20 | -0.10 |
| `submit_solution` (correct) | +1.0 | -0.50 |
| `recall_log` | +0.10 | 0.0 |
| `inspect_region` | +0.08 | -0.10 |
| `get_board_view` / `get_status` | +0.05 | 0.0 |
| `move_viewport` | +0.10 | -0.10 |
| Distractor traps | -0.25 | -0.25 |

```python
from rewards.transforms import VisualMemoryStepTransform

transform = VisualMemoryStepTransform()
scored_obs = transform(observation)
print(scored_obs.reward)  # e.g., +0.15 for a safe reveal
```

The `OpenEnvRewardCalculator` (`rewards/base.py`) combines per-step rewards with ground truth into the same weighted formula, using sign-based quality scoring.
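The schedule in the table can be read as a simple lookup. The subset below is purely illustrative; the actual `VisualMemoryStepTransform` derives rewards from observation content, not a static table:

```python
# Subset of the per-step schedule; unlisted (tool, outcome) pairs
# default to 0.0 here purely for illustration.
STEP_REWARDS = {
    ("reveal_cell", "safe"): 0.15,
    ("reveal_cell", "hazard"): -0.40,
    ("flag_cell", "success"): 0.20,
    ("flag_cell", "failure"): -0.10,
    ("submit_solution", "correct"): 1.00,
    ("submit_solution", "incorrect"): -0.50,
    ("auto_solve", "any"): -0.25,    # distractor traps always penalize
}

def step_reward(tool, outcome):
    return STEP_REWARDS.get((tool, outcome), 0.0)
```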

## Evaluation

The included `run_eval.py` runs an LLM agent against scenarios and scores the results.

### Quick Start

```bash
cd visual-memory
pip install -e .

# Build and run the environment
docker build -t openenv-visual-memory -f server/Dockerfile .
docker run -d --name visual-memory -p 8000:8000 openenv-visual-memory

# Verify
curl http://localhost:8000/health

# Evaluate (single model, custom rewards)
python run_eval.py --model gpt-5.4 --save --trajectory

# Evaluate (multiple models, per-step rewards)
python run_eval.py --model gpt-5.4,claude-sonnet-4-6,claude-opus-4-6 \
  --parallel 3 --reward-mode openenv --save --trajectory

# Evaluate a specific scenario
python run_eval.py --model gpt-5.4 --scenario directional_trap_8x8

# Cleanup
docker stop visual-memory && docker rm visual-memory
```

### Output Paths

| Output | Path |
|---|---|
| Results markdown | `outputs/results/<run_id>.md` |
| Trajectory JSON | `outputs/trajectories/<run_id>/<model>.json` |

Results files append per-model sections, so you can accumulate multiple model runs in one file.

### CLI Arguments

| Argument | Default | Description |
|---|---|---|
| `--model` | `gpt-4o` | LiteLLM model string (comma-separated for parallel runs) |
| `--scenario` | all | Run a specific scenario by ID |
| `--reward-mode` | `custom` | `custom` (episode-level) or `openenv` (per-step) |
| `--parallel` | 1 | Number of models to run in parallel |
| `--save` | off | Save results markdown |
| `--trajectory` | off | Save trajectory JSON |
| `--temperature` | 0.0 | LLM sampling temperature |
| `--max-tokens` | 1024 | Max tokens per LLM response |
| `--run-id` | auto | Run identifier for grouping outputs |
| `--verbose` | off | Enable debug logging |

## Play Manually (Human Mode)

You can play Phantom Grid yourself in a browser — no LLM, no Docker required.

### Quick Start

```bash
cd visual-memory
pip install fastapi uvicorn svgwrite numpy pydantic
python play_server.py
```

Then open http://localhost:8001 in your browser.

### How to Play

1. Pick a scenario from the right panel (e.g. "Directional Trap 8x8")
2. Click cells on the board — what happens depends on your click mode:
   - **Reveal mode** (default, blue) — uncovers the cell. You'll see:
     - Empty (white) — nothing here
     - Signal (light blue) — a clue about nearby hazards (a number = adjacent hazard count; letters like "N,W" = direction to hazards)
     - Hazard (red skull) — danger! Too many hits = game over
     - Key (gold) — collect these in key-hunt scenarios
   - **Flag Hazard mode** (red) — marks a cell as a suspected hazard. Click a flagged cell again to unflag it.
3. Use signals to deduce hazard positions:
   - A signal showing "2" means 2 hazards are adjacent (among the 8 surrounding cells)
   - A signal showing "N,E" means hazards lie to the north and east
   - Range signals like "1-3" mean between 1 and 3 adjacent hazards
4. Flag all hazards, then click **SUBMIT SOLUTION** to see your score
5. After game over, click any scenario button to start a fresh game
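The count-signal deduction in step 3 can be mechanized: enumerate the 8 neighbors that could still hold a hazard, and if a signal's count equals the number of remaining candidates, all of them must be hazards. A sketch (the helper name is illustrative, not part of the package):

```python
def hazard_candidates(r, c, size, known_safe):
    """Neighbors of a count signal at (r, c) that could still hold a hazard."""
    cells = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            nr, nc = r + dr, c + dc
            if 0 <= nr < size and 0 <= nc < size and (nr, nc) not in known_safe:
                cells.append((nr, nc))
    return cells

# A corner cell on an 8x8 board has exactly 3 neighbors, so if a corner
# signal reads "3", all three must be flagged.
print(hazard_candidates(0, 0, 8, set()))
# [(0, 1), (1, 0), (1, 1)]
```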

### Tips

- Start by revealing cells in the center — they give the most signal coverage
- Use the **Recall Log** button to review all signals you've discovered
- In fog-of-war scenarios, use **Move Viewport** to explore — you can only see a small area
- Avoid the distractor tools (`auto_solve`, `peek_hidden_cell`, `undo_last_action`) — they always fail
- The play server runs on port 8001 and is completely separate from the OpenEnv server (port 8000)

## Project Structure

```
visual-memory/
├── __init__.py                  # Package exports (env + rewards)
├── client.py                    # OpenEnv client integration
├── models.py                    # Action/Observation data models
├── openenv.yaml                 # OpenEnv AutoEnv manifest
├── pyproject.toml               # Dependencies (openenv-core v0.2.3)
├── Dockerfile                   # Root Dockerfile for HF Spaces
├── .dockerignore
├── run_eval.py                  # LLM evaluation runner
├── play.html                    # Human play mode UI
├── play_server.py               # Human play mode server
│
├── rewards/                     # Reward system (both modes)
│   ├── __init__.py
│   ├── base.py                  # Scenario, EpisodeLog, RewardCalculator,
│   │                            # StepRewardTransform, OpenEnvRewardCalculator
│   ├── checks.py                # VisualMemoryChecker (episode-level)
│   └── transforms.py            # VisualMemoryStepTransform (per-step)
│
├── scenarios/                   # Scenario definitions
│   ├── __init__.py
│   ├── definitions.py           # 10 Scenario objects (Python)
│   └── *.json                   # Scenario board configs
│
├── agent/                       # LLM agent runner
│   ├── __init__.py
│   ├── llm.py                   # LiteLLM wrapper
│   └── runner.py                # AgentRunner (gym-agnostic)
│
├── server/                      # OpenEnv environment server
│   ├── __init__.py
│   ├── app.py                   # FastAPI + FastMCP server
│   ├── memory_environment.py    # MCPEnvironment implementation
│   ├── engine.py                # Game engine (hidden state)
│   ├── renderer.py              # SVG board renderer
│   └── Dockerfile               # Server-only Dockerfile
│
└── outputs/                     # Evaluation outputs (gitignored)
    ├── results/                 # Markdown result files
    └── trajectories/            # JSON trajectory files
```

## Configuration (`.env`)

Copy `.env.example` to `.env` and fill in your API keys:

```bash
cp .env.example .env
# Edit .env with your API keys
```

### LLM API Keys

| Variable | Required For | Description |
|---|---|---|
| `OPENAI_API_KEY` | gpt-4o, gpt-5.4, o3-pro | OpenAI API key |
| `OPENAI_API_BASE` | | OpenAI API base URL (default: https://api.openai.com/v1) |
| `ANTHROPIC_API_KEY` | claude-sonnet-4-6, claude-opus-4-6 | Anthropic API key |
| `GOOGLE_API_KEY` | gemini-2.5-pro | Google AI API key |

Only the key for your chosen `--model` provider is required. For local models via Ollama, no key is needed.

### LLM Defaults

| Variable | Default | Description |
|---|---|---|
| `LLM_MODEL` | `gpt-4o` | Default model when `--model` is not specified |
| `LLM_TEMPERATURE` | 0.0 | Default sampling temperature |
| `LLM_MAX_TOKENS` | 1024 | Default max tokens per response |

### Environment Server

| Variable | Default | Description |
|---|---|---|
| `OPENENV_PORT` | 8000 | OpenEnv server port (exposed) |
| `MAX_CONCURRENT_ENVS` | 4 | Max parallel evaluation sessions |
| `ENABLE_WEB_INTERFACE` | true | Enable the HF Spaces web UI |
| `RENDER_MODE` | svg | Board rendering format |
| `MAX_BOARD_SIZE` | 12 | Maximum supported board dimension |

## Concurrent Sessions

Each evaluation session gets its own isolated GameEngine instance. Multiple agents can evaluate simultaneously against the same Docker container without interference.
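The isolation described here amounts to a per-session registry. A minimal sketch (names are assumptions, not the server's actual code; a dict stands in for a `GameEngine` instance):

```python
_SESSIONS = {}

def get_session_state(session_id):
    """Return this session's private state, creating it on first use.

    Each session gets its own entry, so concurrent agents never share
    board state.
    """
    return _SESSIONS.setdefault(session_id, {"steps": 0, "revealed": set()})
```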

## Results

See `comparison.md` for the full 5-model × 2-reward-mode comparison. The SOTA average is well below the 0.6-0.7 target band, confirming the gym's difficulty.

| Reward Mode | SOTA Average | All Models Average |
|---|---|---|
| Custom | -0.14 | -0.14 |
| OpenEnv | 0.28 | 0.28 |