---
title: Visual Memory
emoji: 🧠
colorFrom: purple
colorTo: indigo
sdk: docker
pinned: false
license: mit
app_port: 8000
base_path: /web
tags:
- openenv
- openenv-0.2.3
- rl-environment
---
# Visual Memory Gym: *Phantom Grid*
**Hidden-state visual reasoning and planning under partial observability.**
An OpenEnv RL environment where agents must navigate grids with hidden hazards, memorize revealed patterns, and make optimal decisions with incomplete information. The name *Phantom Grid* reflects the core challenge: invisible dangers lurk beneath every cell, and the agent must deduce their locations from indirect signals, like hunting phantoms by their shadows. The gym is designed to stress spatial reasoning, working memory, uncertainty handling, and risk-averse planning, all areas where frontier LLMs consistently underperform.
## Playground Quick Start
Use the **Playground** panel (right side) to interact with the environment. Type a **Tool Name** and **Arguments Json**, then click **Step**.
### Typical workflow
1. Click **Reset** to start a fresh session
2. Enter `list_tools` (args: `{}`) → discover all available tools and their parameters
3. Enter `list_scenarios` (args: `{}`) → see all 10 scenarios
4. Enter `load_scenario` (args: `{"scenario_id": "directional_trap_8x8"}`) → start a game
5. Enter `get_board_view` (args: `{}`) → see the board as SVG
6. Enter `reveal_cell` (args: `{"row": 0, "col": 0}`) → uncover a cell and read its signal
7. Enter `inspect_region` (args: `{"center_row": 3, "center_col": 3, "radius": 1}`) → peek at nearby cells without revealing them
8. Enter `flag_cell` (args: `{"row": 3, "col": 5}`) → mark a suspected hazard
9. Enter `submit_solution` (args: `{"flagged_positions": "[[3,5]]"}`) → submit your answer (ends the game)
### All tool commands (copy-paste ready)
#### Discovery & session tools
| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `list_tools` | `{}` | List every available tool with its parameters and types |
| `get_session_info` | `{}` | Current session/episode ID, step count, whether a scenario is loaded |
| `list_scenarios` | `{}` | List all 10 scenarios with difficulty, board size, and how-to-play hints |
| `load_scenario` | `{"scenario_id": "directional_trap_8x8"}` | Load and start a scenario (resets any in-progress game) |
| `reset_scenario` | `{}` | Restart the current scenario from scratch |
#### Observation tools
| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `get_board_view` | `{}` | Render the board as SVG with cell-count metadata (free; no step cost) |
| `get_status` | `{}` | Game status: step count, max steps, flags remaining, game over state (free) |
| `reveal_cell` | `{"row": 0, "col": 0}` | Reveal a hidden cell and return its content (costs 1 step) |
| `inspect_region` | `{"center_row": 3, "center_col": 3, "radius": 1}` | Peek at cells in a radius without revealing them (costs 1 step) |
| `move_viewport` | `{"row": 5, "col": 5}` | Move the fog-of-war camera center (fog scenarios only, costs 1 step) |
> **Note:** `inspect_region` uses `center_row` / `center_col` (not `row` / `col`). `radius` is optional and defaults to `1`.
#### Action tools
| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `flag_cell` | `{"row": 1, "col": 1}` | Mark a cell as hazardous (costs 1 step) |
| `unflag_cell` | `{"row": 1, "col": 1}` | Remove a hazard flag (costs 1 step) |
| `submit_solution` | `{"flagged_positions": "[[0,1],[2,3]]"}` | Submit your final answer (ends the game) |
> **Note:** `submit_solution` also accepts an optional `safe_positions` argument (JSON string of `[[row,col],...]`).
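Because `flagged_positions` is a JSON *string* rather than a nested list, the double encoding is easy to get wrong. A small sketch of building the arguments (the `make_submit_args` helper is our own, not part of the client):

```python
import json

def make_submit_args(flagged, safe=None):
    """Build the arguments JSON for submit_solution.

    flagged/safe are lists of (row, col) pairs; the server expects
    each list encoded as a JSON *string*, not a nested JSON array.
    """
    args = {"flagged_positions": json.dumps([list(p) for p in flagged])}
    if safe is not None:
        args["safe_positions"] = json.dumps([list(p) for p in safe])
    return json.dumps(args)

print(make_submit_args([(3, 5), (0, 1)]))
# {"flagged_positions": "[[3, 5], [0, 1]]"}
```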
#### Memory & history tools
| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `recall_log` | `{}` | Review all signals and memory events discovered so far (free) |
| `get_action_history` | `{}` | Full log of every action taken and its outcome (free) |
| `get_progress_stats` | `{}` | Progress metrics: % cells revealed, flags placed, steps remaining (free) |
#### Trap tools (avoid these!)
These exist to test whether an agent takes shortcuts. They always fail and give a **-0.1 reward penalty**.
| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `auto_solve` | `{}` | Attempts to auto-solve; always rejected |
| `peek_hidden_cell` | `{"row": 2, "col": 2}` | Attempts to cheat-peek a cell; always rejected |
| `undo_last_action` | `{}` | Attempts to undo; always rejected |
### Run locally
```bash
cd visual-memory
pip install -e .
# Start the environment server
docker build -t openenv-visual-memory -f Dockerfile .
docker run -d --name visual-memory -p 8000:8000 openenv-visual-memory
# Verify it's running
curl http://localhost:8000/health
# Open the playground in your browser
open http://localhost:8000/web/
```
## Hugging Face Space Deployment
This Space is built from the OpenEnv environment `visual_memory`.
- **Space URL**: `https://huggingface.co/spaces/huzzle-labs/visual_memory`
- **OpenEnv pinned ref**: `0.2.3`
- **Hub tag**: `openenv`
### Connecting from Code
Connect using the `VisualMemoryEnv` client:
```python
import asyncio

from visual_memory import VisualMemoryAction, VisualMemoryEnv

async def main():
    with VisualMemoryEnv.from_env("huzzle-labs/visual_memory") as env:
        obs = env.reset()
        obs = await env.step(VisualMemoryAction(
            tool_name="list_scenarios",
            arguments_json="{}",
        ))
        obs = await env.step(VisualMemoryAction(
            tool_name="load_scenario",
            arguments_json='{"scenario_id": "directional_trap_8x8"}',
        ))
        obs = await env.step(VisualMemoryAction(
            tool_name="reveal_cell",
            arguments_json='{"row": 2, "col": 3}',
        ))

asyncio.run(main())
```
Or connect directly to a running server:
```python
env = VisualMemoryEnv(base_url="https://huzzle-labs-visual-memory.hf.space")
```
## What Is This Gym?
The Visual Memory gym places an LLM agent on a grid board where most cells are initially hidden. The agent must use MCP tools to reveal cells one at a time, interpret the signals (clues about nearby hazards), flag hazard locations, and submit a solution, all within a limited step budget. Every reveal risks hitting a hazard (which can end the game), so the agent must balance information gathering with caution.
Unlike typical text-only reasoning benchmarks, this gym requires:
- **Spatial reasoning**: interpreting directional and range signals to triangulate hazard positions
- **Working memory**: recalling previously revealed information across many steps (some cells flash and then fade)
- **Risk assessment**: deciding when enough evidence exists to commit vs. when to gather more
- **Distractor resistance**: ignoring trap tools that look helpful but always fail or mislead
## Task Families (10 Scenarios)
The gym includes 10 hand-crafted scenarios across 4 task families:
### Hidden Grid (5 scenarios)
Deduce hazard locations from signal clues on partially revealed grids. Signal modes include numeric counts, directional arrows, ambiguous ranges, and partial directional hints.
| Scenario | Board | Hazards | Signal Mode | Difficulty |
|---|---|---|---|---|
| `ambiguous_cluster_10x10` | 10x10 | 18 | Range (min-max) | Hard |
| `directional_trap_8x8` | 8x8 | 14 | Directional (N/S/E/W) | Hard |
| `partial_intel_9x9` | 9x9 | 16 | Partial directional | Hard |
| `cascading_deduction_11x11` | 11x11 | 25 | Partial directional | Very Hard |
| `safe_zone_identification_9x9` | 9x9 | 22 | Range (min-max) | Hard |
### Pattern Memory (2 scenarios)
Some cells flash their content briefly then fade. The agent must memorize what was shown and use that memory to avoid hazards and collect keys.
| Scenario | Board | Special | Difficulty |
|---|---|---|---|
| `flash_fade_minefield_7x7` | 7x7 | Flash-then-fade cells | Hard |
| `delayed_recall_keys_8x8` | 8x8 | 5 keys to collect from faded memory | Hard |
### Fog of War (2 scenarios)
The agent has a limited viewport radius and must move it around the board to explore. Planning efficient exploration paths is critical.
| Scenario | Board | Viewport | Difficulty |
|---|---|---|---|
| `fog_labyrinth_10x10` | 10x10 | Radius 2 | Hard |
| `fog_key_hunt_8x8` | 8x8 | Radius 1 (tiny) | Very Hard |
### Distractor Search (1 scenario)
Decoys visually resemble keys. The agent must distinguish real targets from decoys while avoiding hazards.
| Scenario | Board | Keys | Decoys | Difficulty |
|---|---|---|---|---|
| `decoy_minefield_8x10` | 8x10 | 4 | 8 | Very Hard |
## Architecture
```
┌─────────────────────────────────────────┐
│         OpenEnv Server (:8000)          │
│  ┌────────────┐   ┌───────────────────┐ │
│  │  FastMCP   │───│ MemoryEnvironment │ │
│  │ (18 tools) │   │ (MCPEnvironment)  │ │
│  └────────────┘   └─────────┬─────────┘ │
│                             │           │
│              ┌──────────────┼─────────┐ │
│              │   Engine     │ Renderer│ │
│              │   (hidden    │  (SVG)  │ │
│              │    state)    │         │ │
│              └──────────────┴─────────┘ │
└─────────────────────────────────────────┘
```
All state is in-memory per session. No database, no external APIs. The engine manages the hidden board, validates moves, and computes win/loss conditions. The renderer produces deterministic SVG board views.
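As a rough sketch of that in-memory pattern (class and method names here are illustrative; the actual `GameEngine` API may differ):

```python
# Minimal sketch of an in-memory hidden-state engine.
# Illustrative names only -- not the real GameEngine API.
class MiniEngine:
    def __init__(self, rows, cols, hazards, max_steps):
        self.rows, self.cols = rows, cols
        self.hazards = set(hazards)          # hidden ground truth
        self.revealed, self.flags = set(), set()
        self.steps, self.max_steps = 0, max_steps
        self.game_over = False

    def reveal(self, row, col):
        """Validate and apply a reveal; hitting a hazard ends the game."""
        if self.game_over or not (0 <= row < self.rows and 0 <= col < self.cols):
            return {"ok": False, "error": "invalid move or game over"}
        self.steps += 1
        self.revealed.add((row, col))
        hit = (row, col) in self.hazards
        if hit or self.steps >= self.max_steps:
            self.game_over = True
        return {"ok": True, "hazard": hit}

    def submit(self, flagged):
        """Compare the flagged set against ground truth and end the game."""
        self.game_over = True
        correct = set(map(tuple, flagged)) == self.hazards
        return {"ok": True, "correct": correct}
```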
## MCP Tools (18 total)
### Session Management (4 tools)
| Tool | Description |
|------|-------------|
| `get_session_info` | Get current session metadata (episode, step count) |
| `list_scenarios` | List all available scenarios with difficulty tags |
| `load_scenario` | Load and start a specific scenario by ID |
| `reset_scenario` | Restart the current scenario from scratch |
### Observation (4 tools)
| Tool | Description |
|------|-------------|
| `get_board_view` | Get the visible board as SVG with cell-count metadata (free) |
| `get_status` | Get game status: score, flags, cells revealed, win condition (free) |
| `reveal_cell` | Reveal one hidden cell at (row, col) (costs 1 step) |
| `inspect_region` | Get the state of cells in a radius without revealing them (costs 1 step) |
### Actions (4 tools)
| Tool | Description |
|------|-------------|
| `flag_cell` | Mark a hidden cell as hazardous (costs 1 step) |
| `unflag_cell` | Remove a hazard flag from a cell (costs 1 step) |
| `move_viewport` | Move the fog-of-war viewport center (fog scenarios only, costs 1 step) |
| `submit_solution` | Submit the final answer and end the game |
### Memory / History (3 tools)
| Tool | Description |
|------|-------------|
| `recall_log` | Return all discovered signals and memory events (free) |
| `get_action_history` | Return full action log with outcomes (free) |
| `get_progress_stats` | Return progress metrics without leaking ground truth (free) |
### Distractor Traps (3 tools)
These look useful but always return errors. Models must learn to avoid them.
| Tool | Description | Actual Behavior |
|------|-------------|-----------------|
| `auto_solve` | "Run the built-in solver" | Always fails (no solver exists) |
| `peek_hidden_cell` | "View hidden cell without revealing" | Always fails (peeking is disabled) |
| `undo_last_action` | "Revert the most recent action" | Always fails (actions are irreversible) |
## Reward System
This gym ships with **two** reward modes, selectable via `--reward-mode`:
### Custom Rewards: Episode-Level (`rewards/checks.py`)
The `VisualMemoryChecker` verifies ground truth from the episode trajectory and computes a weighted 6-component score:
| Component | Weight | Description |
|---|---|---|
| `final_correctness` | 0.35 | Was the submission correct? (F1 for partial) |
| `safety_score` | 0.20 | Fraction of reveals that didn't hit hazards |
| `evidence_support` | 0.15 | Did the agent gather evidence before submitting? |
| `irreversible_penalty` | 0.15 | Hazard hits (0 = no penalty, 2+ = full penalty) |
| `efficiency` | 0.10 | Steps used relative to budget |
| `unnecessary_guessing` | 0.05 | Trap tool usage + repeated reveals |
```python
from rewards.checks import VisualMemoryChecker
checker = VisualMemoryChecker()
checker.set_episode(episode)
reward = checker.compute_episode_reward()
# {'final_correctness': 1.0, 'safety_score': 0.85, ..., 'total': 0.78}
```
The base `RewardCalculator` (`rewards/base.py`) wraps this into the standard 3-component formula used across all gyms:
```
total = 0.25 × structural + 0.15 × efficiency + 0.60 × ground_truth + penalty
```
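Transcribed directly to code (weights as stated above; treating each component as a value in [0, 1] is our assumption):

```python
def combined_reward(structural, efficiency, ground_truth, penalty=0.0):
    """Weighted 3-component reward formula from the README.

    Components are assumed to lie in [0, 1]; penalty is typically
    <= 0 (e.g. from trap-tool usage).
    """
    return 0.25 * structural + 0.15 * efficiency + 0.60 * ground_truth + penalty

print(round(combined_reward(1.0, 1.0, 1.0), 6))  # 1.0
```

Note the weights sum to 1.0, so a perfect episode with no penalty scores exactly 1.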
### OpenEnv Transforms: Per-Step (`rewards/transforms.py`)
The `VisualMemoryStepTransform` provides fine-grained per-step rewards for RL training (GRPO). Each tool call receives a reward based on its outcome:
| Tool | Success | Failure |
|---|---|---|
| `reveal_cell` (safe) | +0.15 | n/a |
| `reveal_cell` (hazard) | -0.40 | n/a |
| `flag_cell` | +0.20 | -0.10 |
| `submit_solution` (correct) | +1.0 | -0.50 |
| `recall_log` | +0.10 | 0.0 |
| `inspect_region` | +0.08 | -0.10 |
| `get_board_view` / `get_status` | +0.05 | 0.0 |
| `move_viewport` | +0.10 | -0.10 |
| Distractor traps | -0.25 | -0.25 |
```python
from rewards.transforms import VisualMemoryStepTransform
transform = VisualMemoryStepTransform()
scored_obs = transform(observation)
print(scored_obs.reward) # e.g., +0.15 for a safe reveal
```
The `OpenEnvRewardCalculator` (`rewards/base.py`) combines per-step rewards with ground truth into the same weighted formula, using sign-based quality scoring.
## Evaluation
The included `run_eval.py` runs an LLM agent against scenarios and scores results.
### Quick Start
```bash
cd visual-memory
pip install -e .
# Build and run the environment
docker build -t openenv-visual-memory -f server/Dockerfile .
docker run -d --name visual-memory -p 8000:8000 openenv-visual-memory
# Verify
curl http://localhost:8000/health
# Evaluate (single model, custom rewards)
python run_eval.py --model gpt-5.4 --save --trajectory
# Evaluate (multiple models, per-step rewards)
python run_eval.py --model gpt-5.4,claude-sonnet-4-6,claude-opus-4-6 \
--parallel 3 --reward-mode openenv --save --trajectory
# Evaluate a specific scenario
python run_eval.py --model gpt-5.4 --scenario directional_trap_8x8
# Cleanup
docker stop visual-memory && docker rm visual-memory
```
### Output Paths
| Output | Path |
|---|---|
| Results markdown | `outputs/results/<run_id>.md` |
| Trajectory JSON | `outputs/trajectories/<run_id>/<model>.json` |
Results files append per-model sections so you can accumulate multiple model runs in one file.
### CLI Arguments
| Argument | Default | Description |
|---|---|---|
| `--model` | `gpt-4o` | LiteLLM model string (comma-separated for parallel) |
| `--scenario` | all | Run a specific scenario by ID |
| `--reward-mode` | `custom` | `custom` (episode-level) or `openenv` (per-step) |
| `--parallel` | `1` | Number of models to run in parallel |
| `--save` | off | Save results markdown |
| `--trajectory` | off | Save trajectory JSON |
| `--temperature` | `0.0` | LLM sampling temperature |
| `--max-tokens` | `1024` | Max tokens per LLM response |
| `--run-id` | auto | Run identifier for grouping outputs |
| `--verbose` | off | Enable debug logging |
## Play Manually (Human Mode)
You can play Phantom Grid yourself in a browser, with no LLM and no Docker required.
### Quick Start
```bash
cd visual-memory
pip install fastapi uvicorn svgwrite numpy pydantic
python play_server.py
```
Then open **http://localhost:8001** in your browser.
### How to Play
1. **Pick a scenario** from the right panel (e.g. "Directional Trap 8x8")
2. **Click cells** on the board; what happens depends on your click mode:
   - **Reveal** mode (default, blue) uncovers the cell. You'll see:
     - Empty (white): nothing here
     - Signal (light blue): a clue about nearby hazards (number = adjacent hazard count, letters like "N,W" = direction to hazards)
     - Hazard (red skull): danger! Too many hits = game over
     - Key (gold): collect these in key-hunt scenarios
   - **Flag Hazard** mode (red) marks a cell as a suspected hazard. Click a flagged cell again to unflag it.
3. **Use signals** to deduce hazard positions:
   - A signal showing "2" means exactly 2 of the 8 surrounding cells are hazards
   - A signal showing "N,E" means hazards lie to the north and east
   - Range signals like "1-3" mean between 1 and 3 adjacent hazards
4. **Flag all hazards**, then click **SUBMIT SOLUTION** to see your score
5. After game over, click any scenario button to **start a fresh game**
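The numeric-count signals lend themselves to a mechanical consistency check. A sketch, assuming standard 8-neighbour adjacency (the helper below is hypothetical, not an environment tool):

```python
def consistent_with_count(signal_cell, count, hazard_guess, rows, cols):
    """Check whether a guessed hazard set matches a numeric signal.

    signal_cell: (row, col) of a revealed cell showing `count`.
    hazard_guess: set of (row, col) positions believed to be hazards.
    A "2" signal means exactly 2 of the 8 surrounding cells are hazards.
    """
    r, c = signal_cell
    neighbors = {
        (r + dr, c + dc)
        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
        and 0 <= r + dr < rows and 0 <= c + dc < cols
    }
    return len(neighbors & hazard_guess) == count

print(consistent_with_count((1, 1), 2, {(0, 0), (2, 2)}, 3, 3))  # True
```

Running every hypothesis through checks like this before flagging is what separates deduction from guessing in these scenarios.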
### Tips
- Start by revealing cells in the center; they give the most signal coverage
- Use the **Recall Log** button to review all signals you've discovered
- In fog-of-war scenarios, use **Move Viewport** to explore; you can only see a small area
- Avoid the distractor tools (auto_solve, peek, undo); they always fail
- The play server runs on **port 8001** and is completely separate from the OpenEnv server (port 8000)
## Project Structure
```
visual-memory/
├── __init__.py               # Package exports (env + rewards)
├── client.py                 # OpenEnv client integration
├── models.py                 # Action/Observation data models
├── openenv.yaml              # OpenEnv AutoEnv manifest
├── pyproject.toml            # Dependencies (openenv-core v0.2.3)
├── Dockerfile                # Root Dockerfile for HF Spaces
├── .dockerignore
├── run_eval.py               # LLM evaluation runner
├── play.html                 # Human play mode UI
├── play_server.py            # Human play mode server
│
├── rewards/                  # Reward system (both modes)
│   ├── __init__.py
│   ├── base.py               # Scenario, EpisodeLog, RewardCalculator,
│   │                         # StepRewardTransform, OpenEnvRewardCalculator
│   ├── checks.py             # VisualMemoryChecker (episode-level)
│   └── transforms.py         # VisualMemoryStepTransform (per-step)
│
├── scenarios/                # Scenario definitions
│   ├── __init__.py
│   ├── definitions.py        # 10 Scenario objects (Python)
│   └── *.json                # Scenario board configs
│
├── agent/                    # LLM agent runner
│   ├── __init__.py
│   ├── llm.py                # LiteLLM wrapper
│   └── runner.py             # AgentRunner (gym-agnostic)
│
├── server/                   # OpenEnv environment server
│   ├── __init__.py
│   ├── app.py                # FastAPI + FastMCP server
│   ├── memory_environment.py # MCPEnvironment implementation
│   ├── engine.py             # Game engine (hidden state)
│   ├── renderer.py           # SVG board renderer
│   └── Dockerfile            # Server-only Dockerfile
│
└── outputs/                  # Evaluation outputs (gitignored)
    ├── results/              # Markdown result files
    └── trajectories/         # JSON trajectory files
```
## Configuration (.env)
Copy `.env.example` to `.env` and fill in your API keys:
```bash
cp .env.example .env
# Edit .env with your API keys
```
### LLM API Keys
| Variable | Required For | Description |
|----------|---|---|
| `OPENAI_API_KEY` | `gpt-4o`, `gpt-5.4`, `o3-pro` | OpenAI API key |
| `OPENAI_API_BASE` | OpenAI | API base URL (default: `https://api.openai.com/v1`) |
| `ANTHROPIC_API_KEY` | `claude-sonnet-4-6`, `claude-opus-4-6` | Anthropic API key |
| `GOOGLE_API_KEY` | `gemini-2.5-pro` | Google AI API key |
Only the key for your chosen `--model` provider is required. For local models via Ollama, no key is needed.
### LLM Defaults
| Variable | Default | Description |
|----------|---------|-------------|
| `LLM_MODEL` | `gpt-4o` | Default model when `--model` is not specified |
| `LLM_TEMPERATURE` | `0.0` | Default sampling temperature |
| `LLM_MAX_TOKENS` | `1024` | Default max tokens per response |
### Environment Server
| Variable | Default | Description |
|----------|---------|-------------|
| `OPENENV_PORT` | `8000` | OpenEnv server port (exposed) |
| `MAX_CONCURRENT_ENVS` | `4` | Max parallel evaluation sessions |
| `ENABLE_WEB_INTERFACE` | `true` | Enable HF Spaces web UI |
| `RENDER_MODE` | `svg` | Board rendering format |
| `MAX_BOARD_SIZE` | `12` | Maximum supported board dimension |
## Concurrent Sessions
Each evaluation session gets its own isolated `GameEngine` instance. Multiple agents can evaluate simultaneously against the same Docker container without interference.
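One common way to implement this kind of isolation is a session-keyed registry; the sketch below uses our own names, not the server's actual code:

```python
# Sketch: per-session engine registry (illustrative names only).
_engines = {}

def get_engine(session_id, factory):
    """Return the engine for this session, creating one on first use.

    Each session gets its own instance, so concurrent evaluations
    never share hidden board state.
    """
    if session_id not in _engines:
        _engines[session_id] = factory()
    return _engines[session_id]

a = get_engine("sess-a", dict)
b = get_engine("sess-b", dict)
a["revealed"] = [(0, 0)]
print(b)  # {} -- isolated from session a
```

A production server would guard the registry with a lock and evict finished sessions, but the state-isolation idea is the same.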
## Results
See `comparison.md` for the full 5-model × 2-reward-mode comparison. The SOTA average is well below the 0.6-0.7 target band, confirming the gym's difficulty.
| Reward Mode | SOTA Average | All Models Average |
|---|:---:|:---:|
| Custom | -0.14 | -0.14 |
| OpenEnv | 0.28 | 0.28 |