---
title: Visual Memory
emoji: 🧠
colorFrom: purple
colorTo: indigo
sdk: docker
pinned: false
license: mit
app_port: 8000
base_path: /web
tags:
  - openenv
  - openenv-0.2.3
  - rl-environment
---

# Visual Memory Gym: *Phantom Grid*

**Hidden-state visual reasoning and planning under partial observability.**

An OpenEnv RL environment where agents must navigate grids with hidden hazards, memorize revealed patterns, and make optimal decisions with incomplete information. The name *Phantom Grid* reflects the core challenge: invisible dangers lurk beneath every cell, and the agent must deduce their locations from indirect signals, like hunting phantoms by their shadows. Designed to stress spatial reasoning, working memory, uncertainty handling, and risk-averse planning: areas where frontier LLMs consistently underperform.

## Playground Quick Start

Use the **Playground** panel (right side) to interact with the environment. Type a **Tool Name** and **Arguments Json**, then click **Step**.

### Typical workflow

1. Click **Reset** to start a fresh session
2. Enter `list_tools` (args: `{}`) → discover all available tools and their parameters
3. Enter `list_scenarios` (args: `{}`) → see all 10 scenarios
4. Enter `load_scenario` (args: `{"scenario_id": "directional_trap_8x8"}`) → start a game
5. Enter `get_board_view` (args: `{}`) → see the board as SVG
6. Enter `reveal_cell` (args: `{"row": 0, "col": 0}`) → uncover a cell and read its signal
7. Enter `inspect_region` (args: `{"center_row": 3, "center_col": 3, "radius": 1}`) → peek at nearby cells without revealing
8. Enter `flag_cell` (args: `{"row": 3, "col": 5}`) → mark a suspected hazard
9. Enter `submit_solution` (args: `{"flagged_positions": "[[3,5]]"}`) → submit your answer (ends the game)

### All tool commands (copy-paste ready)

#### Discovery & session tools

| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `list_tools` | `{}` | List every available tool with its parameters and types |
| `get_session_info` | `{}` | Current session/episode ID, step count, whether a scenario is loaded |
| `list_scenarios` | `{}` | List all 10 scenarios with difficulty, board size, and how-to-play hints |
| `load_scenario` | `{"scenario_id": "directional_trap_8x8"}` | Load and start a scenario (resets any in-progress game) |
| `reset_scenario` | `{}` | Restart the current scenario from scratch |

#### Observation tools

| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `get_board_view` | `{}` | Render the board as SVG with cell-count metadata (free; no step cost) |
| `get_status` | `{}` | Game status: step count, max steps, flags remaining, game over state (free) |
| `reveal_cell` | `{"row": 0, "col": 0}` | Reveal a hidden cell and return its content (costs 1 step) |
| `inspect_region` | `{"center_row": 3, "center_col": 3, "radius": 1}` | Peek at cells in a radius without revealing them (costs 1 step) |
| `move_viewport` | `{"row": 5, "col": 5}` | Move the fog-of-war camera center (fog scenarios only, costs 1 step) |

> **Note:** `inspect_region` uses `center_row` / `center_col` (not `row` / `col`). `radius` is optional and defaults to `1`.

#### Action tools

| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `flag_cell` | `{"row": 1, "col": 1}` | Mark a cell as hazardous (costs 1 step) |
| `unflag_cell` | `{"row": 1, "col": 1}` | Remove a hazard flag (costs 1 step) |
| `submit_solution` | `{"flagged_positions": "[[0,1],[2,3]]"}` | Submit your final answer and end the game |

> **Note:** `submit_solution` also accepts an optional `safe_positions` argument (JSON string of `[[row,col],...]`).
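
When driving this from code rather than the Playground, the JSON-inside-JSON is easy to malform. A minimal sketch using `json.dumps` twice, built on the `VisualMemoryAction` client model shown later in this README (the variable names are illustrative):

```python
import json

from visual_memory import VisualMemoryAction

# flagged_positions is itself a JSON *string* inside the arguments object,
# so serialize twice rather than hand-writing nested quotes.
flagged = [[0, 1], [2, 3]]
action = VisualMemoryAction(
    tool_name="submit_solution",
    arguments_json=json.dumps({"flagged_positions": json.dumps(flagged)}),
)
print(action.arguments_json)  # {"flagged_positions": "[[0, 1], [2, 3]]"}
```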

#### Memory & history tools

| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `recall_log` | `{}` | Review all signals and memory events discovered so far (free) |
| `get_action_history` | `{}` | Full log of every action taken and its outcome (free) |
| `get_progress_stats` | `{}` | Progress metrics: % cells revealed, flags placed, steps remaining (free) |

#### Trap tools (avoid these!)

These exist to test whether an agent takes shortcuts. They always fail and give a **-0.1 reward penalty**.

| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `auto_solve` | `{}` | Attempts to auto-solve; always rejected |
| `peek_hidden_cell` | `{"row": 2, "col": 2}` | Attempts to cheat-peek a cell; always rejected |
| `undo_last_action` | `{}` | Attempts to undo; always rejected |

### Run locally

```bash
cd visual-memory
pip install -e .

# Start the environment server
docker build -t openenv-visual-memory -f Dockerfile .
docker run -d --name visual-memory -p 8000:8000 openenv-visual-memory

# Verify it's running
curl http://localhost:8000/health

# Open the playground in your browser
open http://localhost:8000/web/
```

## Hugging Face Space Deployment

This Space is built from the OpenEnv environment `visual_memory`.

- **Space URL**: `https://huggingface.co/spaces/huzzle-labs/visual_memory`
- **OpenEnv pinned ref**: `0.2.3`
- **Hub tag**: `openenv`

### Connecting from Code

Connect using the `VisualMemoryEnv` client:

```python
from visual_memory import VisualMemoryAction, VisualMemoryEnv

with VisualMemoryEnv.from_env("huzzle-labs/visual_memory") as env:
    obs = env.reset()
    obs = env.step(VisualMemoryAction(
        tool_name="list_scenarios",
        arguments_json="{}"
    ))
    obs = env.step(VisualMemoryAction(
        tool_name="load_scenario",
        arguments_json='{"scenario_id": "directional_trap_8x8"}'
    ))
    obs = env.step(VisualMemoryAction(
        tool_name="reveal_cell",
        arguments_json='{"row": 2, "col": 3}'
    ))
```

Or connect directly to a running server:

```python
env = VisualMemoryEnv(base_url="https://huzzle-labs-visual-memory.hf.space")
```

## What Is This Gym?

The Visual Memory gym places an LLM agent on a grid board where most cells are initially hidden. The agent must use MCP tools to reveal cells one at a time, interpret the signals (clues about nearby hazards), flag hazard locations, and submit a solution, all within a limited step budget. Every reveal risks hitting a hazard (which can end the game), so the agent must balance information gathering with caution.
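
The shape of a full episode, as a minimal sketch built on the `VisualMemoryEnv` client shown earlier; the `call` helper and the specific cells and flags are illustrative placeholders, not a real policy:

```python
import json

from visual_memory import VisualMemoryAction, VisualMemoryEnv

def call(env, tool, **kwargs):
    # Illustrative helper: one tool invocation per environment step.
    return env.step(VisualMemoryAction(
        tool_name=tool, arguments_json=json.dumps(kwargs)))

with VisualMemoryEnv.from_env("huzzle-labs/visual_memory") as env:
    env.reset()
    call(env, "load_scenario", scenario_id="directional_trap_8x8")
    # Gather evidence: reveal a few central cells and read their signals.
    for row, col in [(3, 3), (3, 4), (4, 3)]:
        call(env, "reveal_cell", row=row, col=col)
    call(env, "recall_log")  # free: review every signal seen so far
    # Commit once the evidence justifies it (placeholder choice here).
    call(env, "flag_cell", row=3, col=5)
    call(env, "submit_solution", flagged_positions=json.dumps([[3, 5]]))
```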

Unlike typical text-only reasoning benchmarks, this gym requires:

- **Spatial reasoning**: interpreting directional and range signals to triangulate hazard positions
- **Working memory**: recalling previously revealed information across many steps (some cells flash and then fade)
- **Risk assessment**: deciding when enough evidence exists to commit vs. when to gather more
- **Distractor resistance**: ignoring trap tools that look helpful but always fail or mislead

## Task Families (10 Scenarios)

The gym includes 10 hand-crafted scenarios across 4 task families:

### Hidden Grid (5 scenarios)
Deduce hazard locations from signal clues on partially revealed grids. Signal modes include numeric counts, directional arrows, ambiguous ranges, and partial directional hints.

| Scenario | Board | Hazards | Signal Mode | Difficulty |
|---|---|---|---|---|
| `ambiguous_cluster_10x10` | 10x10 | 18 | Range (min-max) | Hard |
| `directional_trap_8x8` | 8x8 | 14 | Directional (N/S/E/W) | Hard |
| `partial_intel_9x9` | 9x9 | 16 | Partial directional | Hard |
| `cascading_deduction_11x11` | 11x11 | 25 | Partial directional | Very Hard |
| `safe_zone_identification_9x9` | 9x9 | 22 | Range (min-max) | Hard |

### Pattern Memory (2 scenarios)
Some cells flash their content briefly then fade. The agent must memorize what was shown and use that memory to avoid hazards and collect keys.

| Scenario | Board | Special | Difficulty |
|---|---|---|---|
| `flash_fade_minefield_7x7` | 7x7 | Flash-then-fade cells | Hard |
| `delayed_recall_keys_8x8` | 8x8 | 5 keys to collect from faded memory | Hard |

### Fog of War (2 scenarios)
The agent has a limited viewport radius and must move it around the board to explore. Planning efficient exploration paths is critical.

| Scenario | Board | Viewport | Difficulty |
|---|---|---|---|
| `fog_labyrinth_10x10` | 10x10 | Radius 2 | Hard |
| `fog_key_hunt_8x8` | 8x8 | Radius 1 (tiny) | Very Hard |
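
Exploration can be made systematic. A sketch (not part of the gym; the function and spacing policy are assumptions) that yields serpentine viewport centers spaced one viewport-width apart, so successive views tile the board:

```python
def viewport_sweep(board_size: int, radius: int):
    """Yield (row, col) viewport centers that cover the whole board.

    Centers are spaced 2*radius + 1 apart so adjacent views tile the
    board without overlap; rows alternate direction to shorten moves.
    """
    step = 2 * radius + 1
    centers = list(range(radius, board_size, step))
    # Make sure the last stripe reaches the board edge.
    if centers[-1] + radius < board_size - 1:
        centers.append(board_size - 1 - radius)
    for i, row in enumerate(centers):
        cols = centers if i % 2 == 0 else list(reversed(centers))
        for col in cols:
            yield row, col

# e.g. for fog_labyrinth_10x10 (radius 2): centers at 2 and 7 per axis.
print(list(viewport_sweep(10, 2)))  # [(2, 2), (2, 7), (7, 7), (7, 2)]
```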

### Distractor Search (1 scenario)
Decoys visually resemble keys. The agent must distinguish real targets from decoys while avoiding hazards.

| Scenario | Board | Keys | Decoys | Difficulty |
|---|---|---|---|---|
| `decoy_minefield_8x10` | 8x10 | 4 | 8 | Very Hard |

## Architecture

```
┌────────────────────────────────────────┐
│          OpenEnv Server (:8000)        │
│  ┌────────────┐  ┌───────────────────┐ │
│  │  FastMCP   │──│ MemoryEnvironment │ │
│  │ (18 tools) │  │  (MCPEnvironment) │ │
│  └────────────┘  └─────────┬─────────┘ │
│                            │           │
│             ┌──────────────┼─────────┐ │
│             │   Engine     │ Renderer│ │
│             │  (hidden     │  (SVG)  │ │
│             │   state)     │         │ │
│             └──────────────┴─────────┘ │
└────────────────────────────────────────┘
```

All state is in-memory per session. No database, no external APIs. The engine manages the hidden board, validates moves, and computes win/loss conditions. The renderer produces deterministic SVG board views.
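
For intuition, a toy deterministic renderer in the spirit of `server/renderer.py`; the real renderer's cell states, colors, and sizes will differ, and `svgwrite` is only known here as a dependency of the human play server:

```python
import svgwrite

CELL = 32  # px per cell; illustrative, not the gym's actual styling

def render_board(visible: list[list[str]]) -> str:
    """Render a grid of visible cell states as SVG markup.

    Deterministic: the same visible state always yields the same markup.
    """
    rows, cols = len(visible), len(visible[0])
    fill = {"hidden": "#94a3b8", "empty": "#ffffff",
            "signal": "#bfdbfe", "flag": "#fca5a5"}
    dwg = svgwrite.Drawing(size=(cols * CELL, rows * CELL))
    for r, row in enumerate(visible):
        for c, state in enumerate(row):
            dwg.add(dwg.rect(insert=(c * CELL, r * CELL),
                             size=(CELL, CELL),
                             fill=fill.get(state, "#e5e7eb"),
                             stroke="#334155"))
    return dwg.tostring()

print(render_board([["hidden", "signal"], ["flag", "empty"]])[:60])
```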

## MCP Tools (18 total)

### Session Management (4 tools)

| Tool | Description |
|------|-------------|
| `get_session_info` | Get current session metadata (episode, step count) |
| `list_scenarios` | List all available scenarios with difficulty tags |
| `load_scenario` | Load and start a specific scenario by ID |
| `reset_scenario` | Restart the current scenario from scratch |

### Observation (4 tools)

| Tool | Description |
|------|-------------|
| `get_board_view` | Get the visible board as SVG with cell-count metadata (free) |
| `get_status` | Get game status: score, flags, cells revealed, win condition (free) |
| `reveal_cell` | Reveal one hidden cell at (row, col); costs 1 step |
| `inspect_region` | Get state of cells in a radius without revealing; costs 1 step |

### Actions (4 tools)

| Tool | Description |
|------|-------------|
| `flag_cell` | Mark a hidden cell as hazardous; costs 1 step |
| `unflag_cell` | Remove a hazard flag from a cell; costs 1 step |
| `move_viewport` | Move fog-of-war viewport center; costs 1 step (fog scenarios only) |
| `submit_solution` | Submit final answer and end the game |

### Memory / History (3 tools)

| Tool | Description |
|------|-------------|
| `recall_log` | Return all discovered signals and memory events (free) |
| `get_action_history` | Return full action log with outcomes (free) |
| `get_progress_stats` | Return progress metrics without leaking ground truth (free) |

### Distractor Traps (3 tools)

These look useful but always return errors. Models must learn to avoid them.

| Tool | Description | Actual Behavior |
|------|-------------|-----------------|
| `auto_solve` | "Run the built-in solver" | Always fails; no solver exists |
| `peek_hidden_cell` | "View hidden cell without revealing" | Always fails; peeking disabled |
| `undo_last_action` | "Revert the most recent action" | Always fails; actions are irreversible |

## Reward System

This gym ships with **two** reward modes, selectable via `--reward-mode`:

### Custom Rewards: Episode-Level (`rewards/checks.py`)

The `VisualMemoryChecker` verifies ground truth from the episode trajectory and computes a weighted 6-component score:

| Component | Weight | Description |
|---|---|---|
| `final_correctness` | 0.35 | Was the submission correct? (F1 for partial) |
| `safety_score` | 0.20 | Fraction of reveals that didn't hit hazards |
| `evidence_support` | 0.15 | Did the agent gather evidence before submitting? |
| `irreversible_penalty` | 0.15 | Hazard hits (0 = no penalty, 2+ = full penalty) |
| `efficiency` | 0.10 | Steps used relative to budget |
| `unnecessary_guessing` | 0.05 | Trap tool usage + repeated reveals |

```python
from rewards.checks import VisualMemoryChecker

checker = VisualMemoryChecker()
checker.set_episode(episode)
reward = checker.compute_episode_reward()
# {'final_correctness': 1.0, 'safety_score': 0.85, ..., 'total': 0.78}
```

The base `RewardCalculator` (`rewards/base.py`) wraps this into the standard 3-component formula used across all gyms:

```
total = 0.25 × structural + 0.15 × efficiency + 0.60 × ground_truth + penalty
```
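
As a worked example with made-up inputs (only the weights come from the formula above):

```python
# Illustrative numbers only: a mostly-correct, reasonably efficient episode.
structural, efficiency, ground_truth, penalty = 0.9, 0.6, 0.75, -0.1

total = 0.25 * structural + 0.15 * efficiency + 0.60 * ground_truth + penalty
print(round(total, 3))  # 0.225 + 0.09 + 0.45 - 0.1 = 0.665
```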

### OpenEnv Transforms: Per-Step (`rewards/transforms.py`)

The `VisualMemoryStepTransform` provides fine-grained per-step rewards for RL training (GRPO). Each tool call receives a reward based on its outcome:

| Tool | Success | Failure |
|---|---|---|
| `reveal_cell` (safe) | +0.15 | n/a |
| `reveal_cell` (hazard) | -0.40 | n/a |
| `flag_cell` | +0.20 | -0.10 |
| `submit_solution` (correct) | +1.0 | -0.50 |
| `recall_log` | +0.10 | 0.0 |
| `inspect_region` | +0.08 | -0.10 |
| `get_board_view` / `get_status` | +0.05 | 0.0 |
| `move_viewport` | +0.10 | -0.10 |
| Distractor traps | -0.25 | -0.25 |

```python
from rewards.transforms import VisualMemoryStepTransform

transform = VisualMemoryStepTransform()
scored_obs = transform(observation)
print(scored_obs.reward)  # e.g., +0.15 for a safe reveal
```

The `OpenEnvRewardCalculator` (`rewards/base.py`) combines per-step rewards with ground truth into the same weighted formula, using sign-based quality scoring.
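
"Sign-based quality scoring" is not spelled out here; one plausible reading, sketched below as an assumption rather than the actual `OpenEnvRewardCalculator` logic, scores step quality by the fraction of reward-bearing steps whose reward was positive:

```python
def sign_based_quality(step_rewards: list[float]) -> float:
    """Assumed semantics: fraction of nonzero-reward steps with positive sign."""
    signed = [r for r in step_rewards if r != 0.0]
    if not signed:
        return 0.0
    return sum(r > 0 for r in signed) / len(signed)

# e.g. a safe reveal, a trap call, and a good flag -> 2 of 3 positive.
print(sign_based_quality([0.15, -0.25, 0.20, 0.0]))  # 0.666...
```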

## Evaluation

The included `run_eval.py` runs an LLM agent against scenarios and scores results.

### Quick Start

```bash
cd visual-memory
pip install -e .

# Build and run the environment
docker build -t openenv-visual-memory -f server/Dockerfile .
docker run -d --name visual-memory -p 8000:8000 openenv-visual-memory

# Verify
curl http://localhost:8000/health

# Evaluate (single model, custom rewards)
python run_eval.py --model gpt-5.4 --save --trajectory

# Evaluate (multiple models, per-step rewards)
python run_eval.py --model gpt-5.4,claude-sonnet-4-6,claude-opus-4-6 \
  --parallel 3 --reward-mode openenv --save --trajectory

# Evaluate a specific scenario
python run_eval.py --model gpt-5.4 --scenario directional_trap_8x8

# Cleanup
docker stop visual-memory && docker rm visual-memory
```

### Output Paths

| Output | Path |
|---|---|
| Results markdown | `outputs/results/<run_id>.md` |
| Trajectory JSON | `outputs/trajectories/<run_id>/<model>.json` |

Results files append per-model sections so you can accumulate multiple model runs in one file.

### CLI Arguments

| Argument | Default | Description |
|---|---|---|
| `--model` | `gpt-4o` | LiteLLM model string (comma-separated for parallel) |
| `--scenario` | all | Run a specific scenario by ID |
| `--reward-mode` | `custom` | `custom` (episode-level) or `openenv` (per-step) |
| `--parallel` | `1` | Number of models to run in parallel |
| `--save` | off | Save results markdown |
| `--trajectory` | off | Save trajectory JSON |
| `--temperature` | `0.0` | LLM sampling temperature |
| `--max-tokens` | `1024` | Max tokens per LLM response |
| `--run-id` | auto | Run identifier for grouping outputs |
| `--verbose` | off | Enable debug logging |

## Play Manually (Human Mode)

You can play Phantom Grid yourself in a browser; no LLM or Docker required.

### Quick Start

```bash
cd visual-memory
pip install fastapi uvicorn svgwrite numpy pydantic
python play_server.py
```

Then open **http://localhost:8001** in your browser.

### How to Play

1. **Pick a scenario** from the right panel (e.g. "Directional Trap 8x8")
2. **Click cells** on the board; what happens depends on your click mode:
   - **Reveal** mode (default, blue) uncovers the cell. You'll see:
     - Empty (white): nothing here
     - Signal (light blue): a clue about nearby hazards (a number is the adjacent hazard count; letters like "N,W" point toward hazards)
     - Hazard (red skull): danger! Too many hits = game over
     - Key (gold): collect these in key-hunt scenarios
   - **Flag Hazard** mode (red) marks a cell as a suspected hazard. Click a flagged cell again to unflag it.
3. **Use signals** to deduce hazard positions (see the sketch after this list):
   - A signal showing "2" means 2 hazards are adjacent (among the 8 surrounding cells)
   - A signal showing "N,E" means hazards lie to the North and East
   - Range signals like "1-3" mean between 1 and 3 adjacent hazards
4. **Flag all hazards**, then click **SUBMIT SOLUTION** to see your score
5. After game over, click any scenario button to **start a fresh game**
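
As referenced in step 3, a tiny deduction helper showing how a numeric count signal constrains hazard placements; this is reader-side logic for playing along, not part of the gym's API:

```python
from itertools import combinations

def neighbors(row: int, col: int, size: int):
    """All in-bounds cells adjacent to (row, col), 8-connected."""
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if (dr, dc) != (0, 0) and 0 <= r < size and 0 <= c < size:
                yield r, c

def consistent_placements(signal_cell, count, size, known_safe=frozenset()):
    """Enumerate hazard sets around a count signal that match it exactly."""
    candidates = [p for p in neighbors(*signal_cell, size) if p not in known_safe]
    return list(combinations(candidates, count))

# A "2" signal at (0, 0) on an 8x8 board with (0, 1) already revealed safe:
# only one placement remains, so both remaining neighbors must be hazards.
print(consistent_placements((0, 0), 2, 8, known_safe={(0, 1)}))
```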

### Tips

- Start by revealing cells in the center; they give the most signal coverage
- Use the **Recall Log** button to review all signals you've discovered
- In fog-of-war scenarios, use **Move Viewport** to explore; you can only see a small area
- Avoid the distractor tools (auto_solve, peek, undo); they always fail
- The play server runs on **port 8001** and is completely separate from the OpenEnv server (port 8000)

## Project Structure

```
visual-memory/
├── __init__.py                  # Package exports (env + rewards)
├── client.py                    # OpenEnv client integration
├── models.py                    # Action/Observation data models
├── openenv.yaml                 # OpenEnv AutoEnv manifest
├── pyproject.toml               # Dependencies (openenv-core v0.2.3)
├── Dockerfile                   # Root Dockerfile for HF Spaces
├── .dockerignore
├── run_eval.py                  # LLM evaluation runner
├── play.html                    # Human play mode UI
├── play_server.py               # Human play mode server
│
├── rewards/                     # Reward system (both modes)
│   ├── __init__.py
│   ├── base.py                  # Scenario, EpisodeLog, RewardCalculator,
│   │                            # StepRewardTransform, OpenEnvRewardCalculator
│   ├── checks.py                # VisualMemoryChecker (episode-level)
│   └── transforms.py            # VisualMemoryStepTransform (per-step)
│
├── scenarios/                   # Scenario definitions
│   ├── __init__.py
│   ├── definitions.py           # 10 Scenario objects (Python)
│   └── *.json                   # Scenario board configs
│
├── agent/                       # LLM agent runner
│   ├── __init__.py
│   ├── llm.py                   # LiteLLM wrapper
│   └── runner.py                # AgentRunner (gym-agnostic)
│
├── server/                      # OpenEnv environment server
│   ├── __init__.py
│   ├── app.py                   # FastAPI + FastMCP server
│   ├── memory_environment.py    # MCPEnvironment implementation
│   ├── engine.py                # Game engine (hidden state)
│   ├── renderer.py              # SVG board renderer
│   └── Dockerfile               # Server-only Dockerfile
│
└── outputs/                     # Evaluation outputs (gitignored)
    ├── results/                 # Markdown result files
    └── trajectories/            # JSON trajectory files
```

## Configuration (.env)

Copy `.env.example` to `.env` and fill in your API keys:

```bash
cp .env.example .env
# Edit .env with your API keys
```

### LLM API Keys

| Variable | Required For | Description |
|----------|---|---|
| `OPENAI_API_KEY` | `gpt-4o`, `gpt-5.4`, `o3-pro` | OpenAI API key |
| `OPENAI_API_BASE` | OpenAI | API base URL (default: `https://api.openai.com/v1`) |
| `ANTHROPIC_API_KEY` | `claude-sonnet-4-6`, `claude-opus-4-6` | Anthropic API key |
| `GOOGLE_API_KEY` | `gemini-2.5-pro` | Google AI API key |

Only the key for your chosen `--model` provider is required. For local models via Ollama, no key is needed (LiteLLM accepts model strings such as `ollama/llama3`).

### LLM Defaults

| Variable | Default | Description |
|----------|---------|-------------|
| `LLM_MODEL` | `gpt-4o` | Default model when `--model` is not specified |
| `LLM_TEMPERATURE` | `0.0` | Default sampling temperature |
| `LLM_MAX_TOKENS` | `1024` | Default max tokens per response |

### Environment Server

| Variable | Default | Description |
|----------|---------|-------------|
| `OPENENV_PORT` | `8000` | OpenEnv server port (exposed) |
| `MAX_CONCURRENT_ENVS` | `4` | Max parallel evaluation sessions |
| `ENABLE_WEB_INTERFACE` | `true` | Enable HF Spaces web UI |
| `RENDER_MODE` | `svg` | Board rendering format |
| `MAX_BOARD_SIZE` | `12` | Maximum supported board dimension |

## Concurrent Sessions

Each evaluation session gets its own isolated `GameEngine` instance. Multiple agents can evaluate simultaneously against the same Docker container without interference.
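
A common shape for that isolation, sketched with invented names (the actual plumbing lives in `server/memory_environment.py` and may differ):

```python
from threading import Lock

class SessionRegistry:
    """Hypothetical registry: one engine per session ID, created lazily."""

    def __init__(self, engine_factory):
        self._engines = {}
        self._lock = Lock()
        self._factory = engine_factory

    def engine_for(self, session_id: str):
        # The lock only guards creation; after that, each engine is
        # touched by exactly one session, so sessions never interfere.
        with self._lock:
            if session_id not in self._engines:
                self._engines[session_id] = self._factory()
            return self._engines[session_id]
```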

## Results

See `comparison.md` for the full 5-model × 2-reward-mode comparison. The SOTA average is well below the 0.6-0.7 target band, confirming the gym's difficulty.

| Reward Mode | SOTA Average | All Models Average |
|---|:---:|:---:|
| Custom | -0.14 | -0.14 |
| OpenEnv | 0.28 | 0.28 |