---
title: Visual Memory
emoji: 🧠
colorFrom: purple
colorTo: indigo
sdk: docker
pinned: false
license: mit
app_port: 8000
base_path: /web
tags:
  - openenv
  - openenv-0.2.3
  - rl-environment
---

# Visual Memory Gym: *Phantom Grid*

**Hidden-state visual reasoning and planning under partial observability.**

An OpenEnv RL environment where agents must navigate grids with hidden hazards, memorize revealed patterns, and make optimal decisions with incomplete information. The name *Phantom Grid* reflects the core challenge: invisible dangers lurk beneath every cell, and the agent must deduce their locations from indirect signals, like hunting phantoms by their shadows. Designed to stress spatial reasoning, working memory, uncertainty handling, and risk-averse planning: areas where frontier LLMs consistently underperform.

## Playground Quick Start

Use the **Playground** panel (right side) to interact with the environment. Type a **Tool Name** and **Arguments Json**, then click **Step**.

### Typical workflow

1. Click **Reset** to start a fresh session
2. Enter `list_tools` (args: `{}`) → discover all available tools and their parameters
3. Enter `list_scenarios` (args: `{}`) → see all 10 scenarios
4. Enter `load_scenario` (args: `{"scenario_id": "directional_trap_8x8"}`) → start a game
5. Enter `get_board_view` (args: `{}`) → see the board as SVG
6. Enter `reveal_cell` (args: `{"row": 0, "col": 0}`) → uncover a cell and read its signal
7. Enter `inspect_region` (args: `{"center_row": 3, "center_col": 3, "radius": 1}`) → peek at nearby cells without revealing
8. Enter `flag_cell` (args: `{"row": 3, "col": 5}`) → mark a suspected hazard
9. Enter `submit_solution` (args: `{"flagged_positions": "[[3,5]]"}`) → submit your answer (ends the game)

### All tool commands (copy-paste ready)

#### Discovery & session tools

| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `list_tools` | `{}` | List every available tool with its parameters and types |
| `get_session_info` | `{}` | Current session/episode ID, step count, whether a scenario is loaded |
| `list_scenarios` | `{}` | List all 10 scenarios with difficulty, board size, and how-to-play hints |
| `load_scenario` | `{"scenario_id": "directional_trap_8x8"}` | Load and start a scenario (resets any in-progress game) |
| `reset_scenario` | `{}` | Restart the current scenario from scratch |

#### Observation tools

| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `get_board_view` | `{}` | Render the board as SVG with cell-count metadata (free; no step cost) |
| `get_status` | `{}` | Game status: step count, max steps, flags remaining, game over state (free) |
| `reveal_cell` | `{"row": 0, "col": 0}` | Reveal a hidden cell and return its content (costs 1 step) |
| `inspect_region` | `{"center_row": 3, "center_col": 3, "radius": 1}` | Peek at cells in a radius without revealing them (costs 1 step) |
| `move_viewport` | `{"row": 5, "col": 5}` | Move the fog-of-war camera center (fog scenarios only, costs 1 step) |

> **Note:** `inspect_region` uses `center_row` / `center_col` (not `row` / `col`). `radius` is optional and defaults to `1`.

#### Action tools

| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `flag_cell` | `{"row": 1, "col": 1}` | Mark a cell as hazardous (costs 1 step) |
| `unflag_cell` | `{"row": 1, "col": 1}` | Remove a hazard flag (costs 1 step) |
| `submit_solution` | `{"flagged_positions": "[[0,1],[2,3]]"}` | Submit your final answer and end the game |

> **Note:** `submit_solution` also accepts an optional `safe_positions` argument (JSON string of `[[row,col],...]`).
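
When driving this from code rather than the Playground, the JSON-inside-JSON is easy to malform. A minimal sketch using `json.dumps` twice, built on the `VisualMemoryAction` client model shown later in this README (the variable names are illustrative):

```python
import json

from visual_memory import VisualMemoryAction

# flagged_positions is itself a JSON *string* inside the arguments object,
# so serialize twice rather than hand-writing nested quotes.
flagged = [[0, 1], [2, 3]]
action = VisualMemoryAction(
    tool_name="submit_solution",
    arguments_json=json.dumps({"flagged_positions": json.dumps(flagged)}),
)
print(action.arguments_json)  # {"flagged_positions": "[[0, 1], [2, 3]]"}
```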

#### Memory & history tools

| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `recall_log` | `{}` | Review all signals and memory events discovered so far (free) |
| `get_action_history` | `{}` | Full log of every action taken and its outcome (free) |
| `get_progress_stats` | `{}` | Progress metrics: % cells revealed, flags placed, steps remaining (free) |

#### Trap tools (avoid these!)

These exist to test whether an agent takes shortcuts. They always fail and give a **-0.1 reward penalty**.

| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `auto_solve` | `{}` | Attempts to auto-solve; always rejected |
| `peek_hidden_cell` | `{"row": 2, "col": 2}` | Attempts to cheat-peek a cell; always rejected |
| `undo_last_action` | `{}` | Attempts to undo; always rejected |

### Run locally

```bash
cd visual-memory
pip install -e .

# Start the environment server
docker build -t openenv-visual-memory -f Dockerfile .
docker run -d --name visual-memory -p 8000:8000 openenv-visual-memory

# Verify it's running
curl http://localhost:8000/health

# Open the playground in your browser
open http://localhost:8000/web/
```

## Hugging Face Space Deployment

This Space is built from the OpenEnv environment `visual_memory`.

- **Space URL**: `https://huggingface.co/spaces/huzzle-labs/visual_memory`
- **OpenEnv pinned ref**: `0.2.3`
- **Hub tag**: `openenv`

### Connecting from Code

Connect using the `VisualMemoryEnv` client:

```python
from visual_memory import VisualMemoryAction, VisualMemoryEnv

with VisualMemoryEnv.from_env("huzzle-labs/visual_memory") as env:
    obs = env.reset()
    obs = env.step(VisualMemoryAction(
        tool_name="list_scenarios",
        arguments_json="{}"
    ))
    obs = env.step(VisualMemoryAction(
        tool_name="load_scenario",
        arguments_json='{"scenario_id": "directional_trap_8x8"}'
    ))
    obs = env.step(VisualMemoryAction(
        tool_name="reveal_cell",
        arguments_json='{"row": 2, "col": 3}'
    ))
```

Or connect directly to a running server:

```python
env = VisualMemoryEnv(base_url="https://huzzle-labs-visual-memory.hf.space")
```

## What Is This Gym?

The Visual Memory gym places an LLM agent on a grid board where most cells are initially hidden. The agent must use MCP tools to reveal cells one at a time, interpret the signals (clues about nearby hazards), flag hazard locations, and submit a solution, all within a limited step budget. Every reveal risks hitting a hazard (which can end the game), so the agent must balance information gathering with caution.
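
The shape of a full episode, as a minimal sketch built on the `VisualMemoryEnv` client shown earlier; the `call` helper and the specific cells and flags are illustrative placeholders, not a real policy:

```python
import json

from visual_memory import VisualMemoryAction, VisualMemoryEnv

def call(env, tool, **kwargs):
    # Illustrative helper: one tool invocation per environment step.
    return env.step(VisualMemoryAction(
        tool_name=tool, arguments_json=json.dumps(kwargs)))

with VisualMemoryEnv.from_env("huzzle-labs/visual_memory") as env:
    env.reset()
    call(env, "load_scenario", scenario_id="directional_trap_8x8")
    # Gather evidence: reveal a few central cells and read their signals.
    for row, col in [(3, 3), (3, 4), (4, 3)]:
        call(env, "reveal_cell", row=row, col=col)
    call(env, "recall_log")  # free: review every signal seen so far
    # Commit once the evidence justifies it (placeholder choice here).
    call(env, "flag_cell", row=3, col=5)
    call(env, "submit_solution", flagged_positions=json.dumps([[3, 5]]))
```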

Unlike typical text-only reasoning benchmarks, this gym requires:

- **Spatial reasoning**: interpreting directional and range signals to triangulate hazard positions
- **Working memory**: recalling previously revealed information across many steps (some cells flash and then fade)
- **Risk assessment**: deciding when enough evidence exists to commit vs. when to gather more
- **Distractor resistance**: ignoring trap tools that look helpful but always fail or mislead

## Task Families (10 Scenarios)

The gym includes 10 hand-crafted scenarios across 4 task families:

### Hidden Grid (5 scenarios)
Deduce hazard locations from signal clues on partially revealed grids. Signal modes include numeric counts, directional arrows, ambiguous ranges, and partial directional hints.

| Scenario | Board | Hazards | Signal Mode | Difficulty |
|---|---|---|---|---|
| `ambiguous_cluster_10x10` | 10x10 | 18 | Range (min-max) | Hard |
| `directional_trap_8x8` | 8x8 | 14 | Directional (N/S/E/W) | Hard |
| `partial_intel_9x9` | 9x9 | 16 | Partial directional | Hard |
| `cascading_deduction_11x11` | 11x11 | 25 | Partial directional | Very Hard |
| `safe_zone_identification_9x9` | 9x9 | 22 | Range (min-max) | Hard |

### Pattern Memory (2 scenarios)
Some cells flash their content briefly then fade. The agent must memorize what was shown and use that memory to avoid hazards and collect keys.

| Scenario | Board | Special | Difficulty |
|---|---|---|---|
| `flash_fade_minefield_7x7` | 7x7 | Flash-then-fade cells | Hard |
| `delayed_recall_keys_8x8` | 8x8 | 5 keys to collect from faded memory | Hard |

### Fog of War (2 scenarios)
The agent has a limited viewport radius and must move it around the board to explore. Planning efficient exploration paths is critical.

| Scenario | Board | Viewport | Difficulty |
|---|---|---|---|
| `fog_labyrinth_10x10` | 10x10 | Radius 2 | Hard |
| `fog_key_hunt_8x8` | 8x8 | Radius 1 (tiny) | Very Hard |
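
Exploration can be made systematic. A sketch (not part of the gym; the function and spacing policy are assumptions) that yields serpentine viewport centers spaced one viewport-width apart, so successive views tile the board:

```python
def viewport_sweep(board_size: int, radius: int):
    """Yield (row, col) viewport centers that cover the whole board.

    Centers are spaced 2*radius + 1 apart so adjacent views tile the
    board without overlap; rows alternate direction to shorten moves.
    """
    step = 2 * radius + 1
    centers = list(range(radius, board_size, step))
    # Make sure the last stripe reaches the board edge.
    if centers[-1] + radius < board_size - 1:
        centers.append(board_size - 1 - radius)
    for i, row in enumerate(centers):
        cols = centers if i % 2 == 0 else list(reversed(centers))
        for col in cols:
            yield row, col

# e.g. for fog_labyrinth_10x10 (radius 2): centers at 2 and 7 per axis.
print(list(viewport_sweep(10, 2)))  # [(2, 2), (2, 7), (7, 7), (7, 2)]
```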

### Distractor Search (1 scenario)
Decoys visually resemble keys. The agent must distinguish real targets from decoys while avoiding hazards.

| Scenario | Board | Keys | Decoys | Difficulty |
|---|---|---|---|---|
| `decoy_minefield_8x10` | 8x10 | 4 | 8 | Very Hard |

## Architecture

```
┌────────────────────────────────────────┐
│          OpenEnv Server (:8000)        │
│  ┌────────────┐  ┌───────────────────┐ │
│  │  FastMCP   │──│ MemoryEnvironment │ │
│  │ (18 tools) │  │  (MCPEnvironment) │ │
│  └────────────┘  └─────────┬─────────┘ │
│                            │           │
│             ┌──────────────┼─────────┐ │
│             │   Engine     │ Renderer│ │
│             │  (hidden     │  (SVG)  │ │
│             │   state)     │         │ │
│             └──────────────┴─────────┘ │
└────────────────────────────────────────┘
```

All state is in-memory per session. No database, no external APIs. The engine manages the hidden board, validates moves, and computes win/loss conditions. The renderer produces deterministic SVG board views.
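
For intuition, a toy deterministic renderer in the spirit of `server/renderer.py`; the real renderer's cell states, colors, and sizes will differ, and `svgwrite` is only known here as a dependency of the human play server:

```python
import svgwrite

CELL = 32  # px per cell; illustrative, not the gym's actual styling

def render_board(visible: list[list[str]]) -> str:
    """Render a grid of visible cell states as SVG markup.

    Deterministic: the same visible state always yields the same markup.
    """
    rows, cols = len(visible), len(visible[0])
    fill = {"hidden": "#94a3b8", "empty": "#ffffff",
            "signal": "#bfdbfe", "flag": "#fca5a5"}
    dwg = svgwrite.Drawing(size=(cols * CELL, rows * CELL))
    for r, row in enumerate(visible):
        for c, state in enumerate(row):
            dwg.add(dwg.rect(insert=(c * CELL, r * CELL),
                             size=(CELL, CELL),
                             fill=fill.get(state, "#e5e7eb"),
                             stroke="#334155"))
    return dwg.tostring()

print(render_board([["hidden", "signal"], ["flag", "empty"]])[:60])
```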

## MCP Tools (18 total)

### Session Management (4 tools)

| Tool | Description |
|------|-------------|
| `get_session_info` | Get current session metadata (episode, step count) |
| `list_scenarios` | List all available scenarios with difficulty tags |
| `load_scenario` | Load and start a specific scenario by ID |
| `reset_scenario` | Restart the current scenario from scratch |

### Observation (4 tools)

| Tool | Description |
|------|-------------|
| `get_board_view` | Get the visible board as SVG with cell-count metadata (free) |
| `get_status` | Get game status: score, flags, cells revealed, win condition (free) |
| `reveal_cell` | Reveal one hidden cell at (row, col); costs 1 step |
| `inspect_region` | Get state of cells in a radius without revealing; costs 1 step |

### Actions (4 tools)

| Tool | Description |
|------|-------------|
| `flag_cell` | Mark a hidden cell as hazardous; costs 1 step |
| `unflag_cell` | Remove a hazard flag from a cell; costs 1 step |
| `move_viewport` | Move fog-of-war viewport center; costs 1 step (fog scenarios only) |
| `submit_solution` | Submit final answer and end the game |

### Memory / History (3 tools)

| Tool | Description |
|------|-------------|
| `recall_log` | Return all discovered signals and memory events (free) |
| `get_action_history` | Return full action log with outcomes (free) |
| `get_progress_stats` | Return progress metrics without leaking ground truth (free) |

### Distractor Traps (3 tools)

These look useful but always return errors. Models must learn to avoid them.

| Tool | Description | Actual Behavior |
|------|-------------|-----------------|
| `auto_solve` | "Run the built-in solver" | Always fails; no solver exists |
| `peek_hidden_cell` | "View hidden cell without revealing" | Always fails; peeking disabled |
| `undo_last_action` | "Revert the most recent action" | Always fails; actions are irreversible |

## Reward System

This gym ships with **two** reward modes, selectable via `--reward-mode`:

### Custom Rewards: Episode-Level (`rewards/checks.py`)

The `VisualMemoryChecker` verifies ground truth from the episode trajectory and computes a weighted 6-component score:

| Component | Weight | Description |
|---|---|---|
| `final_correctness` | 0.35 | Was the submission correct? (F1 for partial) |
| `safety_score` | 0.20 | Fraction of reveals that didn't hit hazards |
| `evidence_support` | 0.15 | Did the agent gather evidence before submitting? |
| `irreversible_penalty` | 0.15 | Hazard hits (0 = no penalty, 2+ = full penalty) |
| `efficiency` | 0.10 | Steps used relative to budget |
| `unnecessary_guessing` | 0.05 | Trap tool usage + repeated reveals |

```python
from rewards.checks import VisualMemoryChecker

checker = VisualMemoryChecker()
checker.set_episode(episode)
reward = checker.compute_episode_reward()
# {'final_correctness': 1.0, 'safety_score': 0.85, ..., 'total': 0.78}
```

The base `RewardCalculator` (`rewards/base.py`) wraps this into the standard 3-component formula used across all gyms:

```
total = 0.25 × structural + 0.15 × efficiency + 0.60 × ground_truth + penalty
```
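
As a worked example with made-up inputs (only the weights come from the formula above):

```python
# Illustrative numbers only: a mostly-correct, reasonably efficient episode.
structural, efficiency, ground_truth, penalty = 0.9, 0.6, 0.75, -0.1

total = 0.25 * structural + 0.15 * efficiency + 0.60 * ground_truth + penalty
print(round(total, 3))  # 0.225 + 0.09 + 0.45 - 0.1 = 0.665
```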

### OpenEnv Transforms: Per-Step (`rewards/transforms.py`)

The `VisualMemoryStepTransform` provides fine-grained per-step rewards for RL training (GRPO). Each tool call receives a reward based on its outcome:

| Tool | Success | Failure |
|---|---|---|
| `reveal_cell` (safe) | +0.15 | n/a |
| `reveal_cell` (hazard) | -0.40 | n/a |
| `flag_cell` | +0.20 | -0.10 |
| `submit_solution` (correct) | +1.0 | -0.50 |
| `recall_log` | +0.10 | 0.0 |
| `inspect_region` | +0.08 | -0.10 |
| `get_board_view` / `get_status` | +0.05 | 0.0 |
| `move_viewport` | +0.10 | -0.10 |
| Distractor traps | -0.25 | -0.25 |

```python
from rewards.transforms import VisualMemoryStepTransform

transform = VisualMemoryStepTransform()
scored_obs = transform(observation)
print(scored_obs.reward)  # e.g., +0.15 for a safe reveal
```

The `OpenEnvRewardCalculator` (`rewards/base.py`) combines per-step rewards with ground truth into the same weighted formula, using sign-based quality scoring.
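
"Sign-based quality scoring" is not spelled out here; one plausible reading, sketched below as an assumption rather than the actual `OpenEnvRewardCalculator` logic, scores step quality by the fraction of reward-bearing steps whose reward was positive:

```python
def sign_based_quality(step_rewards: list[float]) -> float:
    """Assumed semantics: fraction of nonzero-reward steps with positive sign."""
    signed = [r for r in step_rewards if r != 0.0]
    if not signed:
        return 0.0
    return sum(r > 0 for r in signed) / len(signed)

# e.g. a safe reveal, a trap call, and a good flag -> 2 of 3 positive.
print(sign_based_quality([0.15, -0.25, 0.20, 0.0]))  # 0.666...
```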

## Evaluation

The included `run_eval.py` runs an LLM agent against scenarios and scores results.

### Quick Start

```bash
cd visual-memory
pip install -e .

# Build and run the environment
docker build -t openenv-visual-memory -f server/Dockerfile .
docker run -d --name visual-memory -p 8000:8000 openenv-visual-memory

# Verify
curl http://localhost:8000/health

# Evaluate (single model, custom rewards)
python run_eval.py --model gpt-5.4 --save --trajectory

# Evaluate (multiple models, per-step rewards)
python run_eval.py --model gpt-5.4,claude-sonnet-4-6,claude-opus-4-6 \
  --parallel 3 --reward-mode openenv --save --trajectory

# Evaluate a specific scenario
python run_eval.py --model gpt-5.4 --scenario directional_trap_8x8

# Cleanup
docker stop visual-memory && docker rm visual-memory
```

### Output Paths

| Output | Path |
|---|---|
| Results markdown | `outputs/results/<run_id>.md` |
| Trajectory JSON | `outputs/trajectories/<run_id>/<model>.json` |

Results files append per-model sections so you can accumulate multiple model runs in one file.

### CLI Arguments

| Argument | Default | Description |
|---|---|---|
| `--model` | `gpt-4o` | LiteLLM model string (comma-separated for parallel) |
| `--scenario` | all | Run a specific scenario by ID |
| `--reward-mode` | `custom` | `custom` (episode-level) or `openenv` (per-step) |
| `--parallel` | `1` | Number of models to run in parallel |
| `--save` | off | Save results markdown |
| `--trajectory` | off | Save trajectory JSON |
| `--temperature` | `0.0` | LLM sampling temperature |
| `--max-tokens` | `1024` | Max tokens per LLM response |
| `--run-id` | auto | Run identifier for grouping outputs |
| `--verbose` | off | Enable debug logging |

## Play Manually (Human Mode)

You can play Phantom Grid yourself in a browser; no LLM or Docker required.

### Quick Start

```bash
cd visual-memory
pip install fastapi uvicorn svgwrite numpy pydantic
python play_server.py
```

Then open **http://localhost:8001** in your browser.

### How to Play

1. **Pick a scenario** from the right panel (e.g. "Directional Trap 8x8")
2. **Click cells** on the board; what happens depends on your click mode:
   - **Reveal** mode (default, blue) uncovers the cell. You'll see:
     - Empty (white): nothing here
     - Signal (light blue): a clue about nearby hazards (a number is the adjacent hazard count; letters like "N,W" point toward hazards)
     - Hazard (red skull): danger! Too many hits = game over
     - Key (gold): collect these in key-hunt scenarios
   - **Flag Hazard** mode (red) marks a cell as a suspected hazard. Click a flagged cell again to unflag it.
3. **Use signals** to deduce hazard positions (see the sketch after this list):
   - A signal showing "2" means 2 hazards are adjacent (among the 8 surrounding cells)
   - A signal showing "N,E" means hazards lie to the North and East
   - Range signals like "1-3" mean between 1 and 3 adjacent hazards
4. **Flag all hazards**, then click **SUBMIT SOLUTION** to see your score
5. After game over, click any scenario button to **start a fresh game**
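
As referenced in step 3, a tiny deduction helper showing how a numeric count signal constrains hazard placements; this is reader-side logic for playing along, not part of the gym's API:

```python
from itertools import combinations

def neighbors(row: int, col: int, size: int):
    """All in-bounds cells adjacent to (row, col), 8-connected."""
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if (dr, dc) != (0, 0) and 0 <= r < size and 0 <= c < size:
                yield r, c

def consistent_placements(signal_cell, count, size, known_safe=frozenset()):
    """Enumerate hazard sets around a count signal that match it exactly."""
    candidates = [p for p in neighbors(*signal_cell, size) if p not in known_safe]
    return list(combinations(candidates, count))

# A "2" signal at (0, 0) on an 8x8 board with (0, 1) already revealed safe:
# only one placement remains, so both remaining neighbors must be hazards.
print(consistent_placements((0, 0), 2, 8, known_safe={(0, 1)}))
```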

### Tips

- Start by revealing cells in the center; they give the most signal coverage
- Use the **Recall Log** button to review all signals you've discovered
- In fog-of-war scenarios, use **Move Viewport** to explore; you can only see a small area
- Avoid the distractor tools (auto_solve, peek, undo); they always fail
- The play server runs on **port 8001** and is completely separate from the OpenEnv server (port 8000)

## Project Structure

```
visual-memory/
├── __init__.py                  # Package exports (env + rewards)
├── client.py                    # OpenEnv client integration
├── models.py                    # Action/Observation data models
├── openenv.yaml                 # OpenEnv AutoEnv manifest
├── pyproject.toml               # Dependencies (openenv-core v0.2.3)
├── Dockerfile                   # Root Dockerfile for HF Spaces
├── .dockerignore
├── run_eval.py                  # LLM evaluation runner
├── play.html                    # Human play mode UI
├── play_server.py               # Human play mode server
│
├── rewards/                     # Reward system (both modes)
│   ├── __init__.py
│   ├── base.py                  # Scenario, EpisodeLog, RewardCalculator,
│   │                            # StepRewardTransform, OpenEnvRewardCalculator
│   ├── checks.py                # VisualMemoryChecker (episode-level)
│   └── transforms.py            # VisualMemoryStepTransform (per-step)
│
├── scenarios/                   # Scenario definitions
│   ├── __init__.py
│   ├── definitions.py           # 10 Scenario objects (Python)
│   └── *.json                   # Scenario board configs
│
├── agent/                       # LLM agent runner
│   ├── __init__.py
│   ├── llm.py                   # LiteLLM wrapper
│   └── runner.py                # AgentRunner (gym-agnostic)
│
├── server/                      # OpenEnv environment server
│   ├── __init__.py
│   ├── app.py                   # FastAPI + FastMCP server
│   ├── memory_environment.py    # MCPEnvironment implementation
│   ├── engine.py                # Game engine (hidden state)
│   ├── renderer.py              # SVG board renderer
│   └── Dockerfile               # Server-only Dockerfile
│
└── outputs/                     # Evaluation outputs (gitignored)
    ├── results/                 # Markdown result files
    └── trajectories/            # JSON trajectory files
```

## Configuration (.env)

Copy `.env.example` to `.env` and fill in your API keys:

```bash
cp .env.example .env
# Edit .env with your API keys
```

### LLM API Keys

| Variable | Required For | Description |
|----------|---|---|
| `OPENAI_API_KEY` | `gpt-4o`, `gpt-5.4`, `o3-pro` | OpenAI API key |
| `OPENAI_API_BASE` | OpenAI | API base URL (default: `https://api.openai.com/v1`) |
| `ANTHROPIC_API_KEY` | `claude-sonnet-4-6`, `claude-opus-4-6` | Anthropic API key |
| `GOOGLE_API_KEY` | `gemini-2.5-pro` | Google AI API key |

Only the key for your chosen `--model` provider is required. For local models via Ollama, no key is needed (LiteLLM accepts model strings such as `ollama/llama3`).

### LLM Defaults

| Variable | Default | Description |
|----------|---------|-------------|
| `LLM_MODEL` | `gpt-4o` | Default model when `--model` is not specified |
| `LLM_TEMPERATURE` | `0.0` | Default sampling temperature |
| `LLM_MAX_TOKENS` | `1024` | Default max tokens per response |

### Environment Server

| Variable | Default | Description |
|----------|---------|-------------|
| `OPENENV_PORT` | `8000` | OpenEnv server port (exposed) |
| `MAX_CONCURRENT_ENVS` | `4` | Max parallel evaluation sessions |
| `ENABLE_WEB_INTERFACE` | `true` | Enable HF Spaces web UI |
| `RENDER_MODE` | `svg` | Board rendering format |
| `MAX_BOARD_SIZE` | `12` | Maximum supported board dimension |

## Concurrent Sessions

Each evaluation session gets its own isolated `GameEngine` instance. Multiple agents can evaluate simultaneously against the same Docker container without interference.
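
A common shape for that isolation, sketched with invented names (the actual plumbing lives in `server/memory_environment.py` and may differ):

```python
from threading import Lock

class SessionRegistry:
    """Hypothetical registry: one engine per session ID, created lazily."""

    def __init__(self, engine_factory):
        self._engines = {}
        self._lock = Lock()
        self._factory = engine_factory

    def engine_for(self, session_id: str):
        # The lock only guards creation; after that, each engine is
        # touched by exactly one session, so sessions never interfere.
        with self._lock:
            if session_id not in self._engines:
                self._engines[session_id] = self._factory()
            return self._engines[session_id]
```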

## Results

See `comparison.md` for the full 5-model × 2-reward-mode comparison. The SOTA average is well below the 0.6-0.7 target band, confirming the gym's difficulty.

| Reward Mode | SOTA Average | All Models Average |
|---|:---:|:---:|
| Custom | -0.14 | -0.14 |
| OpenEnv | 0.28 | 0.28 |