
Visual Memory Gym: Model Comparison

Date: 2026-03-18
Gym Version: 0.1.0
Scenarios: 10 (across 4 task families)
Models: 5 (3 Anthropic, 2 OpenAI)
Reward Modes: custom (episode-level) and openenv (per-step transforms)


Overall Results

Custom Rewards (episode-level from rewards/base.py)

Reward components: Structural (0.25) + Ground Truth (0.60) + Efficiency (0.15) - Hallucination Penalty (up to -1.0)

| # | Model | Avg Reward | Best Scenario | Worst Scenario | Total Time |
| --- | --- | --- | --- | --- | --- |
| 1 | claude-opus-4-6 | 0.08 | flash_fade_minefield (0.63) | cascading_deduction (-0.83) | 1518.9s |
| 2 | gpt-5 | -0.13 | flash_fade_minefield (0.69) | decoy_minefield (-0.78) | 2967.2s |
| 3 | gpt-5.4 | -0.16 | partial_intel (0.63) | directional_trap (-0.81) | 225.1s |
| 4 | claude-opus-4-20250514 | -0.17 | ambiguous_cluster (0.60) | cascading_deduction (-0.80) | 1197.4s |
| 5 | claude-sonnet-4-6 | -0.33 | partial_intel (0.40) | directional_trap (-0.83) | 1105.2s |
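To make the weighting concrete, here is a minimal sketch of how the listed components could combine into one episode reward. It is not the code in rewards/base.py: the `EpisodeScores` container, the assumption that every component is normalized to [0, 1], and the linear penalty scale are all illustrative.

```python
from dataclasses import dataclass

# Weights copied from the component line above.
W_STRUCTURAL = 0.25
W_GROUND_TRUTH = 0.60
W_EFFICIENCY = 0.15
MAX_HALLUCINATION_PENALTY = 1.0


@dataclass
class EpisodeScores:
    """Hypothetical container; each field is assumed to lie in [0, 1]."""
    structural: float
    ground_truth: float
    efficiency: float
    hallucination: float  # 0.0 = no penalty, 1.0 = full penalty (assumed scale)


def custom_reward(s: EpisodeScores) -> float:
    """Weighted sum of the positive components minus the hallucination penalty."""
    positive = (
        W_STRUCTURAL * s.structural
        + W_GROUND_TRUTH * s.ground_truth
        + W_EFFICIENCY * s.efficiency
    )
    return positive - MAX_HALLUCINATION_PENALTY * s.hallucination


clean = EpisodeScores(structural=1.0, ground_truth=0.5, efficiency=0.5, hallucination=0.0)
penalized = EpisodeScores(structural=1.0, ground_truth=0.5, efficiency=0.5, hallucination=1.0)
print(round(custom_reward(clean), 3), round(custom_reward(penalized), 3))
```

Under these assumptions the same episode moves from a positive score to a negative one as soon as the full penalty applies, which is the pattern of sign flips visible in the table.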

OpenEnv Transform Rewards (per-step from rewards/transforms/)

Reward components: Step Rewards (0.40) + Ground Truth (0.60) - Hallucination Penalty (up to -1.0)

| # | Model | Avg Reward | Best Scenario | Worst Scenario | Total Time |
| --- | --- | --- | --- | --- | --- |
| 1 | claude-opus-4-6 | 0.31 | directional_trap (0.37) | decoy_minefield (0.14) | 1584.6s |
| 2 | claude-opus-4-20250514 | 0.31 | ambiguous_cluster (0.55) | delayed_recall_keys (0.13) | 1185.2s |
| 3 | gpt-5 | 0.28 | flash_fade_minefield (0.53) | ambiguous_cluster (0.12) | 3770.8s |
| 4 | gpt-5.4 | 0.27 | fog_labyrinth (0.52) | directional_trap (0.08) | 287.4s |
| 5 | claude-sonnet-4-6 | 0.26 | ambiguous_cluster (0.35) | directional_trap (0.14) | 1048.0s |
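The per-step variant can be sketched the same way. This is an assumption-heavy illustration, not the logic in rewards/transforms/: it supposes that each step emits a reward in [0, 1], that the step component is the mean of those values, and that the penalty is subtracted directly. The function name `openenv_reward` is hypothetical.

```python
from typing import Sequence

# Weights copied from the component line above.
W_STEP = 0.40
W_GROUND_TRUTH = 0.60


def openenv_reward(
    step_rewards: Sequence[float],
    ground_truth: float,
    hallucination: float = 0.0,
) -> float:
    """Combine per-step rewards with the ground-truth score.

    Assumption: step rewards are in [0, 1] and are averaged over the episode.
    """
    step_component = sum(step_rewards) / len(step_rewards) if step_rewards else 0.0
    return W_STEP * step_component + W_GROUND_TRUTH * ground_truth - hallucination


# An episode with partial step-level progress but no correct final answer
# still earns some reward under this scheme.
print(round(openenv_reward([1.0, 0.5, 0.0], ground_truth=0.0), 3))
```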

Per-Scenario Breakdown: Custom Rewards

| Scenario | gpt-5.4 | gpt-5 | claude-sonnet-4-6 | claude-opus-4-6 | claude-opus-4-20250514 | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| ambiguous_cluster_10x10 | 0.40 | -0.75 | 0.38 | 0.36 | 0.60 | 0.20 |
| directional_trap_8x8 | -0.81 | 0.41 | -0.83 | 0.61 | 0.42 | -0.04 |
| partial_intel_9x9 | 0.63 | 0.42 | 0.40 | 0.41 | 0.44 | 0.46 |
| flash_fade_minefield_7x7 | 0.42 | 0.69 | -0.78 | 0.63 | -0.76 | 0.04 |
| delayed_recall_keys_8x8 | 0.45 | 0.53 | -0.80 | 0.43 | -0.79 | -0.04 |
| decoy_minefield_8x10 | -0.80 | -0.78 | -0.81 | -0.82 | -0.79 | -0.80 |
| fog_labyrinth_10x10 | -0.74 | 0.44 | 0.39 | 0.40 | 0.40 | 0.18 |
| fog_key_hunt_8x8 | -0.73 | -0.77 | -0.79 | -0.80 | -0.79 | -0.78 |
| cascading_deduction_11x11 | -0.78 | -0.76 | -0.79 | -0.83 | -0.80 | -0.79 |
| safe_zone_identification_9x9 | 0.40 | -0.77 | 0.37 | 0.38 | 0.41 | 0.16 |
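The "Avg" column and the per-model averages reported under Overall Results can be recomputed from the cells of this table. The snippet below is a small verification sketch; the helper names are illustrative and the data is copied from the table above.

```python
from statistics import fmean

MODELS = ["gpt-5.4", "gpt-5", "claude-sonnet-4-6", "claude-opus-4-6", "claude-opus-4-20250514"]

# Custom-reward scores copied from the table above (one list per scenario, in MODELS order).
CUSTOM = {
    "ambiguous_cluster_10x10": [0.40, -0.75, 0.38, 0.36, 0.60],
    "directional_trap_8x8": [-0.81, 0.41, -0.83, 0.61, 0.42],
    "partial_intel_9x9": [0.63, 0.42, 0.40, 0.41, 0.44],
    "flash_fade_minefield_7x7": [0.42, 0.69, -0.78, 0.63, -0.76],
    "delayed_recall_keys_8x8": [0.45, 0.53, -0.80, 0.43, -0.79],
    "decoy_minefield_8x10": [-0.80, -0.78, -0.81, -0.82, -0.79],
    "fog_labyrinth_10x10": [-0.74, 0.44, 0.39, 0.40, 0.40],
    "fog_key_hunt_8x8": [-0.73, -0.77, -0.79, -0.80, -0.79],
    "cascading_deduction_11x11": [-0.78, -0.76, -0.79, -0.83, -0.80],
    "safe_zone_identification_9x9": [0.40, -0.77, 0.37, 0.38, 0.41],
}


def scenario_averages(table):
    """Mean across models for each scenario (the row-wise 'Avg' column)."""
    return {name: round(fmean(row), 2) for name, row in table.items()}


def model_averages(table, models):
    """Mean across scenarios for each model (the column-wise average)."""
    columns = zip(*table.values())
    return {model: round(fmean(col), 2) for model, col in zip(models, columns)}


print(scenario_averages(CUSTOM))
print(model_averages(CUSTOM, MODELS))
```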

Hardest Scenarios (Custom)

  1. decoy_minefield_8x10 (avg -0.80): All 5 models fail; the hallucination penalty is triggered universally
  2. cascading_deduction_11x11 (avg -0.79): Large board with partial signals defeats all models
  3. fog_key_hunt_8x8 (avg -0.78): Tiny viewport + fatal hazards; no model survives

Easiest Scenario (Custom)

  1. partial_intel_9x9 (avg 0.46): All 5 models achieve positive rewards here, the only scenario where that happens

Per-Scenario Breakdown: OpenEnv Transform Rewards

| Scenario | gpt-5.4 | gpt-5 | claude-sonnet-4-6 | claude-opus-4-6 | claude-opus-4-20250514 | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| ambiguous_cluster_10x10 | 0.33 | 0.12 | 0.35 | 0.35 | 0.55 | 0.34 |
| directional_trap_8x8 | 0.08 | 0.36 | 0.14 | 0.37 | 0.36 | 0.26 |
| partial_intel_9x9 | 0.32 | 0.34 | 0.34 | 0.36 | 0.36 | 0.34 |
| flash_fade_minefield_7x7 | 0.35 | 0.53 | 0.34 | 0.34 | 0.35 | 0.38 |
| delayed_recall_keys_8x8 | 0.34 | 0.34 | 0.34 | 0.36 | 0.13 | 0.30 |
| decoy_minefield_8x10 | 0.14 | 0.14 | 0.14 | 0.14 | 0.34 | 0.18 |
| fog_labyrinth_10x10 | 0.52 | 0.34 | 0.34 | 0.35 | 0.35 | 0.38 |
| fog_key_hunt_8x8 | 0.13 | 0.13 | 0.15 | 0.15 | 0.15 | 0.14 |
| cascading_deduction_11x11 | 0.13 | 0.14 | 0.15 | 0.35 | 0.14 | 0.18 |
| safe_zone_identification_9x9 | 0.34 | 0.34 | 0.34 | 0.34 | 0.35 | 0.34 |

Hardest Scenarios (OpenEnv)

  1. fog_key_hunt_8x8 (avg 0.14): Tiny viewport + fatal hazards; universally low
  2. decoy_minefield_8x10 (avg 0.18): Decoy-key confusion trips 4 of 5 models (claude-opus-4-20250514 reaches 0.34)
  3. cascading_deduction_11x11 (avg 0.18): Large partial-signal board overwhelms reasoning for 4 of 5 models (claude-opus-4-6 reaches 0.35)

Easiest Scenarios (OpenEnv)

  1. flash_fade_minefield_7x7 (avg 0.38): Pattern memory; gpt-5 stands out at 0.53
  2. fog_labyrinth_10x10 (avg 0.38): Fog navigation with reasonable viewport; gpt-5.4 stands out at 0.52

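The hardest and easiest lists above follow from sorting the per-scenario averages. A minimal sketch, using the OpenEnv "Avg" column and a hypothetical `rank_scenarios` helper:

```python
# Per-scenario OpenEnv averages, copied from the "Avg" column of the table above.
OPENENV_AVG = {
    "ambiguous_cluster_10x10": 0.34,
    "directional_trap_8x8": 0.26,
    "partial_intel_9x9": 0.34,
    "flash_fade_minefield_7x7": 0.38,
    "delayed_recall_keys_8x8": 0.30,
    "decoy_minefield_8x10": 0.18,
    "fog_labyrinth_10x10": 0.38,
    "fog_key_hunt_8x8": 0.14,
    "cascading_deduction_11x11": 0.18,
    "safe_zone_identification_9x9": 0.34,
}


def rank_scenarios(avgs, n, hardest=True):
    """Return the n lowest-scoring (hardest) or highest-scoring (easiest) scenarios.

    Ties keep their table order because Python's sort is stable.
    """
    ordered = sorted(avgs, key=avgs.get, reverse=not hardest)
    return ordered[:n]


print(rank_scenarios(OPENENV_AVG, 3, hardest=True))
print(rank_scenarios(OPENENV_AVG, 2, hardest=False))
```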
SOTA Average Assessment

SOTA models (gpt-5.4, claude-sonnet-4-6, claude-opus-4-6):

| Reward Mode | SOTA Average | Target Band | Status |
| --- | --- | --- | --- |
| Custom | -0.14 | 0.60–0.70 | Well below target; no hardening needed |
| OpenEnv | 0.28 | 0.60–0.70 | Below target; no hardening needed |

All 5 models average:

| Reward Mode | Overall Average |
| --- | --- |
| Custom | -0.14 |
| OpenEnv | 0.28 |

The gym is currently harder than target across both reward modes. No hardening adjustments are required.
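The assessment above amounts to averaging the SOTA subset and comparing the result with the target band. The sketch below reproduces that check from the reported model averages. The `assess` helper and its status labels are illustrative, and the reading that only an average above the band would call for hardening is an assumption inferred from the "no hardening needed" wording.

```python
from statistics import fmean

TARGET_BAND = (0.60, 0.70)
SOTA_MODELS = ("gpt-5.4", "claude-sonnet-4-6", "claude-opus-4-6")

# Per-model averages copied from the Overall Results tables.
CUSTOM_AVG = {
    "claude-opus-4-6": 0.08, "gpt-5": -0.13, "gpt-5.4": -0.16,
    "claude-opus-4-20250514": -0.17, "claude-sonnet-4-6": -0.33,
}
OPENENV_AVG = {
    "claude-opus-4-6": 0.31, "claude-opus-4-20250514": 0.31, "gpt-5": 0.28,
    "gpt-5.4": 0.27, "claude-sonnet-4-6": 0.26,
}


def assess(model_avgs, models=SOTA_MODELS, band=TARGET_BAND):
    """Return (mean over the chosen models, status relative to the target band)."""
    mean = fmean(model_avgs[m] for m in models)
    low, high = band
    if mean < low:
        status = "below target"   # gym already harder than intended
    elif mean > high:
        status = "above target"   # assumed trigger for hardening
    else:
        status = "within target"
    return round(mean, 2), status


print(assess(CUSTOM_AVG))
print(assess(OPENENV_AVG))
```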


Reward Mode Comparison

| Metric | Custom | OpenEnv |
| --- | --- | --- |
| Mean across all models | -0.14 | 0.28 |
| Std deviation (models) | 0.15 | 0.02 |
| Min model avg | -0.33 | 0.26 |
| Max model avg | 0.08 | 0.31 |
| Hallucination penalties hit | Frequent (-1.0) | None triggered |
| Reward spread | Very high (variance from penalties) | Compressed (narrow 0.08–0.55 range) |
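The standard deviations in this table match the sample standard deviation (n - 1 denominator) of the five rounded model averages. The check below uses Python's `statistics.stdev`; the population form (`statistics.pstdev`) would give 0.13 for the custom mode.

```python
from statistics import stdev

# Per-model averages copied from the Overall Results tables.
custom_model_avgs = [0.08, -0.13, -0.16, -0.17, -0.33]
openenv_model_avgs = [0.31, 0.31, 0.28, 0.27, 0.26]

# Sample standard deviation across models for each reward mode.
custom_std = round(stdev(custom_model_avgs), 2)
openenv_std = round(stdev(openenv_model_avgs), 2)

print(custom_std, openenv_std)
```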

Key insight: Custom rewards produce highly volatile scores driven by the -1.0 hallucination penalty. When models make even one incorrect assertion (tools report success but ground truth disagrees), the entire scenario score collapses. OpenEnv transform rewards are more granular and forgiving, rewarding incremental progress per-step.


Model Speed Rankings

| Model | Custom Time | OpenEnv Time | Avg per Scenario |
| --- | --- | --- | --- |
| gpt-5.4 | 225.1s | 287.4s | ~26s |
| claude-sonnet-4-6 | 1105.2s | 1048.0s | ~108s |
| claude-opus-4-20250514 | 1197.4s | 1185.2s | ~119s |
| claude-opus-4-6 | 1518.9s | 1584.6s | ~155s |
| gpt-5 | 2967.2s | 3770.8s | ~337s |

gpt-5.4 is roughly 4x faster per scenario than the next-fastest model (claude-sonnet-4-6, ~26s vs ~108s) while achieving competitive results.
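The per-scenario averages and the speed ratio between the two fastest models can be recomputed from the totals in the table. This sketch assumes each total covers the 10 scenarios of one run, so the two runs together cover 20 scenario executions per model.

```python
SCENARIOS_PER_RUN = 10

# (custom total, openenv total) in seconds, copied from the table above.
TIMES = {
    "gpt-5.4": (225.1, 287.4),
    "claude-sonnet-4-6": (1105.2, 1048.0),
    "claude-opus-4-20250514": (1197.4, 1185.2),
    "claude-opus-4-6": (1518.9, 1584.6),
    "gpt-5": (2967.2, 3770.8),
}


def avg_seconds_per_scenario(custom_s, openenv_s, scenarios=SCENARIOS_PER_RUN):
    """Average wall-clock seconds per scenario across both reward-mode runs."""
    return (custom_s + openenv_s) / (2 * scenarios)


per_scenario = {model: avg_seconds_per_scenario(*t) for model, t in TIMES.items()}

# Ratio between the fastest model and the runner-up.
fastest, runner_up = sorted(per_scenario, key=per_scenario.get)[:2]
speedup = per_scenario[runner_up] / per_scenario[fastest]

for model, seconds in per_scenario.items():
    print(f"{model}: ~{round(seconds)}s")
print(f"{fastest} vs {runner_up}: {speedup:.1f}x")
```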


Distractor Tool Usage

Models occasionally used distractor/trap tools, which indicates susceptibility to misleading tool descriptions:

  • peek_hidden_cell: Used by claude-opus-4-6 and claude-sonnet-4-6 (cheating tool: gives hidden info but is penalized)
  • undo_last_action: Used by claude-sonnet-4-6 (no-op trap)
  • reset_scenario: Used by multiple models (resets game state and wastes steps)
  • auto_solve: Not used by any model (good: the most egregious trap was avoided)
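Counting distractor calls in a trajectory is a simple filter over tool names. The sketch below assumes a trajectory has already been reduced to an ordered list of tool-name strings. The actual trajectory file format is not described here, and `reveal_cell` is a hypothetical name standing in for any legitimate tool.

```python
from collections import Counter
from typing import Iterable

# Distractor/trap tools named in the list above.
DISTRACTOR_TOOLS = {"peek_hidden_cell", "undo_last_action", "reset_scenario", "auto_solve"}


def distractor_usage(tool_calls: Iterable[str]) -> Counter:
    """Count how often each distractor tool appears in a sequence of tool names."""
    return Counter(name for name in tool_calls if name in DISTRACTOR_TOOLS)


# Hypothetical trajectory: "reveal_cell" is an invented placeholder for a normal tool.
calls = ["reveal_cell", "peek_hidden_cell", "reveal_cell", "reset_scenario", "peek_hidden_cell"]
usage = distractor_usage(calls)
print(dict(usage))
```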

Task Family Analysis

| Task Family | Scenarios | Custom Avg | OpenEnv Avg | Difficulty |
| --- | --- | --- | --- | --- |
| Hidden Grid (5) | ambiguous, directional, partial, cascading, safe_zone | 0.00 | 0.29 | Medium-Hard |
| Pattern Memory (2) | flash_fade, delayed_recall | 0.00 | 0.34 | Medium |
| Fog of War (2) | fog_labyrinth, fog_key_hunt | -0.30 | 0.26 | Hard |
| Distractor Search (1) | decoy_minefield | -0.80 | 0.18 | Very Hard |
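The family-level figures are means of the per-scenario averages within each family. The sketch below recomputes the OpenEnv column from the per-scenario "Avg" values reported earlier; the `FAMILIES` mapping restates the grouping in the table, and the helper name is illustrative.

```python
from statistics import fmean

# Scenario-to-family grouping as listed in the table above.
FAMILIES = {
    "Hidden Grid": ["ambiguous_cluster", "directional_trap", "partial_intel",
                    "cascading_deduction", "safe_zone_identification"],
    "Pattern Memory": ["flash_fade_minefield", "delayed_recall_keys"],
    "Fog of War": ["fog_labyrinth", "fog_key_hunt"],
    "Distractor Search": ["decoy_minefield"],
}

# Per-scenario OpenEnv averages, copied from the per-scenario breakdown.
OPENENV_SCENARIO_AVG = {
    "ambiguous_cluster": 0.34, "directional_trap": 0.26, "partial_intel": 0.34,
    "flash_fade_minefield": 0.38, "delayed_recall_keys": 0.30, "decoy_minefield": 0.18,
    "fog_labyrinth": 0.38, "fog_key_hunt": 0.14, "cascading_deduction": 0.18,
    "safe_zone_identification": 0.34,
}


def family_averages(families, scenario_avgs):
    """Mean of the per-scenario averages within each task family."""
    return {
        family: round(fmean(scenario_avgs[s] for s in scenarios), 2)
        for family, scenarios in families.items()
    }


print(family_averages(FAMILIES, OPENENV_SCENARIO_AVG))
```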

Files

| Type | Path |
| --- | --- |
| Custom results (all 5 models) | results/visual_memory/run_visual_memory_custom.md |
| OpenEnv results (all 5 models) | results/visual_memory/run_visual_memory_openenv.md |
| Custom trajectories (all 5 models) | trajectories/visual_memory/run_visual_memory_custom/ |
| OpenEnv trajectories (all 5 models) | trajectories/visual_memory/run_visual_memory_openenv/ |