# Visual Memory Gym: Model Comparison

**Date**: 2026-03-18  
**Gym Version**: `0.1.0`  
**Scenarios**: 10 (across 4 task families)  
**Models**: 5 (3 Anthropic, 2 OpenAI)  
**Reward Modes**: custom (episode-level) and openenv (per-step transforms)

---

## Overall Results

### Custom Rewards (episode-level from `rewards/base.py`)

Reward components: Structural (0.25) + Ground Truth (0.60) + Efficiency (0.15) - Hallucination Penalty (up to -1.0)

| # | Model | Avg Reward | Best Scenario | Worst Scenario | Total Time |
|---|-------|:---:|---|---|---:|
| 1 | `claude-opus-4-6` | **0.08** | flash_fade_minefield (0.63) | cascading_deduction (-0.83) | 1518.9s |
| 2 | `gpt-5` | **-0.13** | flash_fade_minefield (0.69) | decoy_minefield (-0.78) | 2967.2s |
| 3 | `gpt-5.4` | **-0.16** | partial_intel (0.63) | directional_trap (-0.81) | 225.1s |
| 4 | `claude-opus-4-20250514` | **-0.17** | ambiguous_cluster (0.60) | cascading_deduction (-0.80) | 1197.4s |
| 5 | `claude-sonnet-4-6` | **-0.33** | partial_intel (0.40) | directional_trap (-0.83) | 1105.2s |
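The stated weighting can be sketched as follows. This is an illustrative reduction only: the component names and the assumption that each score is normalized to [0, 1] are inferred from the formula above, and the actual logic lives in `rewards/base.py`.

```python
def episode_reward(structural: float, ground_truth: float,
                   efficiency: float, hallucination_penalty: float) -> float:
    """Combine episode-level components into one custom reward.

    Components are assumed to be in [0, 1]; the hallucination
    penalty is subtracted at full weight, so a maximal penalty can
    wipe out an otherwise perfect episode.
    """
    score = 0.25 * structural + 0.60 * ground_truth + 0.15 * efficiency
    return score - 1.0 * hallucination_penalty

# A strong, hallucination-free episode lands near +1.0:
episode_reward(1.0, 0.9, 0.8, 0.0)  # 0.25 + 0.54 + 0.12 = 0.91
```

Note that the positive components sum to at most 1.0 while the penalty alone spans the same magnitude, which is why a single triggered penalty dominates the episode score.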

### OpenEnv Transform Rewards (per-step from `rewards/transforms/`)

Reward components: Step Rewards (0.40) + Ground Truth (0.60) - Hallucination Penalty (up to -1.0)

| # | Model | Avg Reward | Best Scenario | Worst Scenario | Total Time |
|---|-------|:---:|---|---|---:|
| 1 | `claude-opus-4-6` | **0.31** | directional_trap (0.37) | decoy_minefield (0.14) | 1584.6s |
| 2 | `claude-opus-4-20250514` | **0.31** | ambiguous_cluster (0.55) | delayed_recall_keys (0.13) | 1185.2s |
| 3 | `gpt-5` | **0.28** | flash_fade_minefield (0.53) | ambiguous_cluster (0.12) | 3770.8s |
| 4 | `gpt-5.4` | **0.27** | fog_labyrinth (0.52) | directional_trap (0.08) | 287.4s |
| 5 | `claude-sonnet-4-6` | **0.26** | ambiguous_cluster (0.35) | directional_trap (0.14) | 1048.0s |
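A per-step analogue of the weighting above can be sketched like this. The averaging of step rewards is an assumption made for illustration; the real per-step transforms live in `rewards/transforms/`.

```python
def openenv_reward(step_rewards: list[float], ground_truth: float,
                   hallucination_penalty: float) -> float:
    """Blend averaged per-step transform rewards with the ground-truth score.

    step_rewards are assumed to be per-step values in [0, 1]. Averaging
    them compresses the final range, consistent with the narrow spread
    seen in the OpenEnv tables.
    """
    steps = sum(step_rewards) / len(step_rewards) if step_rewards else 0.0
    return 0.40 * steps + 0.60 * ground_truth - hallucination_penalty

openenv_reward([0.5, 0.5, 1.0], 0.3, 0.0)  # blends step average (2/3) with ground truth
```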

---

## Per-Scenario Breakdown β€” Custom Rewards

| Scenario | gpt-5.4 | gpt-5 | claude-sonnet-4-6 | claude-opus-4-6 | claude-opus-4-20250514 | Avg |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| ambiguous_cluster_10x10 | 0.40 | -0.75 | 0.38 | 0.36 | 0.60 | **0.20** |
| directional_trap_8x8 | -0.81 | 0.41 | -0.83 | 0.61 | 0.42 | **-0.04** |
| partial_intel_9x9 | 0.63 | 0.42 | 0.40 | 0.41 | 0.44 | **0.46** |
| flash_fade_minefield_7x7 | 0.42 | 0.69 | -0.78 | 0.63 | -0.76 | **0.04** |
| delayed_recall_keys_8x8 | 0.45 | 0.53 | -0.80 | 0.43 | -0.79 | **-0.04** |
| decoy_minefield_8x10 | -0.80 | -0.78 | -0.81 | -0.82 | -0.79 | **-0.80** |
| fog_labyrinth_10x10 | -0.74 | 0.44 | 0.39 | 0.40 | 0.40 | **0.18** |
| fog_key_hunt_8x8 | -0.73 | -0.77 | -0.79 | -0.80 | -0.79 | **-0.78** |
| cascading_deduction_11x11 | -0.78 | -0.76 | -0.79 | -0.83 | -0.80 | **-0.79** |
| safe_zone_identification_9x9 | 0.40 | -0.77 | 0.37 | 0.38 | 0.41 | **0.16** |

### Hardest Scenarios (Custom)
1. **decoy_minefield_8x10** (avg -0.80): All 5 models fail; the hallucination penalty is triggered universally
2. **cascading_deduction_11x11** (avg -0.79): Large board with partial signals defeats all models
3. **fog_key_hunt_8x8** (avg -0.78): Tiny viewport + fatal hazards; no model survives

### Easiest Scenario (Custom)
1. **partial_intel_9x9** (avg 0.46): Most models achieve positive rewards here

---

## Per-Scenario Breakdown β€” OpenEnv Transform Rewards

| Scenario | gpt-5.4 | gpt-5 | claude-sonnet-4-6 | claude-opus-4-6 | claude-opus-4-20250514 | Avg |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| ambiguous_cluster_10x10 | 0.33 | 0.12 | 0.35 | 0.35 | 0.55 | **0.34** |
| directional_trap_8x8 | 0.08 | 0.36 | 0.14 | 0.37 | 0.36 | **0.26** |
| partial_intel_9x9 | 0.32 | 0.34 | 0.34 | 0.36 | 0.36 | **0.34** |
| flash_fade_minefield_7x7 | 0.35 | 0.53 | 0.34 | 0.34 | 0.35 | **0.38** |
| delayed_recall_keys_8x8 | 0.34 | 0.34 | 0.34 | 0.36 | 0.13 | **0.30** |
| decoy_minefield_8x10 | 0.14 | 0.14 | 0.14 | 0.14 | 0.34 | **0.18** |
| fog_labyrinth_10x10 | 0.52 | 0.34 | 0.34 | 0.35 | 0.35 | **0.38** |
| fog_key_hunt_8x8 | 0.13 | 0.13 | 0.15 | 0.15 | 0.15 | **0.14** |
| cascading_deduction_11x11 | 0.13 | 0.14 | 0.15 | 0.35 | 0.14 | **0.18** |
| safe_zone_identification_9x9 | 0.34 | 0.34 | 0.34 | 0.34 | 0.35 | **0.34** |

### Hardest Scenarios (OpenEnv)
1. **fog_key_hunt_8x8** (avg 0.14): Tiny viewport + fatal hazards; universally low
2. **decoy_minefield_8x10** (avg 0.18): Decoy-key confusion trips all models
3. **cascading_deduction_11x11** (avg 0.18): Large partial-signal board overwhelms reasoning

### Easiest Scenarios (OpenEnv)
1. **flash_fade_minefield_7x7** (avg 0.38): Pattern memory; some models excel here
2. **fog_labyrinth_10x10** (avg 0.38): Fog navigation with reasonable viewport

---

## SOTA Average Assessment

**SOTA models** (gpt-5.4, claude-sonnet-4-6, claude-opus-4-6):

| Reward Mode | SOTA Average | Target Band | Status |
|---|:---:|:---:|---|
| Custom | **-0.14** | 0.60–0.70 | Well below target; no hardening needed |
| OpenEnv | **0.28** | 0.60–0.70 | Below target; no hardening needed |

**All 5 models average:**

| Reward Mode | Overall Average |
|---|:---:|
| Custom | **-0.14** |
| OpenEnv | **0.28** |

The gym is currently **harder than target** across both reward modes. No hardening adjustments are required.

---

## Reward Mode Comparison

| Metric | Custom | OpenEnv |
|---|:---:|:---:|
| Mean across all models | -0.14 | 0.28 |
| Std deviation (models) | 0.15 | 0.02 |
| Min model avg | -0.33 | 0.26 |
| Max model avg | 0.08 | 0.31 |
| Hallucination penalties hit | Frequent (-1.0) | None triggered |
| Reward spread | Very high (variance from penalties) | Compressed (narrow 0.12–0.55 range) |
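The summary statistics above can be reproduced from the rounded per-model averages in the Overall Results tables. Because the inputs here are already rounded to two decimals, the last digit may drift from the report's figures, which presumably aggregate unrounded episode scores.

```python
from statistics import mean, stdev

# Per-model averages from the Overall Results tables.
custom = [0.08, -0.13, -0.16, -0.17, -0.33]
openenv = [0.31, 0.31, 0.28, 0.27, 0.26]

for name, scores in [("custom", custom), ("openenv", openenv)]:
    # sample standard deviation matches the table's "Std deviation (models)"
    print(name, round(mean(scores), 2), round(stdev(scores), 2),
          min(scores), max(scores))
# custom line: -0.14 0.15 -0.33 0.08
```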

**Key insight**: Custom rewards produce highly volatile scores driven by the -1.0 hallucination penalty. When models make even one incorrect assertion (tools report success but ground truth disagrees), the entire scenario score collapses. OpenEnv transform rewards are more granular and forgiving, rewarding incremental progress per-step.
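The collapse is easy to see numerically. Using the stated custom weighting with illustrative component values, an episode that scores well on every component still goes negative once the full penalty fires:

```python
# Assumed custom weighting: 0.25*structural + 0.60*ground_truth
# + 0.15*efficiency - hallucination_penalty (illustrative values).
clean = 0.25 * 0.9 + 0.60 * 0.8 + 0.15 * 0.7   # solid episode, no penalty
flagged = clean - 1.0                           # same episode, full -1.0 penalty

print(round(clean, 3), round(flagged, 3))  # 0.81 -0.19
```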

---

## Model Speed Rankings

| Model | Custom Time | OpenEnv Time | Avg per Scenario |
|---|:---:|:---:|:---:|
| `gpt-5.4` | 225.1s | 287.4s | ~26s |
| `claude-sonnet-4-6` | 1105.2s | 1048.0s | ~108s |
| `claude-opus-4-20250514` | 1197.4s | 1185.2s | ~119s |
| `claude-opus-4-6` | 1518.9s | 1584.6s | ~155s |
| `gpt-5` | 2967.2s | 3770.8s | ~337s |

GPT-5.4 is roughly 4–5x faster than the next-fastest model (~26s vs ~108s per scenario) while achieving competitive results.

---

## Distractor Tool Usage

Models occasionally used distractor/trap tools, which indicates susceptibility to misleading tool descriptions:

- **`peek_hidden_cell`**: Used by claude-opus-4-6 and claude-sonnet-4-6 (cheating tool; gives hidden info but is penalized)
- **`undo_last_action`**: Used by claude-sonnet-4-6 (no-op trap)
- **`reset_scenario`**: Used by multiple models (resets game state, wasting steps)
- **`auto_solve`**: Not used by any model (good: the most egregious trap was avoided)

---

## Task Family Analysis

| Task Family | Scenarios | Custom Avg | OpenEnv Avg | Difficulty |
|---|---|:---:|:---:|---|
| Hidden Grid (5) | ambiguous, directional, partial, cascading, safe_zone | -0.06 | 0.29 | Medium-Hard |
| Pattern Memory (2) | flash_fade, delayed_recall | 0.00 | 0.34 | Medium |
| Fog of War (2) | fog_labyrinth, fog_key_hunt | -0.30 | 0.26 | Hard |
| Distractor Search (1) | decoy_minefield | -0.80 | 0.18 | Very Hard |

---

## Files

| Type | Path |
|---|---|
| Custom results (all 5 models) | `results/visual_memory/run_visual_memory_custom.md` |
| OpenEnv results (all 5 models) | `results/visual_memory/run_visual_memory_openenv.md` |
| Custom trajectories (all 5 models) | `trajectories/visual_memory/run_visual_memory_custom/` |
| OpenEnv trajectories (all 5 models) | `trajectories/visual_memory/run_visual_memory_openenv/` |