# Visual Memory Gym: Model Comparison

**Date**: 2026-03-18  
**Gym Version**: `0.1.0`  
**Scenarios**: 10 (across 4 task families)  
**Models**: 5 (3 Anthropic, 2 OpenAI)  
**Reward Modes**: custom (episode-level) and openenv (per-step transforms)

---

## Overall Results

### Custom Rewards (episode-level from `rewards/base.py`)

Reward components: Structural (0.25) + Ground Truth (0.60) + Efficiency (0.15) - Hallucination Penalty (up to -1.0)

| # | Model | Avg Reward | Best Scenario | Worst Scenario | Total Time |
|---|-------|:---:|---|---|---:|
| 1 | `claude-opus-4-6` | **0.08** | flash_fade_minefield (0.63) | cascading_deduction (-0.83) | 1518.9s |
| 2 | `gpt-5` | **-0.13** | flash_fade_minefield (0.69) | decoy_minefield (-0.78) | 2967.2s |
| 3 | `gpt-5.4` | **-0.16** | partial_intel (0.63) | directional_trap (-0.81) | 225.1s |
| 4 | `claude-opus-4-20250514` | **-0.17** | ambiguous_cluster (0.60) | cascading_deduction (-0.80) | 1197.4s |
| 5 | `claude-sonnet-4-6` | **-0.33** | partial_intel (0.40) | directional_trap (-0.83) | 1105.2s |
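The stated weighting can be sketched as follows. This is an illustrative reduction only: the component names and the assumption that each score is normalized to [0, 1] are inferred from the formula above, and the actual logic lives in `rewards/base.py`.

```python
def episode_reward(structural: float, ground_truth: float,
                   efficiency: float, hallucination_penalty: float) -> float:
    """Combine episode-level components into one custom reward.

    Components are assumed to be in [0, 1]; the hallucination
    penalty is subtracted at full weight, so a maximal penalty can
    wipe out an otherwise perfect episode.
    """
    score = 0.25 * structural + 0.60 * ground_truth + 0.15 * efficiency
    return score - 1.0 * hallucination_penalty

# A strong, hallucination-free episode lands near +1.0:
episode_reward(1.0, 0.9, 0.8, 0.0)  # 0.25 + 0.54 + 0.12 = 0.91
```

Note that the positive components sum to at most 1.0 while the penalty alone spans the same magnitude, which is why a single triggered penalty dominates the episode score.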

### OpenEnv Transform Rewards (per-step from `rewards/transforms/`)

Reward components: Step Rewards (0.40) + Ground Truth (0.60) - Hallucination Penalty (up to -1.0)

| # | Model | Avg Reward | Best Scenario | Worst Scenario | Total Time |
|---|-------|:---:|---|---|---:|
| 1 | `claude-opus-4-6` | **0.31** | directional_trap (0.37) | decoy_minefield (0.14) | 1584.6s |
| 2 | `claude-opus-4-20250514` | **0.31** | ambiguous_cluster (0.55) | delayed_recall_keys (0.13) | 1185.2s |
| 3 | `gpt-5` | **0.28** | flash_fade_minefield (0.53) | ambiguous_cluster (0.12) | 3770.8s |
| 4 | `gpt-5.4` | **0.27** | fog_labyrinth (0.52) | directional_trap (0.08) | 287.4s |
| 5 | `claude-sonnet-4-6` | **0.26** | ambiguous_cluster (0.35) | directional_trap (0.14) | 1048.0s |
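A per-step analogue of the weighting above can be sketched like this. The averaging of step rewards is an assumption made for illustration; the real per-step transforms live in `rewards/transforms/`.

```python
def openenv_reward(step_rewards: list[float], ground_truth: float,
                   hallucination_penalty: float) -> float:
    """Blend averaged per-step transform rewards with the ground-truth score.

    step_rewards are assumed to be per-step values in [0, 1]. Averaging
    them compresses the final range, consistent with the narrow spread
    seen in the OpenEnv tables.
    """
    steps = sum(step_rewards) / len(step_rewards) if step_rewards else 0.0
    return 0.40 * steps + 0.60 * ground_truth - hallucination_penalty

openenv_reward([0.5, 0.5, 1.0], 0.3, 0.0)  # blends step average (2/3) with ground truth
```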

---

## Per-Scenario Breakdown β€” Custom Rewards

| Scenario | gpt-5.4 | gpt-5 | claude-sonnet-4-6 | claude-opus-4-6 | claude-opus-4-20250514 | Avg |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| ambiguous_cluster_10x10 | 0.40 | -0.75 | 0.38 | 0.36 | 0.60 | **0.20** |
| directional_trap_8x8 | -0.81 | 0.41 | -0.83 | 0.61 | 0.42 | **-0.04** |
| partial_intel_9x9 | 0.63 | 0.42 | 0.40 | 0.41 | 0.44 | **0.46** |
| flash_fade_minefield_7x7 | 0.42 | 0.69 | -0.78 | 0.63 | -0.76 | **0.04** |
| delayed_recall_keys_8x8 | 0.45 | 0.53 | -0.80 | 0.43 | -0.79 | **-0.04** |
| decoy_minefield_8x10 | -0.80 | -0.78 | -0.81 | -0.82 | -0.79 | **-0.80** |
| fog_labyrinth_10x10 | -0.74 | 0.44 | 0.39 | 0.40 | 0.40 | **0.18** |
| fog_key_hunt_8x8 | -0.73 | -0.77 | -0.79 | -0.80 | -0.79 | **-0.78** |
| cascading_deduction_11x11 | -0.78 | -0.76 | -0.79 | -0.83 | -0.80 | **-0.79** |
| safe_zone_identification_9x9 | 0.40 | -0.77 | 0.37 | 0.38 | 0.41 | **0.16** |

### Hardest Scenarios (Custom)
1. **decoy_minefield_8x10** (avg -0.80): All 5 models fail; the hallucination penalty is triggered universally
2. **cascading_deduction_11x11** (avg -0.79): Large board with partial signals defeats all models
3. **fog_key_hunt_8x8** (avg -0.78): Tiny viewport + fatal hazards; no model survives

### Easiest Scenario (Custom)
1. **partial_intel_9x9** (avg 0.46): Most models achieve positive rewards here

---

## Per-Scenario Breakdown β€” OpenEnv Transform Rewards

| Scenario | gpt-5.4 | gpt-5 | claude-sonnet-4-6 | claude-opus-4-6 | claude-opus-4-20250514 | Avg |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| ambiguous_cluster_10x10 | 0.33 | 0.12 | 0.35 | 0.35 | 0.55 | **0.34** |
| directional_trap_8x8 | 0.08 | 0.36 | 0.14 | 0.37 | 0.36 | **0.26** |
| partial_intel_9x9 | 0.32 | 0.34 | 0.34 | 0.36 | 0.36 | **0.34** |
| flash_fade_minefield_7x7 | 0.35 | 0.53 | 0.34 | 0.34 | 0.35 | **0.38** |
| delayed_recall_keys_8x8 | 0.34 | 0.34 | 0.34 | 0.36 | 0.13 | **0.30** |
| decoy_minefield_8x10 | 0.14 | 0.14 | 0.14 | 0.14 | 0.34 | **0.18** |
| fog_labyrinth_10x10 | 0.52 | 0.34 | 0.34 | 0.35 | 0.35 | **0.38** |
| fog_key_hunt_8x8 | 0.13 | 0.13 | 0.15 | 0.15 | 0.15 | **0.14** |
| cascading_deduction_11x11 | 0.13 | 0.14 | 0.15 | 0.35 | 0.14 | **0.18** |
| safe_zone_identification_9x9 | 0.34 | 0.34 | 0.34 | 0.34 | 0.35 | **0.34** |

### Hardest Scenarios (OpenEnv)
1. **fog_key_hunt_8x8** (avg 0.14): Tiny viewport + fatal hazards; universally low
2. **decoy_minefield_8x10** (avg 0.18): Decoy-key confusion trips all models
3. **cascading_deduction_11x11** (avg 0.18): Large partial-signal board overwhelms reasoning

### Easiest Scenarios (OpenEnv)
1. **flash_fade_minefield_7x7** (avg 0.38): Pattern memory; some models excel here
2. **fog_labyrinth_10x10** (avg 0.38): Fog navigation with reasonable viewport

---

## SOTA Average Assessment

**SOTA models** (gpt-5.4, claude-sonnet-4-6, claude-opus-4-6):

| Reward Mode | SOTA Average | Target Band | Status |
|---|:---:|:---:|---|
| Custom | **-0.14** | 0.60–0.70 | Well below target; no hardening needed |
| OpenEnv | **0.28** | 0.60–0.70 | Below target; no hardening needed |

**All 5 models average:**

| Reward Mode | Overall Average |
|---|:---:|
| Custom | **-0.14** |
| OpenEnv | **0.28** |

The gym is currently **harder than target** across both reward modes. No hardening adjustments are required.

---

## Reward Mode Comparison

| Metric | Custom | OpenEnv |
|---|:---:|:---:|
| Mean across all models | -0.14 | 0.28 |
| Std deviation (models) | 0.15 | 0.02 |
| Min model avg | -0.33 | 0.26 |
| Max model avg | 0.08 | 0.31 |
| Hallucination penalties hit | Frequent (-1.0) | None triggered |
| Reward spread | Very high (variance from penalties) | Compressed (narrow 0.12–0.55 range) |
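The summary statistics above can be reproduced from the rounded per-model averages in the Overall Results tables. Because the inputs here are already rounded to two decimals, the last digit may drift from the report's figures, which presumably aggregate unrounded episode scores.

```python
from statistics import mean, stdev

# Per-model averages from the Overall Results tables.
custom = [0.08, -0.13, -0.16, -0.17, -0.33]
openenv = [0.31, 0.31, 0.28, 0.27, 0.26]

for name, scores in [("custom", custom), ("openenv", openenv)]:
    # sample standard deviation matches the table's "Std deviation (models)"
    print(name, round(mean(scores), 2), round(stdev(scores), 2),
          min(scores), max(scores))
# custom line: -0.14 0.15 -0.33 0.08
```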

**Key insight**: Custom rewards produce highly volatile scores driven by the -1.0 hallucination penalty. When models make even one incorrect assertion (tools report success but ground truth disagrees), the entire scenario score collapses. OpenEnv transform rewards are more granular and forgiving, rewarding incremental progress per-step.
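The collapse is easy to see numerically. Using the stated custom weighting with illustrative component values, an episode that scores well on every component still goes negative once the full penalty fires:

```python
# Assumed custom weighting: 0.25*structural + 0.60*ground_truth
# + 0.15*efficiency - hallucination_penalty (illustrative values).
clean = 0.25 * 0.9 + 0.60 * 0.8 + 0.15 * 0.7   # solid episode, no penalty
flagged = clean - 1.0                           # same episode, full -1.0 penalty

print(round(clean, 3), round(flagged, 3))  # 0.81 -0.19
```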

---

## Model Speed Rankings

| Model | Custom Time | OpenEnv Time | Avg per Scenario |
|---|:---:|:---:|:---:|
| `gpt-5.4` | 225.1s | 287.4s | ~26s |
| `claude-sonnet-4-6` | 1105.2s | 1048.0s | ~108s |
| `claude-opus-4-20250514` | 1197.4s | 1185.2s | ~119s |
| `claude-opus-4-6` | 1518.9s | 1584.6s | ~155s |
| `gpt-5` | 2967.2s | 3770.8s | ~337s |

GPT-5.4 is roughly 4–5x faster than the next-fastest model (~26s vs ~108s per scenario) while achieving competitive results.

---

## Distractor Tool Usage

Models occasionally used distractor/trap tools, which indicates susceptibility to misleading tool descriptions:

- **`peek_hidden_cell`**: Used by claude-opus-4-6 and claude-sonnet-4-6 (cheating tool; gives hidden info but is penalized)
- **`undo_last_action`**: Used by claude-sonnet-4-6 (no-op trap)
- **`reset_scenario`**: Used by multiple models (resets game state, wasting steps)
- **`auto_solve`**: Not used by any model (good: the most egregious trap was avoided)

---

## Task Family Analysis

| Task Family | Scenarios | Custom Avg | OpenEnv Avg | Difficulty |
|---|---|:---:|:---:|---|
| Hidden Grid (5) | ambiguous, directional, partial, cascading, safe_zone | -0.06 | 0.29 | Medium-Hard |
| Pattern Memory (2) | flash_fade, delayed_recall | 0.00 | 0.34 | Medium |
| Fog of War (2) | fog_labyrinth, fog_key_hunt | -0.30 | 0.26 | Hard |
| Distractor Search (1) | decoy_minefield | -0.80 | 0.18 | Very Hard |

---

## Files

| Type | Path |
|---|---|
| Custom results (all 5 models) | `results/visual_memory/run_visual_memory_custom.md` |
| OpenEnv results (all 5 models) | `results/visual_memory/run_visual_memory_openenv.md` |
| Custom trajectories (all 5 models) | `trajectories/visual_memory/run_visual_memory_custom/` |
| OpenEnv trajectories (all 5 models) | `trajectories/visual_memory/run_visual_memory_openenv/` |