Visual Memory Gym — Model Comparison

Date: 2026-03-18
Gym Version: 0.1.0
Scenarios: 10 (across 4 task families)
Models: 5 (3 Anthropic, 2 OpenAI)
Reward Modes: custom (episode-level) and openenv (per-step transforms)

Overall Results

Custom Rewards (episode-level from `rewards/base.py`)

Reward components: Structural (0.25) + Ground Truth (0.60) + Efficiency (0.15), minus a Hallucination Penalty of up to 1.0

#	Model	Avg Reward	Best Scenario	Worst Scenario	Total Time
1	`claude-opus-4-6`	0.08	flash_fade_minefield (0.63)	cascading_deduction (-0.83)	1518.9s
2	`gpt-5`	-0.13	flash_fade_minefield (0.69)	decoy_minefield (-0.78)	2967.2s
3	`gpt-5.4`	-0.16	partial_intel (0.63)	directional_trap (-0.81)	225.1s
4	`claude-opus-4-20250514`	-0.17	ambiguous_cluster (0.60)	cascading_deduction (-0.80)	1197.4s
5	`claude-sonnet-4-6`	-0.33	partial_intel (0.40)	directional_trap (-0.83)	1105.2s
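
As a rough illustration of how the listed weights might combine into one episode score, the sketch below treats each component as a normalized value in [0, 1] and subtracts the penalty from the weighted sum. The `EpisodeScores` container, the `custom_reward` function, and the normalization are assumptions made for illustration; they are not taken from `rewards/base.py`.

```python
from dataclasses import dataclass

# Weights copied from the reward component list in this report.
W_STRUCTURAL = 0.25
W_GROUND_TRUTH = 0.60
W_EFFICIENCY = 0.15
MAX_HALLUCINATION_PENALTY = 1.0


@dataclass
class EpisodeScores:
    """Hypothetical container; every field is assumed to lie in [0, 1]."""
    structural: float
    ground_truth: float
    efficiency: float
    hallucination: float  # 0.0 = no penalty, 1.0 = full penalty


def custom_reward(s: EpisodeScores) -> float:
    """Weighted sum of the positive components minus the scaled penalty."""
    positive = (
        W_STRUCTURAL * s.structural
        + W_GROUND_TRUTH * s.ground_truth
        + W_EFFICIENCY * s.efficiency
    )
    return positive - MAX_HALLUCINATION_PENALTY * s.hallucination


# Made-up inputs, not measured data: a clean episode and a fully penalized one.
clean = EpisodeScores(structural=1.0, ground_truth=0.5, efficiency=0.8, hallucination=0.0)
penalized = EpisodeScores(structural=1.0, ground_truth=0.0, efficiency=0.8, hallucination=1.0)
print(f"{custom_reward(clean):.2f}")
print(f"{custom_reward(penalized):.2f}")
```

Under these assumptions the positive components can contribute at most 1.0, so a full penalty always drives the episode score to zero or below.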

OpenEnv Transform Rewards (per-step from `rewards/transforms/`)

Reward components: Step Rewards (0.40) + Ground Truth (0.60), minus a Hallucination Penalty of up to 1.0

#	Model	Avg Reward	Best Scenario	Worst Scenario	Total Time
1	`claude-opus-4-6`	0.31	directional_trap (0.37)	decoy_minefield (0.14)	1584.6s
2	`claude-opus-4-20250514`	0.31	ambiguous_cluster (0.55)	delayed_recall_keys (0.13)	1185.2s
3	`gpt-5`	0.28	flash_fade_minefield (0.53)	ambiguous_cluster (0.12)	3770.8s
4	`gpt-5.4`	0.27	fog_labyrinth (0.52)	directional_trap (0.08)	287.4s
5	`claude-sonnet-4-6`	0.26	ambiguous_cluster (0.35)	directional_trap (0.14)	1048.0s

Per-Scenario Breakdown — Custom Rewards

Scenario	gpt-5.4	gpt-5	claude-sonnet-4-6	claude-opus-4-6	claude-opus-4-20250514	Avg
ambiguous_cluster_10x10	0.40	-0.75	0.38	0.36	0.60	0.20
directional_trap_8x8	-0.81	0.41	-0.83	0.61	0.42	-0.04
partial_intel_9x9	0.63	0.42	0.40	0.41	0.44	0.46
flash_fade_minefield_7x7	0.42	0.69	-0.78	0.63	-0.76	0.04
delayed_recall_keys_8x8	0.45	0.53	-0.80	0.43	-0.79	-0.04
decoy_minefield_8x10	-0.80	-0.78	-0.81	-0.82	-0.79	-0.80
fog_labyrinth_10x10	-0.74	0.44	0.39	0.40	0.40	0.18
fog_key_hunt_8x8	-0.73	-0.77	-0.79	-0.80	-0.79	-0.78
cascading_deduction_11x11	-0.78	-0.76	-0.79	-0.83	-0.80	-0.79
safe_zone_identification_9x9	0.40	-0.77	0.37	0.38	0.41	0.16
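
The Avg column and the per-model averages can be recomputed from the table above. The following is a minimal sketch that copies the published (already rounded) scores by hand, so the last digit may differ slightly from values computed on raw results. The variable names are illustrative.

```python
from statistics import mean

# Column order matches the per-scenario table.
MODELS = [
    "gpt-5.4",
    "gpt-5",
    "claude-sonnet-4-6",
    "claude-opus-4-6",
    "claude-opus-4-20250514",
]

# Custom-reward scores per scenario, one value per model.
CUSTOM = {
    "ambiguous_cluster_10x10": [0.40, -0.75, 0.38, 0.36, 0.60],
    "directional_trap_8x8": [-0.81, 0.41, -0.83, 0.61, 0.42],
    "partial_intel_9x9": [0.63, 0.42, 0.40, 0.41, 0.44],
    "flash_fade_minefield_7x7": [0.42, 0.69, -0.78, 0.63, -0.76],
    "delayed_recall_keys_8x8": [0.45, 0.53, -0.80, 0.43, -0.79],
    "decoy_minefield_8x10": [-0.80, -0.78, -0.81, -0.82, -0.79],
    "fog_labyrinth_10x10": [-0.74, 0.44, 0.39, 0.40, 0.40],
    "fog_key_hunt_8x8": [-0.73, -0.77, -0.79, -0.80, -0.79],
    "cascading_deduction_11x11": [-0.78, -0.76, -0.79, -0.83, -0.80],
    "safe_zone_identification_9x9": [0.40, -0.77, 0.37, 0.38, 0.41],
}

# Row-wise mean: average over models for each scenario.
scenario_avg = {name: round(mean(row), 2) for name, row in CUSTOM.items()}

# Column-wise mean: average over scenarios for each model.
model_avg = {
    model: round(mean([row[i] for row in CUSTOM.values()]), 2)
    for i, model in enumerate(MODELS)
}

# Models ordered from highest to lowest average reward.
ranking = sorted(model_avg, key=model_avg.get, reverse=True)

for model in ranking:
    print(f"{model:<24} {model_avg[model]:+.2f}")
```

The resulting order matches the ranking in the Custom Rewards summary table.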

Hardest Scenarios (Custom)

decoy_minefield_8x10 (avg -0.80): All 5 models fail — hallucination penalty triggered universally
cascading_deduction_11x11 (avg -0.79): Large board with partial signals defeats all models
fog_key_hunt_8x8 (avg -0.78): Tiny viewport + fatal hazards — no model survives

Easiest Scenario (Custom)

partial_intel_9x9 (avg 0.46): All five models achieve positive rewards here (0.40 to 0.63)

Per-Scenario Breakdown — OpenEnv Transform Rewards

Scenario	gpt-5.4	gpt-5	claude-sonnet-4-6	claude-opus-4-6	claude-opus-4-20250514	Avg
ambiguous_cluster_10x10	0.33	0.12	0.35	0.35	0.55	0.34
directional_trap_8x8	0.08	0.36	0.14	0.37	0.36	0.26
partial_intel_9x9	0.32	0.34	0.34	0.36	0.36	0.34
flash_fade_minefield_7x7	0.35	0.53	0.34	0.34	0.35	0.38
delayed_recall_keys_8x8	0.34	0.34	0.34	0.36	0.13	0.30
decoy_minefield_8x10	0.14	0.14	0.14	0.14	0.34	0.18
fog_labyrinth_10x10	0.52	0.34	0.34	0.35	0.35	0.38
fog_key_hunt_8x8	0.13	0.13	0.15	0.15	0.15	0.14
cascading_deduction_11x11	0.13	0.14	0.15	0.35	0.14	0.18
safe_zone_identification_9x9	0.34	0.34	0.34	0.34	0.35	0.34

Hardest Scenarios (OpenEnv)

fog_key_hunt_8x8 (avg 0.14): Tiny viewport + fatal hazards — universally low
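decoy_minefield_8x10 (avg 0.18): Decoy-key confusion trips four of five models (0.14 each); only claude-opus-4-20250514 reaches 0.34
cascading_deduction_11x11 (avg 0.18): Large partial-signal board overwhelms reasoning for every model except claude-opus-4-6 (0.35)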

Easiest Scenarios (OpenEnv)

flash_fade_minefield_7x7 (avg 0.38): Pattern memory; gpt-5 stands out at 0.53 while the other models score 0.34–0.35
fog_labyrinth_10x10 (avg 0.38): Fog navigation with reasonable viewport

SOTA Average Assessment

SOTA models (gpt-5.4, claude-sonnet-4-6, claude-opus-4-6):

Reward Mode	SOTA Average	Target Band	Status
Custom	-0.14	0.60–0.70	Well below target — no hardening needed
OpenEnv	0.28	0.60–0.70	Below target — no hardening needed

All 5 models average:

Reward Mode	Overall Average
Custom	-0.14
OpenEnv	0.28

The gym is currently harder than target across both reward modes. No hardening adjustments are required.
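
The SOTA averages and their status against the target band can be reproduced from the per-model averages reported earlier. This is a small sketch with illustrative names (`band_status`, `MODEL_AVG`); the inputs are the rounded table values, not raw run data.

```python
from statistics import mean

TARGET_BAND = (0.60, 0.70)
SOTA_MODELS = ("gpt-5.4", "claude-sonnet-4-6", "claude-opus-4-6")

# Per-model average rewards copied from the two summary tables.
MODEL_AVG = {
    "custom": {
        "claude-opus-4-6": 0.08,
        "gpt-5": -0.13,
        "gpt-5.4": -0.16,
        "claude-opus-4-20250514": -0.17,
        "claude-sonnet-4-6": -0.33,
    },
    "openenv": {
        "claude-opus-4-6": 0.31,
        "claude-opus-4-20250514": 0.31,
        "gpt-5": 0.28,
        "gpt-5.4": 0.27,
        "claude-sonnet-4-6": 0.26,
    },
}


def band_status(value: float, band: tuple = TARGET_BAND) -> str:
    """Classify an average reward relative to the target band."""
    low, high = band
    if value < low:
        return "below target"
    if value > high:
        return "above target"
    return "within target"


sota_avg = {
    mode: round(mean([avgs[m] for m in SOTA_MODELS]), 2)
    for mode, avgs in MODEL_AVG.items()
}

for mode, value in sota_avg.items():
    print(f"{mode:<8} {value:+.2f}  {band_status(value)}")
```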

Reward Mode Comparison

Metric	Custom	OpenEnv
Mean across all models	-0.14	0.28
Std deviation (models)	0.15	0.02
Min model avg	-0.33	0.26
Max model avg	0.08	0.31
Hallucination penalties hit	Frequent (-1.0)	None triggered
Reward spread	Very high (variance from penalties)	Compressed (narrow 0.08–0.55 range)

Key insight: Custom rewards produce highly volatile scores driven by the -1.0 hallucination penalty. A single incorrect assertion (tools report success but ground truth disagrees) is enough to collapse the entire scenario score, which is why custom scores split into two clusters: 0.36 to 0.69 for clean runs and -0.83 to -0.73 for penalized ones. OpenEnv transform rewards are more granular and forgiving, rewarding incremental progress at each step.
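
The spread statistics in the table above can be checked with Python's standard library. The sketch below uses the rounded per-model averages; the reported figures are consistent with the sample standard deviation (`statistics.stdev`, n - 1 in the denominator) rather than the population form.

```python
from statistics import mean, stdev

# Per-model average rewards, copied from the summary tables.
custom = [0.08, -0.13, -0.16, -0.17, -0.33]
openenv = [0.31, 0.31, 0.28, 0.27, 0.26]


def summarize(values):
    """Return mean, sample standard deviation, min, and max of the values."""
    return {
        "mean": mean(values),
        "std": stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


custom_stats = summarize(custom)
openenv_stats = summarize(openenv)

for label, stats in (("custom", custom_stats), ("openenv", openenv_stats)):
    print(label, {key: round(val, 2) for key, val in stats.items()})
```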

Model Speed Rankings

Model	Custom Time	OpenEnv Time	Avg per Scenario
`gpt-5.4`	225.1s	287.4s	~26s
`claude-sonnet-4-6`	1105.2s	1048.0s	~108s
`claude-opus-4-20250514`	1197.4s	1185.2s	~119s
`claude-opus-4-6`	1518.9s	1584.6s	~155s
`gpt-5`	2967.2s	3770.8s	~337s

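GPT-5.4 is roughly 4x faster per scenario than the next-fastest model (`claude-sonnet-4-6`, ~108s vs ~26s) while placing mid-table on reward (3rd of 5 on custom, 4th of 5 on OpenEnv).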
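
The per-scenario timings and the gap between the two fastest models can be recomputed from the totals in the table. This sketch assumes each total covers one full run of the 10 scenarios, so the per-scenario figure is the combined time divided by 20; the variable names are illustrative.

```python
SCENARIOS_PER_RUN = 10

# (custom total, OpenEnv total) in seconds, copied from the speed table.
TIMES = {
    "gpt-5.4": (225.1, 287.4),
    "claude-sonnet-4-6": (1105.2, 1048.0),
    "claude-opus-4-20250514": (1197.4, 1185.2),
    "claude-opus-4-6": (1518.9, 1584.6),
    "gpt-5": (2967.2, 3770.8),
}

# Average seconds per scenario across both reward-mode runs.
per_scenario = {
    model: (custom_s + openenv_s) / (2 * SCENARIOS_PER_RUN)
    for model, (custom_s, openenv_s) in TIMES.items()
}

# The two fastest models and the ratio between them.
fastest, runner_up = sorted(per_scenario, key=per_scenario.get)[:2]
speedup = per_scenario[runner_up] / per_scenario[fastest]

for model, seconds in sorted(per_scenario.items(), key=lambda kv: kv[1]):
    print(f"{model:<24} ~{seconds:.0f}s")
print(f"{fastest} vs {runner_up}: {speedup:.1f}x")
```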

Distractor Tool Usage

Models occasionally used distractor/trap tools, which indicates susceptibility to misleading tool descriptions:

peek_hidden_cell: Used by claude-opus-4-6 and claude-sonnet-4-6 (cheating tool — gives hidden info but penalized)
undo_last_action: Used by claude-sonnet-4-6 (no-op trap)
reset_scenario: Used by multiple models (resets game state — wastes steps)
auto_solve: Not used by any model (good — most egregious trap avoided)

Task Family Analysis

Task Family	Scenarios	Custom Avg	OpenEnv Avg	Difficulty
Hidden Grid (5)	ambiguous, directional, partial, cascading, safe_zone	0.00	0.29	Medium-Hard
Pattern Memory (2)	flash_fade, delayed_recall	0.00	0.34	Medium
Fog of War (2)	fog_labyrinth, fog_key_hunt	-0.30	0.26	Hard
Distractor Search (1)	decoy_minefield	-0.80	0.18	Very Hard
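
The family-level averages can be derived by grouping the per-scenario Avg columns by task family. The sketch below uses the rounded scenario averages from the two breakdown tables, so results may differ in the last digit from averages computed on raw scores. The family membership follows the Scenarios column above; the data structures are illustrative.

```python
from statistics import mean

# Scenario -> (custom avg, OpenEnv avg), copied from the breakdown tables.
SCENARIO_AVG = {
    "ambiguous_cluster_10x10": (0.20, 0.34),
    "directional_trap_8x8": (-0.04, 0.26),
    "partial_intel_9x9": (0.46, 0.34),
    "cascading_deduction_11x11": (-0.79, 0.18),
    "safe_zone_identification_9x9": (0.16, 0.34),
    "flash_fade_minefield_7x7": (0.04, 0.38),
    "delayed_recall_keys_8x8": (-0.04, 0.30),
    "fog_labyrinth_10x10": (0.18, 0.38),
    "fog_key_hunt_8x8": (-0.78, 0.14),
    "decoy_minefield_8x10": (-0.80, 0.18),
}

FAMILIES = {
    "Hidden Grid": [
        "ambiguous_cluster_10x10",
        "directional_trap_8x8",
        "partial_intel_9x9",
        "cascading_deduction_11x11",
        "safe_zone_identification_9x9",
    ],
    "Pattern Memory": ["flash_fade_minefield_7x7", "delayed_recall_keys_8x8"],
    "Fog of War": ["fog_labyrinth_10x10", "fog_key_hunt_8x8"],
    "Distractor Search": ["decoy_minefield_8x10"],
}

# Family -> (custom avg, OpenEnv avg), unweighted mean over member scenarios.
family_avg = {
    family: (
        mean([SCENARIO_AVG[s][0] for s in scenarios]),
        mean([SCENARIO_AVG[s][1] for s in scenarios]),
    )
    for family, scenarios in FAMILIES.items()
}

for family, (custom, openenv) in family_avg.items():
    print(f"{family:<18} custom={custom:.2f} openenv={openenv:.2f}")
```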

Files

Type	Path
Custom results (all 5 models)	`results/visual_memory/run_visual_memory_custom.md`
OpenEnv results (all 5 models)	`results/visual_memory/run_visual_memory_openenv.md`
Custom trajectories (all 5 models)	`trajectories/visual_memory/run_visual_memory_custom/`
OpenEnv trajectories (all 5 models)	`trajectories/visual_memory/run_visual_memory_openenv/`