# Spreadsheet Gym — Model Comparison

**5 SOTA models × 2 reward modes** evaluated on 12 scenarios.

## Summary

| Model | Custom Avg | OpenEnv Avg | GT Pass Rate | Avg Steps | Time |
|---|:---:|:---:|:---:|:---:|:---:|
| **gpt-5.4** | **0.67** | **0.65** | 10/12 (83%) | 7.8 | 161s |
| claude-opus-4-20250514 | 0.39 | 0.46 | 7/12 (58%) | 33.6 | 1759s |
| claude-sonnet-4-6 | 0.33 | 0.44 | 6/12 (50%) | 18.9 | 895s |
| claude-opus-4-6 | -0.03 | 0.39 | 5/12 (42%) | 41.3 | 1062s |
| gpt-5 | -0.44 | 0.14 | 1/12 (8%) | 8.8 | 1876s |

**Best model:** gpt-5.4 — highest scores on both reward modes, fastest execution, most scenarios solved.

## Per-Scenario Breakdown (Custom Mode)

| Scenario | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|:---:|:---:|:---:|:---:|:---:|
| buggy_template_fix_01 | **0.92** | 0.94 | 0.94 | 0.85 | 0.89 |
| conditional_aggregation_01 | **0.96** | 0.87 | 0.81 | -0.78 | -0.71 |
| conditional_aggregation_02 | **0.91** | -0.68 | -0.66 | 0.85 | -0.68 |
| cross_sheet_lookup_01 | **0.92** | -0.68 | -0.70 | -0.80 | -0.68 |
| cross_sheet_lookup_02 | **0.95** | 0.23 | 0.82 | 0.16 | -0.68 |
| formula_repair_01 | **0.88** | 0.88 | -0.69 | 0.86 | -0.72 |
| formula_repair_02 | **0.93** | 0.89 | 0.91 | 0.91 | 0.92 |
| ledger_reconciliation_01 | -0.66 | 0.28 | -0.68 | **0.83** | -0.68 |
| ledger_reconciliation_02 | **0.92** | 0.22 | -0.66 | 0.13 | -0.80 |
| messy_table_extraction_01 | -0.66 | **0.83** | -0.68 | ERROR | -0.75 |
| range_transformation_01 | **0.98** | 0.86 | 0.91 | 0.83 | -0.68 |
| schedule_grid_fill_01 | **0.96** | -0.68 | -0.68 | 0.88 | -0.69 |

## Per-Scenario Breakdown (OpenEnv Mode)

| Scenario | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|:---:|:---:|:---:|:---:|:---:|
| buggy_template_fix_01 | **0.76** | 0.76 | 0.76 | 0.74 | 0.15 |
| conditional_aggregation_01 | **0.76** | 0.74 | 0.73 | 0.74 | 0.14 |
| conditional_aggregation_02 | **0.76** | 0.14 | 0.14 | 0.74 | 0.14 |
| cross_sheet_lookup_01 | **0.75** | 0.14 | 0.14 | 0.14 | 0.14 |
| cross_sheet_lookup_02 | **0.76** | 0.74 | 0.73 | 0.74 | 0.14 |
| formula_repair_01 | **0.75** | 0.14 | 0.14 | ERROR | 0.15 |
| formula_repair_02 | **0.76** | 0.75 | 0.75 | ERROR | 0.15 |
| ledger_reconciliation_01 | 0.14 | **0.74** | 0.14 | ERROR | 0.14 |
| ledger_reconciliation_02 | **0.75** | 0.14 | 0.14 | 0.13 | 0.14 |
| messy_table_extraction_01 | 0.14 | 0.13 | 0.14 | **0.75** | 0.14 |
| range_transformation_01 | **0.76** | 0.76 | 0.75 | 0.74 | 0.14 |
| schedule_grid_fill_01 | **0.76** | 0.14 | 0.14 | 0.75 | 0.14 |

## Step Count Comparison

| Scenario | max_steps | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| buggy_template_fix_01 | 50 | **10** | 9 | 10 | 26 | 15 |
| conditional_aggregation_01 | 55 | **6** | 10 | 234 | 55 | 9 |
| conditional_aggregation_02 | 55 | **8** | 5 | 4 | 12 | 5 |
| cross_sheet_lookup_01 | 60 | **9** | 6 | 7 | 60 | 5 |
| cross_sheet_lookup_02 | 60 | **7** | 84 | 198 | 60 | 6 |
| formula_repair_01 | 40 | **12** | 15 | 7 | 27 | 12 |
| formula_repair_02 | 40 | **8** | 15 | 13 | 13 | 11 |
| ledger_reconciliation_01 | 60 | 3 | 7 | 5 | **23** | 5 |
| ledger_reconciliation_02 | 60 | **9** | 18 | 5 | 55 | 21 |
| messy_table_extraction_01 | 55 | 3 | **15** | 4 | 17 | 8 |
| range_transformation_01 | 50 | **5** | 12 | 7 | 39 | 4 |
| schedule_grid_fill_01 | 55 | **8** | 5 | 5 | 26 | 6 |

Bold = model that solved the scenario (GT=1.0) in fewest steps.

## Key Observations

### 1. Step Score Compression (OpenEnv Mode)

All step_score values fall between 0.31–0.41 regardless of agent quality. This is a known issue — the per-step reward magnitudes are too small (0.02–0.50) relative to the normalizer's expected range [-0.5, +1.0]. As a result, the entire ranking comes from ground truth alone.

### 2. Hallucination Penalty Dominance (Custom Mode)

The -1.0 hallucination penalty fires frequently (when all tool calls succeed but ground truth fails).
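A minimal sketch of how such a penalty condition might be wired, assuming hypothetical names (`custom_reward`, the `ok` field); only the -1.0 constant and the trigger condition come from the report:

```python
def custom_reward(tool_results, ground_truth_passed):
    """Hypothetical reward combiner illustrating the hallucination penalty.

    The -1.0 penalty fires when every tool call succeeded (the agent
    looked competent) but the ground-truth check still failed.
    """
    HALLUCINATION_PENALTY = -1.0
    all_tools_ok = all(r.get("ok", False) for r in tool_results)
    if all_tools_ok and not ground_truth_passed:
        return HALLUCINATION_PENALTY
    # Illustrative fallback: reward correctness, neutral otherwise.
    return 1.0 if ground_truth_passed else 0.0
```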
This produces scores in the -0.66 to -0.80 range and drags the average deeply negative for any model that fails a few scenarios — even when it performs well on the rest.

### 3. Efficiency Score Bug (Custom Mode)

The efficiency denominator uses `len(expected_tools)` (tool types, not steps). A scenario with 8 expected tool types but a realistic 20-step solution gives efficiency = 8/20 = 0.40 even for a perfect agent.

### 4. gpt-5 OpenEnv Anomaly

gpt-5 scored 0.14 across ALL scenarios in OpenEnv mode (GT=0.0 for every scenario), yet the same model scored higher on several scenarios in custom mode. This suggests a session or environment issue during the OpenEnv batch run, not a model capability problem.

### 5. Step Count vs Quality

Some models take far more steps than others (opus-4-6: 234 on conditional_aggregation_01, 198 on cross_sheet_lookup_02) yet still achieve GT=1.0, while others fail in just 3–5 steps. The current scoring does not adequately differentiate efficient correct agents from slow correct agents.

## Hardening Assessment

**SOTA average (custom): 0.67** (gpt-5.4) — below the 0.7 threshold. No hardening is required at this time: the scenarios are challenging enough that even the best model fails 2/12.
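To make the efficiency bug from Observation 3 concrete, here is a hedged sketch; the function and variable names are assumptions, not the gym's actual code. Dividing by the number of expected tool *types* caps the score well below 1.0 for any realistic multi-step solution, whereas dividing by a per-scenario step budget does not:

```python
def efficiency_buggy(steps_taken, expected_tools):
    # Bug: denominator counts expected tool *types*, not a step budget,
    # so even a perfect agent is penalized for multi-step solutions.
    return len(expected_tools) / steps_taken

def efficiency_fixed(steps_taken, step_budget):
    # Assumed fix: compare steps taken against a per-scenario budget,
    # clamped so faster-than-budget runs score a full 1.0.
    return min(1.0, step_budget / steps_taken)

# The report's example: 8 expected tool types, realistic 20-step solution.
tools = ["read_range", "write_cell"] * 4  # 8 entries (hypothetical names)
print(efficiency_buggy(20, tools))        # 0.4 even for a perfect agent
```

With the fixed variant, a perfect 20-step run against a 20-step budget scores 1.0, and slower runs degrade gracefully instead of starting from 0.4.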