# Spreadsheet Gym — Model Comparison

**5 SOTA models × 2 reward modes** evaluated on 12 scenarios.

## Summary

| Model | Custom Avg | OpenEnv Avg | GT Pass Rate | Avg Steps | Time |
|---|:---:|:---:|:---:|:---:|:---:|
| **gpt-5.4** | **0.67** | **0.65** | 10/12 (83%) | 7.8 | 161s |
| claude-opus-4-20250514 | 0.39 | 0.46 | 7/12 (58%) | 33.6 | 1759s |
| claude-sonnet-4-6 | 0.33 | 0.44 | 6/12 (50%) | 18.9 | 895s |
| claude-opus-4-6 | -0.03 | 0.39 | 5/12 (42%) | 41.3 | 1062s |
| gpt-5 | -0.44 | 0.14 | 1/12 (8%) | 8.8 | 1876s |

**Best model:** gpt-5.4 — highest scores on both reward modes, fastest execution, most scenarios solved.

## Per-Scenario Breakdown (Custom Mode)

| Scenario | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|:---:|:---:|:---:|:---:|:---:|
| buggy_template_fix_01 | **0.92** | 0.94 | 0.94 | 0.85 | 0.89 |
| conditional_aggregation_01 | **0.96** | 0.87 | 0.81 | -0.78 | -0.71 |
| conditional_aggregation_02 | **0.91** | -0.68 | -0.66 | 0.85 | -0.68 |
| cross_sheet_lookup_01 | **0.92** | -0.68 | -0.70 | -0.80 | -0.68 |
| cross_sheet_lookup_02 | **0.95** | 0.23 | 0.82 | 0.16 | -0.68 |
| formula_repair_01 | **0.88** | 0.88 | -0.69 | 0.86 | -0.72 |
| formula_repair_02 | **0.93** | 0.89 | 0.91 | 0.91 | 0.92 |
| ledger_reconciliation_01 | -0.66 | 0.28 | -0.68 | **0.83** | -0.68 |
| ledger_reconciliation_02 | **0.92** | 0.22 | -0.66 | 0.13 | -0.80 |
| messy_table_extraction_01 | -0.66 | **0.83** | -0.68 | ERROR | -0.75 |
| range_transformation_01 | **0.98** | 0.86 | 0.91 | 0.83 | -0.68 |
| schedule_grid_fill_01 | **0.96** | -0.68 | -0.68 | 0.88 | -0.69 |

## Per-Scenario Breakdown (OpenEnv Mode)

| Scenario | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|:---:|:---:|:---:|:---:|:---:|
| buggy_template_fix_01 | **0.76** | 0.76 | 0.76 | 0.74 | 0.15 |
| conditional_aggregation_01 | **0.76** | 0.74 | 0.73 | 0.74 | 0.14 |
| conditional_aggregation_02 | **0.76** | 0.14 | 0.14 | 0.74 | 0.14 |
| cross_sheet_lookup_01 | **0.75** | 0.14 | 0.14 | 0.14 | 0.14 |
| cross_sheet_lookup_02 | **0.76** | 0.74 | 0.73 | 0.74 | 0.14 |
| formula_repair_01 | **0.75** | 0.14 | 0.14 | ERROR | 0.15 |
| formula_repair_02 | **0.76** | 0.75 | 0.75 | ERROR | 0.15 |
| ledger_reconciliation_01 | 0.14 | **0.74** | 0.14 | ERROR | 0.14 |
| ledger_reconciliation_02 | **0.75** | 0.14 | 0.14 | 0.13 | 0.14 |
| messy_table_extraction_01 | 0.14 | 0.13 | 0.14 | **0.75** | 0.14 |
| range_transformation_01 | **0.76** | 0.76 | 0.75 | 0.74 | 0.14 |
| schedule_grid_fill_01 | **0.76** | 0.14 | 0.14 | 0.75 | 0.14 |

## Step Count Comparison

| Scenario | max_steps | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| buggy_template_fix_01 | 50 | **10** | 9 | 10 | 26 | 15 |
| conditional_aggregation_01 | 55 | **6** | 10 | 234 | 55 | 9 |
| conditional_aggregation_02 | 55 | **8** | 5 | 4 | 12 | 5 |
| cross_sheet_lookup_01 | 60 | **9** | 6 | 7 | 60 | 5 |
| cross_sheet_lookup_02 | 60 | **7** | 84 | 198 | 60 | 6 |
| formula_repair_01 | 40 | **12** | 15 | 7 | 27 | 12 |
| formula_repair_02 | 40 | **8** | 15 | 13 | 13 | 11 |
| ledger_reconciliation_01 | 60 | 3 | 7 | 5 | **23** | 5 |
| ledger_reconciliation_02 | 60 | **9** | 18 | 5 | 55 | 21 |
| messy_table_extraction_01 | 55 | 3 | **15** | 4 | 17 | 8 |
| range_transformation_01 | 50 | **5** | 12 | 7 | 39 | 4 |
| schedule_grid_fill_01 | 55 | **8** | 5 | 5 | 26 | 6 |

Bold = model that solved the scenario (GT=1.0) in fewest steps.

## Key Observations

### 1. Step Score Compression (OpenEnv Mode)

All step_score values fall between 0.31–0.41 regardless of agent quality. This is a known issue — the per-step reward magnitudes are too small (0.02–0.50) relative to the normalizer's expected range [-0.5, +1.0]. As a result, the entire ranking comes from ground truth alone.

### 2. Hallucination Penalty Dominance (Custom Mode)

The -1.0 hallucination penalty fires frequently (when all tool calls succeed but ground truth fails).
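A minimal sketch of how such a penalty condition might be wired, assuming hypothetical names (`custom_reward`, the `ok` field); only the -1.0 constant and the trigger condition come from the report:

```python
def custom_reward(tool_results, ground_truth_passed):
    """Hypothetical reward combiner illustrating the hallucination penalty.

    The -1.0 penalty fires when every tool call succeeded (the agent
    looked competent) but the ground-truth check still failed.
    """
    HALLUCINATION_PENALTY = -1.0
    all_tools_ok = all(r.get("ok", False) for r in tool_results)
    if all_tools_ok and not ground_truth_passed:
        return HALLUCINATION_PENALTY
    # Illustrative fallback: reward correctness, neutral otherwise.
    return 1.0 if ground_truth_passed else 0.0
```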
This produces scores in the -0.66 to -0.80 range and drags the average deeply negative for any model that fails a few scenarios — even when it performs well on the rest.

### 3. Efficiency Score Bug (Custom Mode)

The efficiency denominator uses `len(expected_tools)` (tool types, not steps). A scenario with 8 expected tool types but a realistic 20-step solution gives efficiency = 8/20 = 0.40 even for a perfect agent.

### 4. gpt-5 OpenEnv Anomaly

gpt-5 scored 0.14 across ALL scenarios in OpenEnv mode (GT=0.0 for every scenario), yet the same model scored higher on several scenarios in custom mode. This suggests a session or environment issue during the OpenEnv batch run, not a model capability problem.

### 5. Step Count vs Quality

Some models take far more steps than others (opus-4-6: 234 on conditional_aggregation_01, 198 on cross_sheet_lookup_02) yet still achieve GT=1.0, while others fail in just 3–5 steps. The current scoring does not adequately differentiate efficient correct agents from slow correct agents.

## Hardening Assessment

**SOTA average (custom): 0.67** (gpt-5.4) — below the 0.7 threshold. No hardening is required at this time: the scenarios are challenging enough that even the best model fails 2/12.
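To make the efficiency bug from Observation 3 concrete, here is a hedged sketch; the function and variable names are assumptions, not the gym's actual code. Dividing by the number of expected tool *types* caps the score well below 1.0 for any realistic multi-step solution, whereas dividing by a per-scenario step budget does not:

```python
def efficiency_buggy(steps_taken, expected_tools):
    # Bug: denominator counts expected tool *types*, not a step budget,
    # so even a perfect agent is penalized for multi-step solutions.
    return len(expected_tools) / steps_taken

def efficiency_fixed(steps_taken, step_budget):
    # Assumed fix: compare steps taken against a per-scenario budget,
    # clamped so faster-than-budget runs score a full 1.0.
    return min(1.0, step_budget / steps_taken)

# The report's example: 8 expected tool types, realistic 20-step solution.
tools = ["read_range", "write_cell"] * 4  # 8 entries (hypothetical names)
print(efficiency_buggy(20, tools))        # 0.4 even for a perfect agent
```

With the fixed variant, a perfect 20-step run against a 20-step budget scores 1.0, and slower runs degrade gracefully instead of starting from 0.4.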