
# Spreadsheet Gym: Model Comparison

5 SOTA models × 2 reward modes evaluated on 12 scenarios.

## Summary

| Model | Custom Avg | OpenEnv Avg | GT Pass Rate | Avg Steps | Time |
|---|---|---|---|---|---|
| gpt-5.4 | 0.67 | 0.65 | 10/12 (83%) | 7.8 | 161s |
| claude-opus-4-20250514 | 0.39 | 0.46 | 7/12 (58%) | 33.6 | 1759s |
| claude-sonnet-4-6 | 0.33 | 0.44 | 6/12 (50%) | 18.9 | 895s |
| claude-opus-4-6 | -0.03 | 0.39 | 5/12 (42%) | 41.3 | 1062s |
| gpt-5 | -0.44 | 0.14 | 1/12 (8%) | 8.8 | 1876s |

Best model: gpt-5.4, with the highest scores on both reward modes, the fastest execution, and the most scenarios solved.

## Per-Scenario Breakdown (Custom Mode)

| Scenario | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|---|---|---|---|---|
| buggy_template_fix_01 | 0.92 | 0.94 | 0.94 | 0.85 | 0.89 |
| conditional_aggregation_01 | 0.96 | 0.87 | 0.81 | -0.78 | -0.71 |
| conditional_aggregation_02 | 0.91 | -0.68 | -0.66 | 0.85 | -0.68 |
| cross_sheet_lookup_01 | 0.92 | -0.68 | -0.70 | -0.80 | -0.68 |
| cross_sheet_lookup_02 | 0.95 | 0.23 | 0.82 | 0.16 | -0.68 |
| formula_repair_01 | 0.88 | 0.88 | -0.69 | 0.86 | -0.72 |
| formula_repair_02 | 0.93 | 0.89 | 0.91 | 0.91 | 0.92 |
| ledger_reconciliation_01 | -0.66 | 0.28 | -0.68 | 0.83 | -0.68 |
| ledger_reconciliation_02 | 0.92 | 0.22 | -0.66 | 0.13 | -0.80 |
| messy_table_extraction_01 | -0.66 | 0.83 | -0.68 | ERROR | -0.75 |
| range_transformation_01 | 0.98 | 0.86 | 0.91 | 0.83 | -0.68 |
| schedule_grid_fill_01 | 0.96 | -0.68 | -0.68 | 0.88 | -0.69 |

## Per-Scenario Breakdown (OpenEnv Mode)

| Scenario | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|---|---|---|---|---|
| buggy_template_fix_01 | 0.76 | 0.76 | 0.76 | 0.74 | 0.15 |
| conditional_aggregation_01 | 0.76 | 0.74 | 0.73 | 0.74 | 0.14 |
| conditional_aggregation_02 | 0.76 | 0.14 | 0.14 | 0.74 | 0.14 |
| cross_sheet_lookup_01 | 0.75 | 0.14 | 0.14 | 0.14 | 0.14 |
| cross_sheet_lookup_02 | 0.76 | 0.74 | 0.73 | 0.74 | 0.14 |
| formula_repair_01 | 0.75 | 0.14 | 0.14 | ERROR | 0.15 |
| formula_repair_02 | 0.76 | 0.75 | 0.75 | ERROR | 0.15 |
| ledger_reconciliation_01 | 0.14 | 0.74 | 0.14 | ERROR | 0.14 |
| ledger_reconciliation_02 | 0.75 | 0.14 | 0.14 | 0.13 | 0.14 |
| messy_table_extraction_01 | 0.14 | 0.13 | 0.14 | 0.75 | 0.14 |
| range_transformation_01 | 0.76 | 0.76 | 0.75 | 0.74 | 0.14 |
| schedule_grid_fill_01 | 0.76 | 0.14 | 0.14 | 0.75 | 0.14 |

## Step Count Comparison

| Scenario | max_steps | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|---|---|---|---|---|---|
| buggy_template_fix_01 | 50 | 10 | 9 | 10 | 26 | 15 |
| conditional_aggregation_01 | 55 | 6 | 10 | 234 | 55 | 9 |
| conditional_aggregation_02 | 55 | 8 | 5 | 4 | 12 | 5 |
| cross_sheet_lookup_01 | 60 | 9 | 6 | 7 | 60 | 5 |
| cross_sheet_lookup_02 | 60 | 7 | 84 | 198 | 60 | 6 |
| formula_repair_01 | 40 | 12 | 15 | 7 | 27 | 12 |
| formula_repair_02 | 40 | 8 | 15 | 13 | 13 | 11 |
| ledger_reconciliation_01 | 60 | 3 | 7 | 5 | 23 | 5 |
| ledger_reconciliation_02 | 60 | 9 | 18 | 5 | 55 | 21 |
| messy_table_extraction_01 | 55 | 3 | 15 | 4 | 17 | 8 |
| range_transformation_01 | 50 | 5 | 12 | 7 | 39 | 4 |
| schedule_grid_fill_01 | 55 | 8 | 5 | 5 | 26 | 6 |

Bold = the model that solved the scenario (GT=1.0) in the fewest steps.

## Key Observations

### 1. Step Score Compression (OpenEnv Mode)

All `step_score` values fall between 0.31 and 0.41 regardless of agent quality. This is a known issue: the per-step reward magnitudes are too small (0.02–0.50) relative to the normalizer's expected range of [-0.5, +1.0], so the entire ranking comes from the ground-truth component alone.
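
The mechanism can be sketched in a few lines. The function name, reward values, and normalization range below are illustrative reconstructions from the numbers above, not the harness's actual code:

```python
def normalize_step_score(rewards: list[float], lo: float = -0.5, hi: float = 1.0) -> float:
    """Map the mean per-step reward into [0, 1] against a fixed expected range."""
    mean_reward = sum(rewards) / len(rewards)
    return (mean_reward - lo) / (hi - lo)

# Per-step rewards in the 0.02-0.50 range barely move the normalized score:
sloppy = normalize_step_score([0.02] * 10)            # ~0.35
sharp = normalize_step_score([0.02] * 8 + [0.5] * 2)  # ~0.41
```

Both runs land inside the observed 0.31–0.41 band; widening the per-step reward magnitudes (or narrowing the normalization range) would restore the spread.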

### 2. Hallucination Penalty Dominance (Custom Mode)

The -1.0 hallucination penalty fires whenever all tool calls succeed but the ground-truth check fails, and it fires frequently. This produces scores in the -0.66 to -0.80 range and drags the average for a model that fails even a few scenarios deeply negative, regardless of how well it performs elsewhere.
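
As an illustration of the dominance effect (the 0.6/0.25/0.15 weights below are made-up placeholders for the custom-mode blend, not the actual values):

```python
def custom_score(gt: float, tool_success: float, efficiency: float) -> float:
    """Illustrative blend of reward components; weights are assumptions."""
    score = 0.6 * gt + 0.25 * tool_success + 0.15 * efficiency
    if tool_success == 1.0 and gt == 0.0:
        # Hallucination penalty: every tool call succeeded, output is wrong.
        score -= 1.0
    return score

# Perfect tool calls but failed ground truth: the flat penalty swamps
# everything the partial-credit terms can contribute.
confident_but_wrong = custom_score(gt=0.0, tool_success=1.0, efficiency=0.4)  # ~ -0.69
solved = custom_score(gt=1.0, tool_success=1.0, efficiency=0.4)               # ~ 0.91
```

Under any weighting of this shape, a run that fails ground truth with clean tool calls scores far below a run that fails ground truth with visible tool errors, which is what makes the custom-mode averages so volatile.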

### 3. Efficiency Score Bug (Custom Mode)

The efficiency score uses `len(expected_tools)` (the number of distinct tool types, not an expected step count) as its baseline. A scenario with 8 expected tool types but a realistic 20-step solution yields `efficiency = 8/20 = 0.40` even for a perfect agent.
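
A minimal reproduction of the bug, with function and variable names assumed for illustration:

```python
def efficiency_buggy(expected_tools: list[str], steps_taken: int) -> float:
    # Bug: the baseline is the count of expected tool *types*, so any
    # realistic multi-step solution is capped well below 1.0.
    return min(1.0, len(expected_tools) / steps_taken)

def efficiency_fixed(expected_steps: int, steps_taken: int) -> float:
    # Fix: compare against a per-scenario expected step budget instead.
    return min(1.0, expected_steps / steps_taken)

tools = [f"tool_{i}" for i in range(8)]  # 8 expected tool types (placeholder names)
capped = efficiency_buggy(tools, 20)     # 0.40, even for a perfect 20-step run
full = efficiency_fixed(20, 20)          # 1.0 once the baseline is a step budget
```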

### 4. gpt-5 OpenEnv Anomaly

gpt-5 scored 0.14 across all scenarios in OpenEnv mode (GT=0.0 on every one), yet the same model scored higher on several scenarios in custom mode. This points to a session or environment issue during the OpenEnv batch run rather than a model capability problem.

### 5. Step Count vs Quality

Some models take far more steps than budgeted (opus-4-6: 234 on conditional_aggregation_01, 198 on cross_sheet_lookup_02) yet still achieve GT=1.0, while others fail in just 3–5 steps. The current scoring does not adequately separate efficient correct agents from slow correct agents.
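
One possible shape for such a scorer, offered as a suggestion rather than anything in the current harness: reserve the top half of the score band for solved runs and scale it by how much of the step budget was left unused.

```python
def step_discounted_score(gt: float, steps: int, max_steps: int) -> float:
    """Solved runs score in [0.5, 1.0], scaled by unused step budget;
    unsolved runs keep their plain GT score."""
    if gt < 1.0:
        return gt
    frac_used = min(steps, max_steps) / max_steps
    return 0.5 + 0.5 * (1.0 - frac_used)

# A 7-step solve of cross_sheet_lookup_02 (max_steps=60) vs a 198-step one:
fast = step_discounted_score(1.0, 7, 60)    # ~0.94
slow = step_discounted_score(1.0, 198, 60)  # 0.5, budget exhausted
```

This keeps every solved run above every unsolved run while still rewarding the 6-step solves over the 200-step ones.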

## Hardening Assessment

SOTA average (custom): 0.67 (gpt-5.4), below the 0.7 hardening threshold.

No hardening is required at this time. The scenarios are challenging enough as-is: even the best model fails 2/12 of them.