# Spreadsheet Gym: Model Comparison

5 SOTA models × 2 reward modes, evaluated on 12 scenarios.

## Summary
| Model | Custom Avg | OpenEnv Avg | GT Pass Rate | Avg Steps | Time |
|---|---|---|---|---|---|
| gpt-5.4 | 0.67 | 0.65 | 10/12 (83%) | 7.8 | 161s |
| claude-opus-4-20250514 | 0.39 | 0.46 | 7/12 (58%) | 33.6 | 1759s |
| claude-sonnet-4-6 | 0.33 | 0.44 | 6/12 (50%) | 18.9 | 895s |
| claude-opus-4-6 | -0.03 | 0.39 | 5/12 (42%) | 41.3 | 1062s |
| gpt-5 | -0.44 | 0.14 | 1/12 (8%) | 8.8 | 1876s |
Best model: gpt-5.4 – highest scores in both reward modes, fastest execution, and the most scenarios solved.
## Per-Scenario Breakdown (Custom Mode)
| Scenario | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|---|---|---|---|---|
| buggy_template_fix_01 | 0.92 | 0.94 | 0.94 | 0.85 | 0.89 |
| conditional_aggregation_01 | 0.96 | 0.87 | 0.81 | -0.78 | -0.71 |
| conditional_aggregation_02 | 0.91 | -0.68 | -0.66 | 0.85 | -0.68 |
| cross_sheet_lookup_01 | 0.92 | -0.68 | -0.70 | -0.80 | -0.68 |
| cross_sheet_lookup_02 | 0.95 | 0.23 | 0.82 | 0.16 | -0.68 |
| formula_repair_01 | 0.88 | 0.88 | -0.69 | 0.86 | -0.72 |
| formula_repair_02 | 0.93 | 0.89 | 0.91 | 0.91 | 0.92 |
| ledger_reconciliation_01 | -0.66 | 0.28 | -0.68 | 0.83 | -0.68 |
| ledger_reconciliation_02 | 0.92 | 0.22 | -0.66 | 0.13 | -0.80 |
| messy_table_extraction_01 | -0.66 | 0.83 | -0.68 | ERROR | -0.75 |
| range_transformation_01 | 0.98 | 0.86 | 0.91 | 0.83 | -0.68 |
| schedule_grid_fill_01 | 0.96 | -0.68 | -0.68 | 0.88 | -0.69 |
## Per-Scenario Breakdown (OpenEnv Mode)
| Scenario | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|---|---|---|---|---|
| buggy_template_fix_01 | 0.76 | 0.76 | 0.76 | 0.74 | 0.15 |
| conditional_aggregation_01 | 0.76 | 0.74 | 0.73 | 0.74 | 0.14 |
| conditional_aggregation_02 | 0.76 | 0.14 | 0.14 | 0.74 | 0.14 |
| cross_sheet_lookup_01 | 0.75 | 0.14 | 0.14 | 0.14 | 0.14 |
| cross_sheet_lookup_02 | 0.76 | 0.74 | 0.73 | 0.74 | 0.14 |
| formula_repair_01 | 0.75 | 0.14 | 0.14 | ERROR | 0.15 |
| formula_repair_02 | 0.76 | 0.75 | 0.75 | ERROR | 0.15 |
| ledger_reconciliation_01 | 0.14 | 0.74 | 0.14 | ERROR | 0.14 |
| ledger_reconciliation_02 | 0.75 | 0.14 | 0.14 | 0.13 | 0.14 |
| messy_table_extraction_01 | 0.14 | 0.13 | 0.14 | 0.75 | 0.14 |
| range_transformation_01 | 0.76 | 0.76 | 0.75 | 0.74 | 0.14 |
| schedule_grid_fill_01 | 0.76 | 0.14 | 0.14 | 0.75 | 0.14 |
## Step Count Comparison
| Scenario | max_steps | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|---|---|---|---|---|---|
| buggy_template_fix_01 | 50 | 10 | 9 | 10 | 26 | 15 |
| conditional_aggregation_01 | 55 | 6 | 10 | 234 | 55 | 9 |
| conditional_aggregation_02 | 55 | 8 | 5 | 4 | 12 | 5 |
| cross_sheet_lookup_01 | 60 | 9 | 6 | 7 | 60 | 5 |
| cross_sheet_lookup_02 | 60 | 7 | 84 | 198 | 60 | 6 |
| formula_repair_01 | 40 | 12 | 15 | 7 | 27 | 12 |
| formula_repair_02 | 40 | 8 | 15 | 13 | 13 | 11 |
| ledger_reconciliation_01 | 60 | 3 | 7 | 5 | 23 | 5 |
| ledger_reconciliation_02 | 60 | 9 | 18 | 5 | 55 | 21 |
| messy_table_extraction_01 | 55 | 3 | 15 | 4 | 17 | 8 |
| range_transformation_01 | 50 | 5 | 12 | 7 | 39 | 4 |
| schedule_grid_fill_01 | 55 | 8 | 5 | 5 | 26 | 6 |
Bold marks the model that solved the scenario (GT=1.0) in the fewest steps.
## Key Observations

### 1. Step Score Compression (OpenEnv Mode)

All `step_score` values fall between 0.31 and 0.41 regardless of agent quality. This is a known issue: the per-step reward magnitudes (0.02–0.50) are too small relative to the normalizer's expected range [-0.5, +1.0], so the entire ranking comes from the ground-truth component alone.
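Why the band is so narrow can be seen with a minimal sketch of the normalization. The `normalize` function and the linear mapping are assumptions, not the actual harness code:

```python
# Hypothetical linear normalizer: maps raw mean step rewards
# from the expected range [-0.5, +1.0] into [0, 1].
LOW, HIGH = -0.5, 1.0

def normalize(raw: float) -> float:
    """Linearly rescale a raw mean step reward into [0, 1]."""
    return (raw - LOW) / (HIGH - LOW)

# Mean per-step rewards near zero all land close to normalize(0.0) ~= 0.33,
# so agents of very different quality receive nearly identical step scores.
for mean_reward in (-0.03, 0.0, 0.05, 0.12):
    print(round(normalize(mean_reward), 2))  # 0.31, 0.33, 0.37, 0.41
```

With per-step rewards of only 0.02–0.50, even a perfect run cannot escape the middle of the output range; rescaling raw rewards so they actually span [-0.5, +1.0] would restore discrimination between agents.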
### 2. Hallucination Penalty Dominance (Custom Mode)

The -1.0 hallucination penalty fires frequently: it triggers whenever every tool call succeeds but the ground-truth check fails. The resulting scores of -0.66 to -0.80 drag the averages of models that fail even a few scenarios deeply negative, regardless of how well they perform elsewhere.
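A minimal sketch of how a flat penalty produces the repeated -0.68-style totals seen in the custom-mode table. The function name, signature, and base-score value are hypothetical:

```python
def custom_score(tool_calls_ok: bool, ground_truth_pass: bool, base: float) -> float:
    """Hypothetical custom-mode scorer: a flat -1.0 penalty is applied when
    every tool call succeeded but the ground-truth check still failed."""
    if tool_calls_ok and not ground_truth_pass:
        return base - 1.0
    return base

# A modest base score of ~0.32 collapses to ~-0.68 once the penalty fires:
print(round(custom_score(True, False, 0.32), 2))  # -0.68
```

Because the penalty is larger than any achievable positive score, a single hallucinated scenario can outweigh several clean solves in the average.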
### 3. Efficiency Score Bug (Custom Mode)

The efficiency numerator uses `len(expected_tools)`, which counts expected tool types rather than a step budget. A scenario with 8 expected tool types and a realistic 20-step solution therefore yields efficiency = 8/20 = 0.40 even for a perfect agent.
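The bug and one possible fix, sketched below. The function names and the per-scenario step-budget approach are assumptions, not the harness's actual API:

```python
def efficiency_buggy(expected_tools: list, steps_taken: int) -> float:
    # BUG: the numerator counts expected tool *types*, not an expected
    # step budget, so realistic multi-step solutions are penalized
    # even when the agent's work is perfect.
    return len(expected_tools) / steps_taken

def efficiency_fixed(expected_steps: int, steps_taken: int) -> float:
    # Compare against a per-scenario step budget instead, capped at 1.0
    # so finishing under budget cannot inflate the score.
    return min(1.0, expected_steps / steps_taken)

print(efficiency_buggy(["tool"] * 8, 20))  # 0.4 for a perfect 20-step run
print(efficiency_fixed(20, 20))            # 1.0
```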
### 4. gpt-5 OpenEnv Anomaly

gpt-5 scored 0.14 on every scenario in OpenEnv mode (GT=0.0 throughout), while the same model scored higher on several scenarios in custom mode. This points to a session or environment issue during the OpenEnv batch run rather than a model-capability problem.
### 5. Step Count vs. Quality

Some models take far more steps than budgeted (opus-4-6: 234 on conditional_aggregation_01, 198 on cross_sheet_lookup_02) yet still achieve GT=1.0, while others fail in just 3–5 steps. The current scoring does not differentiate efficient correct agents from slow correct agents.
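One way to separate them is to discount a correct solve by the fraction of the step budget it consumed. A sketch, where the 0.5 discount factor is an arbitrary choice and `step_adjusted` is a hypothetical name:

```python
def step_adjusted(gt: float, steps: int, max_steps: int) -> float:
    """Discount a correct solve (GT=1.0) by step-budget usage, so a
    6-step solve outranks a 234-step solve with the same ground truth."""
    if gt < 1.0:
        return gt  # failed runs are left untouched
    used = min(steps, max_steps) / max_steps  # cap overruns at the budget
    return gt * (1.0 - 0.5 * used)

print(round(step_adjusted(1.0, 6, 55), 3))  # 0.945
print(step_adjusted(1.0, 234, 55))          # 0.5
```

Under a scheme like this, opus-4-6's 234-step solve would score the floor of 0.5 while gpt-5.4's 6-step solve stays near 1.0, without changing how failures are scored.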
## Hardening Assessment

SOTA average (custom): 0.67 (gpt-5.4), below the 0.7 threshold.

No hardening required at this time. The scenarios are challenging enough: even the best model fails 2 of 12.
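The gate above amounts to a one-line check. The function name, and the assumption that hardening triggers only once the best model meets or exceeds the threshold, are mine:

```python
def needs_hardening(best_custom_avg: float, threshold: float = 0.7) -> bool:
    """Harden scenarios only once the best model's custom-mode
    average clears the threshold."""
    return best_custom_avg >= threshold

print(needs_hardening(0.67))  # False: gpt-5.4's 0.67 is below the 0.7 gate
```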