# Spreadsheet Gym: Model Comparison
**5 SOTA models × 2 reward modes** evaluated on 12 scenarios.
## Summary
| Model | Custom Avg | OpenEnv Avg | GT Pass Rate | Avg Steps | Time |
|---|:---:|:---:|:---:|:---:|:---:|
| **gpt-5.4** | **0.67** | **0.65** | 10/12 (83%) | 7.8 | 161s |
| claude-opus-4-20250514 | 0.39 | 0.46 | 7/12 (58%) | 33.6 | 1759s |
| claude-sonnet-4-6 | 0.33 | 0.44 | 6/12 (50%) | 18.9 | 895s |
| claude-opus-4-6 | -0.03 | 0.39 | 5/12 (42%) | 41.3 | 1062s |
| gpt-5 | -0.44 | 0.14 | 1/12 (8%) | 8.8 | 1876s |
**Best model:** gpt-5.4, with the highest scores in both reward modes, the fastest execution, and the most scenarios solved.
## Per-Scenario Breakdown (Custom Mode)
| Scenario | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|:---:|:---:|:---:|:---:|:---:|
| buggy_template_fix_01 | **0.92** | 0.94 | 0.94 | 0.85 | 0.89 |
| conditional_aggregation_01 | **0.96** | 0.87 | 0.81 | -0.78 | -0.71 |
| conditional_aggregation_02 | **0.91** | -0.68 | -0.66 | 0.85 | -0.68 |
| cross_sheet_lookup_01 | **0.92** | -0.68 | -0.70 | -0.80 | -0.68 |
| cross_sheet_lookup_02 | **0.95** | 0.23 | 0.82 | 0.16 | -0.68 |
| formula_repair_01 | **0.88** | 0.88 | -0.69 | 0.86 | -0.72 |
| formula_repair_02 | **0.93** | 0.89 | 0.91 | 0.91 | 0.92 |
| ledger_reconciliation_01 | -0.66 | 0.28 | -0.68 | **0.83** | -0.68 |
| ledger_reconciliation_02 | **0.92** | 0.22 | -0.66 | 0.13 | -0.80 |
| messy_table_extraction_01 | -0.66 | **0.83** | -0.68 | ERROR | -0.75 |
| range_transformation_01 | **0.98** | 0.86 | 0.91 | 0.83 | -0.68 |
| schedule_grid_fill_01 | **0.96** | -0.68 | -0.68 | 0.88 | -0.69 |
## Per-Scenario Breakdown (OpenEnv Mode)
| Scenario | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|:---:|:---:|:---:|:---:|:---:|
| buggy_template_fix_01 | **0.76** | 0.76 | 0.76 | 0.74 | 0.15 |
| conditional_aggregation_01 | **0.76** | 0.74 | 0.73 | 0.74 | 0.14 |
| conditional_aggregation_02 | **0.76** | 0.14 | 0.14 | 0.74 | 0.14 |
| cross_sheet_lookup_01 | **0.75** | 0.14 | 0.14 | 0.14 | 0.14 |
| cross_sheet_lookup_02 | **0.76** | 0.74 | 0.73 | 0.74 | 0.14 |
| formula_repair_01 | **0.75** | 0.14 | 0.14 | ERROR | 0.15 |
| formula_repair_02 | **0.76** | 0.75 | 0.75 | ERROR | 0.15 |
| ledger_reconciliation_01 | 0.14 | **0.74** | 0.14 | ERROR | 0.14 |
| ledger_reconciliation_02 | **0.75** | 0.14 | 0.14 | 0.13 | 0.14 |
| messy_table_extraction_01 | 0.14 | 0.13 | 0.14 | **0.75** | 0.14 |
| range_transformation_01 | **0.76** | 0.76 | 0.75 | 0.74 | 0.14 |
| schedule_grid_fill_01 | **0.76** | 0.14 | 0.14 | 0.75 | 0.14 |
## Step Count Comparison
| Scenario | max_steps | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| buggy_template_fix_01 | 50 | **10** | 9 | 10 | 26 | 15 |
| conditional_aggregation_01 | 55 | **6** | 10 | 234 | 55 | 9 |
| conditional_aggregation_02 | 55 | **8** | 5 | 4 | 12 | 5 |
| cross_sheet_lookup_01 | 60 | **9** | 6 | 7 | 60 | 5 |
| cross_sheet_lookup_02 | 60 | **7** | 84 | 198 | 60 | 6 |
| formula_repair_01 | 40 | **12** | 15 | 7 | 27 | 12 |
| formula_repair_02 | 40 | **8** | 15 | 13 | 13 | 11 |
| ledger_reconciliation_01 | 60 | 3 | 7 | 5 | **23** | 5 |
| ledger_reconciliation_02 | 60 | **9** | 18 | 5 | 55 | 21 |
| messy_table_extraction_01 | 55 | 3 | **15** | 4 | 17 | 8 |
| range_transformation_01 | 50 | **5** | 12 | 7 | 39 | 4 |
| schedule_grid_fill_01 | 55 | **8** | 5 | 5 | 26 | 6 |
Bold = the model that solved the scenario (GT=1.0) in the fewest steps.
## Key Observations
### 1. Step Score Compression (OpenEnv Mode)
All step_score values fall between 0.31 and 0.41 regardless of agent quality. This is a known issue: the per-step reward magnitudes are too small (0.02 to 0.50) relative to the normalizer's expected range [-0.5, +1.0]. The entire ranking comes from ground truth alone.
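A minimal sketch of the compression effect, assuming the normalizer linearly maps raw per-step reward from [-0.5, +1.0] into [0, 1] (the range and the linear mapping are assumptions based on the description above, not the scorer's actual code):

```python
def normalize(reward, lo=-0.5, hi=1.0):
    """Linearly map a raw per-step reward from [lo, hi] to [0, 1]."""
    return (reward - lo) / (hi - lo)

# Typical small per-step rewards occupy only a narrow band of the output:
print(round(normalize(0.02), 3))  # ~0.347
print(round(normalize(0.10), 3))  # ~0.400
```

With most steps earning rewards near the low end of 0.02-0.50, the normalized values cluster in roughly the 0.31-0.41 band observed, so step_score contributes almost nothing to the ranking.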
### 2. Hallucination Penalty Dominance (Custom Mode)
The -1.0 hallucination penalty fires frequently (when all tool calls succeed but the ground-truth check fails). This produces scores of -0.66 to -0.80, so models that fail even a few scenarios end up with deeply negative averages despite performing well elsewhere.
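An illustrative reconstruction of why failing runs land in that range, assuming the custom score is a partial-credit sum with the flat -1.0 penalty applied on top (the function name, the penalty trigger, and the partial-credit range are hypothetical, chosen to match the observed scores):

```python
def custom_score(partial_credit, tools_ok, gt_pass):
    """Hypothetical sketch: partial_credit in [0, 1] comes from the
    non-GT components; a flat -1.0 fires when every tool call succeeded
    but ground truth still failed (treated as hallucinated success)."""
    score = partial_credit
    if tools_ok and not gt_pass:
        score -= 1.0
    return score

# Partial credit of 0.20-0.34 minus the penalty gives the observed band:
print(custom_score(0.34, tools_ok=True, gt_pass=False))  # -0.66
print(custom_score(0.20, tools_ok=True, gt_pass=False))  # -0.80
```

Because the penalty's magnitude exceeds any achievable partial credit, a single hallucinated failure can erase the credit earned across several solved scenarios.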
### 3. Efficiency Score Bug (Custom Mode)
The efficiency denominator uses `len(expected_tools)` (tool types, not steps). A scenario with 8 expected tool types but a realistic 20-step solution gives efficiency = 8/20 = 0.40 even for a perfect agent.
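A sketch of the bug and one possible fix, assuming efficiency is computed as an expected count divided by steps taken (the function names and the step-budget fix are assumptions, not the current implementation):

```python
def efficiency_buggy(expected_tools, steps_taken):
    # Bug: the numerator counts expected tool *types*, so any solution
    # longer than the type count is penalized even when it is optimal.
    return min(1.0, len(expected_tools) / steps_taken)

def efficiency_fixed(expected_steps, steps_taken):
    # Fix sketch: compare against an expected *step budget* instead.
    return min(1.0, expected_steps / steps_taken)

tools = [f"tool_{i}" for i in range(8)]        # 8 expected tool types
print(efficiency_buggy(tools, steps_taken=20))  # 0.4 for a perfect agent
print(efficiency_fixed(20, steps_taken=20))     # 1.0 with a 20-step budget
```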
### 4. gpt-5 OpenEnv Anomaly
gpt-5 scored 0.14 on all 12 scenarios in OpenEnv mode (GT=0.0 for every one), while the same model scored higher on several scenarios in custom mode. This points to a session or environment issue during the OpenEnv batch run rather than a model capability problem.
### 5. Step Count vs Quality
Some models take far more steps than others (opus-4-6: 234 on conditional_aggregation_01, 198 on cross_sheet_lookup_02) yet still achieve GT=1.0, while others fail in just 3-5 steps. The current scoring does not properly differentiate efficient correct agents from slow correct agents.
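One way to separate the two, sketched under the assumption that GT and step counts are available per run (the linear discount and its floor are design choices for illustration, not part of the current scorer; note that runs here can exceed max_steps, so the floor matters):

```python
def step_discounted(gt, steps_taken, max_steps):
    """Scale the ground-truth score by a step discount so a correct
    6-step run outranks a correct 234-step run."""
    if gt <= 0.0:
        return gt  # failures are unaffected
    # Linear discount over twice the budget; the floor keeps any
    # correct-but-slow run above every failure.
    discount = max(0.25, 1.0 - steps_taken / (2 * max_steps))
    return gt * discount

print(step_discounted(1.0, 6, 55))    # fast solve keeps most of its score
print(step_discounted(1.0, 234, 55))  # slow solve is floored at 0.25
```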
## Hardening Assessment
**SOTA average (custom): 0.67** (gpt-5.4), below the 0.7 threshold.
No hardening required at this time. The scenarios are challenging enough: even the best model fails 2/12 scenarios.