
# Spreadsheet Gym: Model Comparison

5 SOTA models × 2 reward modes evaluated on 12 scenarios.

## Summary

| Model | Custom Avg | OpenEnv Avg | GT Pass Rate | Avg Steps | Time |
|---|---|---|---|---|---|
| gpt-5.4 | 0.67 | 0.65 | 10/12 (83%) | 7.8 | 161s |
| claude-opus-4-20250514 | 0.39 | 0.46 | 7/12 (58%) | 33.6 | 1759s |
| claude-sonnet-4-6 | 0.33 | 0.44 | 6/12 (50%) | 18.9 | 895s |
| claude-opus-4-6 | -0.03 | 0.39 | 5/12 (42%) | 41.3 | 1062s |
| gpt-5 | -0.44 | 0.14 | 1/12 (8%) | 8.8 | 1876s |

Best model: gpt-5.4, with the highest scores on both reward modes, the fastest execution, and the most scenarios solved.

## Per-Scenario Breakdown (Custom Mode)

| Scenario | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|---|---|---|---|---|
| buggy_template_fix_01 | 0.92 | 0.94 | 0.94 | 0.85 | 0.89 |
| conditional_aggregation_01 | 0.96 | 0.87 | 0.81 | -0.78 | -0.71 |
| conditional_aggregation_02 | 0.91 | -0.68 | -0.66 | 0.85 | -0.68 |
| cross_sheet_lookup_01 | 0.92 | -0.68 | -0.70 | -0.80 | -0.68 |
| cross_sheet_lookup_02 | 0.95 | 0.23 | 0.82 | 0.16 | -0.68 |
| formula_repair_01 | 0.88 | 0.88 | -0.69 | 0.86 | -0.72 |
| formula_repair_02 | 0.93 | 0.89 | 0.91 | 0.91 | 0.92 |
| ledger_reconciliation_01 | -0.66 | 0.28 | -0.68 | 0.83 | -0.68 |
| ledger_reconciliation_02 | 0.92 | 0.22 | -0.66 | 0.13 | -0.80 |
| messy_table_extraction_01 | -0.66 | 0.83 | -0.68 | ERROR | -0.75 |
| range_transformation_01 | 0.98 | 0.86 | 0.91 | 0.83 | -0.68 |
| schedule_grid_fill_01 | 0.96 | -0.68 | -0.68 | 0.88 | -0.69 |

## Per-Scenario Breakdown (OpenEnv Mode)

| Scenario | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|---|---|---|---|---|
| buggy_template_fix_01 | 0.76 | 0.76 | 0.76 | 0.74 | 0.15 |
| conditional_aggregation_01 | 0.76 | 0.74 | 0.73 | 0.74 | 0.14 |
| conditional_aggregation_02 | 0.76 | 0.14 | 0.14 | 0.74 | 0.14 |
| cross_sheet_lookup_01 | 0.75 | 0.14 | 0.14 | 0.14 | 0.14 |
| cross_sheet_lookup_02 | 0.76 | 0.74 | 0.73 | 0.74 | 0.14 |
| formula_repair_01 | 0.75 | 0.14 | 0.14 | ERROR | 0.15 |
| formula_repair_02 | 0.76 | 0.75 | 0.75 | ERROR | 0.15 |
| ledger_reconciliation_01 | 0.14 | 0.74 | 0.14 | ERROR | 0.14 |
| ledger_reconciliation_02 | 0.75 | 0.14 | 0.14 | 0.13 | 0.14 |
| messy_table_extraction_01 | 0.14 | 0.13 | 0.14 | 0.75 | 0.14 |
| range_transformation_01 | 0.76 | 0.76 | 0.75 | 0.74 | 0.14 |
| schedule_grid_fill_01 | 0.76 | 0.14 | 0.14 | 0.75 | 0.14 |

## Step Count Comparison

| Scenario | max_steps | gpt-5.4 | sonnet-4-6 | opus-4-6 | opus-0514 | gpt-5 |
|---|---|---|---|---|---|---|
| buggy_template_fix_01 | 50 | 10 | 9 | 10 | 26 | 15 |
| conditional_aggregation_01 | 55 | 6 | 10 | 234 | 55 | 9 |
| conditional_aggregation_02 | 55 | 8 | 5 | 4 | 12 | 5 |
| cross_sheet_lookup_01 | 60 | 9 | 6 | 7 | 60 | 5 |
| cross_sheet_lookup_02 | 60 | 7 | 84 | 198 | 60 | 6 |
| formula_repair_01 | 40 | 12 | 15 | 7 | 27 | 12 |
| formula_repair_02 | 40 | 8 | 15 | 13 | 13 | 11 |
| ledger_reconciliation_01 | 60 | 3 | 7 | 5 | 23 | 5 |
| ledger_reconciliation_02 | 60 | 9 | 18 | 5 | 55 | 21 |
| messy_table_extraction_01 | 55 | 3 | 15 | 4 | 17 | 8 |
| range_transformation_01 | 50 | 5 | 12 | 7 | 39 | 4 |
| schedule_grid_fill_01 | 55 | 8 | 5 | 5 | 26 | 6 |

Bold = the model that solved the scenario (GT=1.0) in the fewest steps.

## Key Observations

### 1. Step Score Compression (OpenEnv Mode)

All `step_score` values fall between 0.31 and 0.41 regardless of agent quality. This is a known issue: the per-step reward magnitudes are too small (0.02–0.50) relative to the normalizer's expected range of [-0.5, +1.0], so the entire ranking comes from the ground-truth component alone.
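
The mechanism can be sketched in a few lines. The function name, reward values, and normalization range below are illustrative reconstructions from the numbers above, not the harness's actual code:

```python
def normalize_step_score(rewards: list[float], lo: float = -0.5, hi: float = 1.0) -> float:
    """Map the mean per-step reward into [0, 1] against a fixed expected range."""
    mean_reward = sum(rewards) / len(rewards)
    return (mean_reward - lo) / (hi - lo)

# Per-step rewards in the 0.02-0.50 range barely move the normalized score:
sloppy = normalize_step_score([0.02] * 10)            # ~0.35
sharp = normalize_step_score([0.02] * 8 + [0.5] * 2)  # ~0.41
```

Both runs land inside the observed 0.31–0.41 band; widening the per-step reward magnitudes (or narrowing the normalization range) would restore the spread.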

### 2. Hallucination Penalty Dominance (Custom Mode)

The -1.0 hallucination penalty fires whenever all tool calls succeed but the ground-truth check fails, and it fires frequently. This produces scores in the -0.66 to -0.80 range and drags the average for a model that fails even a few scenarios deeply negative, regardless of how well it performs elsewhere.
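
As an illustration of the dominance effect (the 0.6/0.25/0.15 weights below are made-up placeholders for the custom-mode blend, not the actual values):

```python
def custom_score(gt: float, tool_success: float, efficiency: float) -> float:
    """Illustrative blend of reward components; weights are assumptions."""
    score = 0.6 * gt + 0.25 * tool_success + 0.15 * efficiency
    if tool_success == 1.0 and gt == 0.0:
        # Hallucination penalty: every tool call succeeded, output is wrong.
        score -= 1.0
    return score

# Perfect tool calls but failed ground truth: the flat penalty swamps
# everything the partial-credit terms can contribute.
confident_but_wrong = custom_score(gt=0.0, tool_success=1.0, efficiency=0.4)  # ~ -0.69
solved = custom_score(gt=1.0, tool_success=1.0, efficiency=0.4)               # ~ 0.91
```

Under any weighting of this shape, a run that fails ground truth with clean tool calls scores far below a run that fails ground truth with visible tool errors, which is what makes the custom-mode averages so volatile.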

### 3. Efficiency Score Bug (Custom Mode)

The efficiency score uses `len(expected_tools)` (the number of distinct tool types, not an expected step count) as its baseline. A scenario with 8 expected tool types but a realistic 20-step solution yields `efficiency = 8/20 = 0.40` even for a perfect agent.
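
A minimal reproduction of the bug, with function and variable names assumed for illustration:

```python
def efficiency_buggy(expected_tools: list[str], steps_taken: int) -> float:
    # Bug: the baseline is the count of expected tool *types*, so any
    # realistic multi-step solution is capped well below 1.0.
    return min(1.0, len(expected_tools) / steps_taken)

def efficiency_fixed(expected_steps: int, steps_taken: int) -> float:
    # Fix: compare against a per-scenario expected step budget instead.
    return min(1.0, expected_steps / steps_taken)

tools = [f"tool_{i}" for i in range(8)]  # 8 expected tool types (placeholder names)
capped = efficiency_buggy(tools, 20)     # 0.40, even for a perfect 20-step run
full = efficiency_fixed(20, 20)          # 1.0 once the baseline is a step budget
```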

### 4. gpt-5 OpenEnv Anomaly

gpt-5 scored 0.14 across all scenarios in OpenEnv mode (GT=0.0 on every one), yet the same model scored higher on several scenarios in custom mode. This points to a session or environment issue during the OpenEnv batch run rather than a model capability problem.

### 5. Step Count vs Quality

Some models take far more steps than budgeted (opus-4-6: 234 on conditional_aggregation_01, 198 on cross_sheet_lookup_02) yet still achieve GT=1.0, while others fail in just 3–5 steps. The current scoring does not adequately separate efficient correct agents from slow correct agents.
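
One possible shape for such a scorer, offered as a suggestion rather than anything in the current harness: reserve the top half of the score band for solved runs and scale it by how much of the step budget was left unused.

```python
def step_discounted_score(gt: float, steps: int, max_steps: int) -> float:
    """Solved runs score in [0.5, 1.0], scaled by unused step budget;
    unsolved runs keep their plain GT score."""
    if gt < 1.0:
        return gt
    frac_used = min(steps, max_steps) / max_steps
    return 0.5 + 0.5 * (1.0 - frac_used)

# A 7-step solve of cross_sheet_lookup_02 (max_steps=60) vs a 198-step one:
fast = step_discounted_score(1.0, 7, 60)    # ~0.94
slow = step_discounted_score(1.0, 198, 60)  # 0.5, budget exhausted
```

This keeps every solved run above every unsolved run while still rewarding the 6-step solves over the 200-step ones.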

## Hardening Assessment

SOTA average (custom): 0.67 (gpt-5.4), below the 0.7 hardening threshold.

No hardening is required at this time. The scenarios are challenging enough as-is: even the best model fails 2/12 of them.