---
title: Spreadsheet Environment
emoji: 📊
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
license: mit
app_port: 8000
base_path: /web
tags:
  - openenv
  - openenv-0.2.3
  - rl-environment
---

# Spreadsheet Gym

Exact workbook manipulation and reasoning over realistic spreadsheet tasks.

An OpenEnv RL environment where agents must read, understand, and edit real `.xlsx` workbooks to solve structured tasks — formula repair, cross-sheet lookups, ledger reconciliation, messy data extraction, and more. Designed to stress structured state tracking, cross-sheet reasoning, non-standard table layouts, and exact edit correctness — areas where frontier LLMs consistently struggle.

## Playground Quick Start

Use the Playground panel (right side) to interact with the environment. Type a Tool Name and Arguments Json, then click Step.

### Typical workflow

1. Click Reset to start a fresh session
2. Enter `list_tools` (args: `{}`) → discover all available tools and their parameters
3. Enter `list_scenarios` (args: `{}`) → see all 12 scenarios
4. Enter `load_scenario` (args: `{"scenario_id": "formula_repair_01"}`) → start a task
5. Enter `list_sheets` (args: `{}`) → see all sheets in the workbook
6. Enter `read_range` (args: `{"sheet": "Summary", "range": "A1:F10"}`) → read cell values
7. Enter `inspect_formula` (args: `{"sheet": "Summary", "cell": "C5"}`) → see the raw formula
8. Enter `write_cell` (args: `{"sheet": "Summary", "cell": "C5", "value": "=SUM(B2:B10)"}`) → fix a formula
9. Enter `validate_partial` (args: `{}`) → check how many hidden tests pass so far
10. Enter `submit_workbook` (args: `{}`) → submit for final evaluation (ends the task)
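The ten steps above can also be sanity-checked offline before you type them into the Playground. A minimal sketch (plain Python, no server required) that writes the sequence as `(tool_name, arguments_json)` pairs and verifies every arguments string is valid JSON:

```python
import json

# The typical workflow above, as (tool_name, arguments_json) pairs.
# Illustrative only: these are the same payloads you would type into
# the Playground or send through the SpreadsheetEnv client.
WORKFLOW = [
    ("list_tools", "{}"),
    ("list_scenarios", "{}"),
    ("load_scenario", '{"scenario_id": "formula_repair_01"}'),
    ("list_sheets", "{}"),
    ("read_range", '{"sheet": "Summary", "range": "A1:F10"}'),
    ("inspect_formula", '{"sheet": "Summary", "cell": "C5"}'),
    ("write_cell", '{"sheet": "Summary", "cell": "C5", "value": "=SUM(B2:B10)"}'),
    ("validate_partial", "{}"),
    ("submit_workbook", "{}"),
]

# Every arguments string must parse as a JSON object.
for tool, args in WORKFLOW:
    assert isinstance(json.loads(args), dict), tool
```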

## All tool commands (copy-paste ready)

### Discovery & session tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `list_tools` | `{}` | List every available tool with its parameters and types |
| `get_session_info` | `{}` | Current session ID, loaded scenario, step count, edit count, solve status |
| `list_scenarios` | `{}` | List all 12 scenarios with description, workbook name, and max steps |
| `load_scenario` | `{"scenario_id": "formula_repair_01"}` | Load a scenario and its workbook to begin working |
| `reset_scenario` | `{}` | Restore workbook to original state, keeping the scenario loaded |

### Reading tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `list_sheets` | `{}` | List all sheets with names, dimensions, and visibility |
| `read_range` | `{"sheet": "Summary", "range": "B2:D10"}` | Read a rectangular range of cells (formulas shown as strings) |
| `inspect_formula` | `{"sheet": "Summary", "cell": "C15"}` | Return the raw formula string from a cell |
| `list_named_targets` | `{}` | Show target areas and allowed output zones for the scenario |
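`read_range` and `inspect_formula` address cells in A1 notation. For reference, here is how such references decode into numeric (column, row) pairs; a pure-Python sketch (hypothetical helpers for illustration, not the server's parser; `openpyxl` ships equivalent utilities):

```python
import re

def parse_a1(cell: str):
    """Split an A1 reference like 'C15' into (col_index, row), both 1-based."""
    m = re.fullmatch(r"([A-Z]+)(\d+)", cell.upper())
    if not m:
        raise ValueError(f"not an A1 reference: {cell!r}")
    letters, row = m.groups()
    col = 0
    for ch in letters:                 # column letters are base-26 with A=1
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return col, int(row)

def parse_range(rng: str):
    """'A1:F10' -> ((1, 1), (6, 10)) as ((col, row), (col, row))."""
    start, _, end = rng.partition(":")
    return parse_a1(start), parse_a1(end or start)

assert parse_a1("A1") == (1, 1)
assert parse_a1("AA10") == (27, 10)
assert parse_range("A1:F10") == ((1, 1), (6, 10))
```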

### Writing tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `write_cell` | `{"sheet": "Summary", "cell": "C15", "value": "=SUM(B2:B10)"}` | Write a value or formula to a single cell |
| `write_range` | `{"sheet": "Summary", "start_cell": "A1", "data": "[[1, 2], [3, 4]]"}` | Write a 2D block of values starting from a cell |

Note: `write_range` takes `start_cell` (not `cell`). The `data` argument is a JSON string of a 2D array.
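Because `data` is JSON nested inside JSON, building it by hand is error-prone. A small sketch using `json.dumps` twice (`make_write_range_args` is a hypothetical helper for illustration, not part of the environment):

```python
import json

def make_write_range_args(sheet, start_cell, rows):
    """Build the arguments_json for write_range (hypothetical helper).

    `rows` is a plain 2D Python list; the inner json.dumps turns it
    into the JSON *string* that the data argument expects.
    """
    return json.dumps({
        "sheet": sheet,
        "start_cell": start_cell,
        "data": json.dumps(rows),  # double-encoded on purpose
    })

args = make_write_range_args("Summary", "A1", [[1, 2], [3, 4]])
decoded = json.loads(args)
assert decoded["data"] == "[[1, 2], [3, 4]]"  # a string, not a list
```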

### Validation & submission tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `validate_partial` | `{}` | Check partial progress — how many hidden tests pass/fail (no answers revealed) |
| `submit_workbook` | `{}` | Submit for final evaluation — returns pass rate and per-check results |

### History tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `get_edit_history` | `{}` | Full list of cell edits: sheet, cell, value, step number |

## Important notes

- All string parameters are required — no optional arguments on any tool
- `write_cell` values starting with `=` are treated as formulas (e.g. `"=VLOOKUP(A2,Sheet2!A:B,2,FALSE)"`)
- `write_range` data must be a JSON string: `"[[1, 2], [3, 4]]"`, not `[[1, 2], [3, 4]]`
- Writing outside target regions incurs a reward penalty
- Use `validate_partial` before `submit_workbook` to check progress without ending the task
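The leading-`=` convention can be checked client-side before writing. A minimal sketch of that rule (mirroring the documented behavior, not the server's code):

```python
def is_formula(value) -> bool:
    """True if write_cell would treat `value` as a formula.

    Mirrors the documented convention: string values beginning
    with '=' are formulas; everything else is a literal value.
    """
    return isinstance(value, str) and value.startswith("=")

assert is_formula("=SUM(B2:B10)")
assert is_formula('=VLOOKUP(A2,Sheet2!A:B,2,FALSE)')
assert not is_formula("1234")   # a literal string
assert not is_formula(1234)     # a literal number
```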

## Run locally

```bash
cd spreadsheet
pip install -e .

# Start the environment server
docker build -t openenv-spreadsheet -f Dockerfile .
docker run -d --name spreadsheet -p 8000:8000 openenv-spreadsheet

# Verify it's running
curl http://localhost:8000/health

# Open the playground in your browser
open http://localhost:8000/web/
```

## Hugging Face Space Deployment

This Space is built from the OpenEnv environment `spreadsheet`.

- Space URL: https://huggingface.co/spaces/huzzle-labs/spreadsheet
- OpenEnv pinned ref: 0.2.3
- Hub tag: `openenv`

## Connecting from Code

Connect using the `SpreadsheetEnv` client. Note that `env.step` is awaited, so the calls must run inside an async function:

```python
import asyncio

from spreadsheet import SpreadsheetAction, SpreadsheetEnv


async def main():
    with SpreadsheetEnv.from_env("huzzle-labs/spreadsheet") as env:
        obs = env.reset()
        obs = await env.step(SpreadsheetAction(
            tool_name="list_scenarios",
            arguments_json="{}"
        ))
        obs = await env.step(SpreadsheetAction(
            tool_name="load_scenario",
            arguments_json='{"scenario_id": "formula_repair_01"}'
        ))
        obs = await env.step(SpreadsheetAction(
            tool_name="read_range",
            arguments_json='{"sheet": "Summary", "range": "A1:F10"}'
        ))

asyncio.run(main())
```

Or connect directly to a running server:

```python
env = SpreadsheetEnv(base_url="https://huzzle-labs-spreadsheet.hf.space")
```

## What Is This Gym?

The Spreadsheet gym gives an LLM agent a real `.xlsx` workbook and a task description. The agent must use MCP tools to read sheets, understand the structure, write values or formulas, and submit the workbook for automated evaluation against hidden test checks. Every edit is tracked, and the agent must stay within target regions and step budgets.

Unlike typical code-generation or QA benchmarks, this gym requires:

- Structured state tracking — understanding multi-sheet workbook layouts with varying column structures
- Cross-sheet reasoning — performing lookups, aggregations, and reconciliations across sheets
- Exact edit correctness — writing precise formulas and values that pass deterministic hidden tests
- Strategic tool use — using `validate_partial` to check progress before committing with `submit_workbook`

## Task Families (12 Scenarios)

### Formula Repair (2 scenarios)

Fix broken formulas in multi-department workbooks. Diagnose incorrect references, cascading errors, and wrong aggregation functions.

| Scenario | Description | Max Steps |
|---|---|---|
| `formula_repair_01` | Fix broken formulas in a multi-department budget workbook | 50 |
| `formula_repair_02` | Fix cascading formula errors in a 5-year financial projection | 50 |

### Cross-Sheet Lookup (2 scenarios)

Aggregate data across multiple sheets using lookups and cross-references.

| Scenario | Description | Max Steps |
|---|---|---|
| `cross_sheet_lookup_01` | Aggregate product revenue by region/category across quarterly sheets | 50 |
| `cross_sheet_lookup_02` | Calculate employee bonuses by cross-referencing Employees and Bonus_Tiers | 50 |

### Conditional Aggregation (2 scenarios)

Apply tiered calculations with conditional logic and priority-based allocation.

| Scenario | Description | Max Steps |
|---|---|---|
| `conditional_aggregation_01` | Calculate tiered sales commissions for 15 salespeople | 50 |
| `conditional_aggregation_02` | Allocate a fixed budget across 20 requests with priority-based rates | 50 |

### Ledger Reconciliation (2 scenarios)

Match and reconcile transactions across bank statements and internal ledgers.

| Scenario | Description | Max Steps |
|---|---|---|
| `ledger_reconciliation_01` | Reconcile bank statement against internal ledger — find mismatches | 50 |
| `ledger_reconciliation_02` | Reconcile USD and EUR transaction sheets into a unified summary | 50 |

### Messy Table Extraction (1 scenario)

Extract and clean data from poorly formatted raw exports.

| Scenario | Description | Max Steps |
|---|---|---|
| `messy_table_extraction_01` | Extract/clean invoice data from messy export with mixed formats | 50 |

### Range Transformation (1 scenario)

Reshape and pivot data between long-format and wide-format layouts.

| Scenario | Description | Max Steps |
|---|---|---|
| `range_transformation_01` | Pivot long-format employee metrics into wide-format table | 50 |

### Schedule Grid Fill (1 scenario)

Fill structured grids respecting constraints and rules.

| Scenario | Description | Max Steps |
|---|---|---|
| `schedule_grid_fill_01` | Fill employee schedule grid for 12 employees × 7 days | 50 |

### Buggy Template Fix (1 scenario)

Debug template workbooks with multiple interacting formula errors.

| Scenario | Description | Max Steps |
|---|---|---|
| `buggy_template_fix_01` | Debug quarterly financial report template with broken Annual_Summary | 50 |

## Architecture

```
┌─────────────────────────────────────────┐
│           OpenEnv Server (:8000)        │
│  ┌────────────┐  ┌───────────────────┐  │
│  │  FastMCP   │──│ SpreadsheetEnv    │  │
│  │ (13 tools) │  │ (MCPEnvironment)  │  │
│  └────────────┘  └─────────┬─────────┘  │
│                            │            │
│              ┌─────────────┼──────────┐ │
│              │  Workbook   │ Scenario │ │
│              │  Engine     │  Loader  │ │
│              │ (openpyxl)  │          │ │
│              └─────────────┴──────────┘ │
└─────────────────────────────────────────┘
```

All state is in-memory per session. No database, no external APIs. The workbook engine manages `.xlsx` files via `openpyxl`, tracks edits, and evaluates hidden tests. Formula evaluation uses the `formulas` library.
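The per-session isolation can be pictured as an in-process dict keyed by session ID; a toy sketch (`WorkbookSession` and `SessionStore` are hypothetical names for illustration, not the server's classes):

```python
import uuid

class WorkbookSession:
    """Hypothetical stand-in for one in-memory workbook engine."""
    def __init__(self, scenario_id=None):
        self.scenario_id = scenario_id
        self.edits = []        # (sheet, cell, value) tuples
        self.step_count = 0

class SessionStore:
    """All state lives in this in-process dict; nothing is persisted."""
    def __init__(self):
        self._sessions = {}

    def create(self):
        sid = uuid.uuid4().hex
        self._sessions[sid] = WorkbookSession()
        return sid

    def get(self, sid):
        return self._sessions[sid]

store = SessionStore()
a, b = store.create(), store.create()
store.get(a).edits.append(("Summary", "C5", "=SUM(B2:B10)"))
assert store.get(b).edits == []   # sessions do not share state
```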

## MCP Tools (13 total)

### Session Management (4 tools)

| Tool | Description |
|---|---|
| `get_session_info` | Get current session metadata (scenario, step count, edit count, solved) |
| `list_scenarios` | List all available scenarios with description and max steps |
| `load_scenario` | Load a scenario and its workbook by ID |
| `reset_scenario` | Restore workbook to original state (scenario stays loaded) |

### Reading (4 tools)

| Tool | Description |
|---|---|
| `list_sheets` | List all sheets with names, dimensions, visibility |
| `read_range` | Read cells from a sheet in A1 notation (formulas as strings) |
| `inspect_formula` | Get raw formula string from a specific cell |
| `list_named_targets` | Show allowed output zones for the scenario |

### Writing (2 tools)

| Tool | Description |
|---|---|
| `write_cell` | Write a value or formula to a single cell |
| `write_range` | Write a 2D block of values starting from a cell |

### Validation & Submission (2 tools)

| Tool | Description |
|---|---|
| `validate_partial` | Check progress against hidden tests without revealing answers |
| `submit_workbook` | Submit for final evaluation (pass rate + per-check results) |

### History (1 tool)

| Tool | Description |
|---|---|
| `get_edit_history` | Full edit log with sheet, cell, value, step number |

## Reward System

This gym ships with two reward modes, selectable via `--reward-mode`:

### Custom Rewards — Episode-Level (`rewards/checks.py`)

The `SpreadsheetChecker` verifies ground truth from the episode trajectory and computes a weighted score:

| Component | Weight | Description |
|---|---|---|
| quality | 0.25 | F1 of expected vs used tools + success rate |
| efficiency | 0.15 | 1.0 - (actual_steps / max_steps) — fewer steps = higher |
| ground_truth | 0.60 | Hidden test pass rate from `submit_workbook` |
| penalty | variable | -0.5 (all calls succeed but 0% GT) or -0.2 (<30% GT) |

`total = 0.25 × quality + 0.15 × efficiency + 0.60 × ground_truth + penalty`

```python
from rewards.checks import SpreadsheetChecker

checker = SpreadsheetChecker()
checker.set_episode(episode)
reward = checker.compute_episode_reward()
# {'quality': 0.72, 'efficiency': 0.65, 'ground_truth': 0.80, ..., 'total': 0.68}
```

The base `RewardCalculator` (`rewards/base.py`) wraps this into the standard 3-component formula used across all gyms.
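The weighted total and penalty rules can be sketched numerically. This is illustrative arithmetic under the three weights and penalty thresholds in the table only; the checker's actual total may include further adjustments (its example output above reports 0.68 for the same components):

```python
def episode_reward(quality, efficiency, ground_truth, all_calls_succeeded=True):
    """Sketch of the documented weighting; the real logic is in rewards/checks.py."""
    if ground_truth == 0.0 and all_calls_succeeded:
        penalty = -0.5        # every call worked, yet no hidden test passed
    elif ground_truth < 0.30:
        penalty = -0.2        # under 30% ground truth
    else:
        penalty = 0.0
    return 0.25 * quality + 0.15 * efficiency + 0.60 * ground_truth + penalty

total = episode_reward(quality=0.72, efficiency=0.65, ground_truth=0.80)
assert abs(total - 0.7575) < 1e-9   # 0.18 + 0.0975 + 0.48, no penalty
```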

### OpenEnv Transforms — Per-Step (`rewards/transforms.py`)

The `SpreadsheetStepTransform` provides fine-grained per-step rewards for RL training (GRPO). Each tool call receives a reward based on its outcome:

| Tool | Success | Failure |
|---|---|---|
| `read_range` / `list_sheets` | 0.0 (neutral) | 0.0 |
| `inspect_formula` | +0.05 | 0.0 |
| `validate_partial` (improved) | +0.10 | +0.05 |
| `write_cell` / `write_range` (in target, after read) | +0.10 | -0.10 |
| `write_cell` / `write_range` (out of target) | -0.10 | -0.10 |
| `submit_workbook` (100% pass) | +0.50 | — |
| `submit_workbook` (>50% pass) | +0.20 | — |
| `submit_workbook` (<30% pass) | -0.10 | — |

```python
from rewards.transforms import SpreadsheetStepTransform

transform = SpreadsheetStepTransform()
scored_obs = transform(observation)
print(scored_obs.reward)  # e.g., +0.10 for a write in target after reading
```

The `OpenEnvRewardCalculator` (`rewards/base.py`) combines per-step rewards with ground truth into the same weighted formula, using sign-based quality scoring.
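The outcome table can be mirrored by a small lookup function. This is a sketch of the documented values only, not the real `SpreadsheetStepTransform`; cases the table leaves unspecified are marked as assumptions:

```python
def step_reward(tool, success=True, in_target=True, after_read=True,
                improved=False, pass_rate=None):
    """Per-step reward sketch following the documented outcome table."""
    if tool in ("read_range", "list_sheets"):
        return 0.0                          # reads are neutral either way
    if tool == "inspect_formula":
        return 0.05 if success else 0.0
    if tool == "validate_partial":
        return 0.10 if improved else 0.05
    if tool in ("write_cell", "write_range"):
        if not success or not in_target:
            return -0.10                    # failed or out-of-target writes
        return 0.10 if after_read else 0.0  # assumption: write-before-read case
    if tool == "submit_workbook" and pass_rate is not None:
        if pass_rate == 1.0:
            return 0.50
        if pass_rate > 0.5:
            return 0.20
        if pass_rate < 0.3:
            return -0.10
        return 0.0                          # assumption: 30-50% band unspecified
    return 0.0

assert step_reward("write_cell") == 0.10
assert step_reward("write_cell", in_target=False) == -0.10
assert step_reward("submit_workbook", pass_rate=1.0) == 0.50
```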

## Evaluation

The included `run_eval.py` runs an LLM agent against scenarios and scores results.

### Quick Start

```bash
cd spreadsheet
pip install -e .

# Build and run the environment
docker build -t openenv-spreadsheet -f Dockerfile .
docker run -d --name spreadsheet -p 8000:8000 openenv-spreadsheet

# Verify
curl http://localhost:8000/health

# Evaluate (single model, custom rewards)
python run_eval.py --model gpt-5.4 --save --trajectory

# Evaluate (multiple models, per-step rewards)
python run_eval.py --model gpt-5.4,claude-sonnet-4-6,claude-opus-4-6 \
  --parallel 3 --reward-mode openenv --save --trajectory

# Evaluate a specific scenario
python run_eval.py --model gpt-5.4 --scenario formula_repair_01

# Cleanup
docker stop spreadsheet && docker rm spreadsheet
```

### Output Paths

| Output | Path |
|---|---|
| Results markdown | `outputs/results/<run_id>.md` |
| Trajectory JSON | `outputs/trajectories/<run_id>/<model>.json` |

Results files append per-model sections so you can accumulate multiple model runs in one file.

### CLI Arguments

| Argument | Default | Description |
|---|---|---|
| `--model` | gpt-4o | LiteLLM model string (comma-separated for parallel) |
| `--scenario` | all | Run a specific scenario by ID |
| `--reward-mode` | custom | `custom` (episode-level) or `openenv` (per-step) |
| `--parallel` | 1 | Number of models to run in parallel |
| `--save` | off | Save results markdown |
| `--trajectory` | off | Save trajectory JSON |
| `--temperature` | 0.0 | LLM sampling temperature |
| `--max-tokens` | 1024 | Max tokens per LLM response |
| `--run-id` | auto | Run identifier for grouping outputs |
| `--verbose` | off | Enable debug logging |

## Project Structure

```
spreadsheet/
├── __init__.py                  # Package exports (env + rewards)
├── client.py                    # OpenEnv client integration
├── models.py                    # Action/Observation data models
├── openenv.yaml                 # OpenEnv AutoEnv manifest
├── pyproject.toml               # Dependencies (openenv-core v0.2.3)
├── Dockerfile                   # Root Dockerfile for HF Spaces
├── .dockerignore
├── run_eval.py                  # LLM evaluation runner
│
├── rewards/                     # Reward system (both modes)
│   ├── __init__.py
│   ├── base.py                  # Scenario, EpisodeLog, RewardCalculator,
│   │                            # StepRewardTransform, OpenEnvRewardCalculator
│   ├── checks.py                # SpreadsheetChecker (episode-level)
│   └── transforms.py            # SpreadsheetStepTransform (per-step)
│
├── scenarios/                   # Scenario definitions + JSON configs
│   ├── __init__.py
│   ├── definitions.py           # 12 Scenario objects (Python)
│   └── *.json                   # Scenario board configs
│
├── agent/                       # LLM agent runner
│   ├── __init__.py
│   ├── llm.py                   # LiteLLM wrapper
│   └── runner.py                # AgentRunner (gym-agnostic)
│
├── server/                      # OpenEnv environment server
│   ├── __init__.py
│   ├── app.py                   # FastAPI + FastMCP server
│   ├── spreadsheet_environment.py  # MCPEnvironment implementation
│   ├── workbook_engine.py       # Workbook engine (openpyxl)
│   ├── formula_utils.py         # Formula evaluation
│   ├── scenario_loader.py       # Scenario JSON loader
│   └── Dockerfile               # Server-only Dockerfile
│
├── workbooks/                   # Workbook files
│   ├── templates/               # Base workbook templates
│   ├── fixtures/                # Test fixture workbooks
│   └── hidden_tests/            # Hidden test check definitions
│
└── outputs/                     # Evaluation outputs (gitignored)
    ├── results/                 # Markdown result files
    └── trajectories/            # JSON trajectory files
```

## Configuration (.env)

Copy `.env.example` to `.env` and fill in your API keys:

```bash
cp .env.example .env
# Edit .env with your API keys
```

### LLM API Keys

| Variable | Required For | Description |
|---|---|---|
| `OPENAI_API_KEY` | gpt-4o, gpt-5.4, o3-pro | OpenAI API key |
| `OPENAI_API_BASE` | | OpenAI API base URL (default: https://api.openai.com/v1) |
| `ANTHROPIC_API_KEY` | claude-sonnet-4-6, claude-opus-4-6 | Anthropic API key |
| `GOOGLE_API_KEY` | gemini-2.5-pro | Google AI API key |

Only the key for your chosen `--model` provider is required. For local models via Ollama, no key is needed.

### LLM Defaults

| Variable | Default | Description |
|---|---|---|
| `LLM_MODEL` | gpt-4o | Default model when `--model` is not specified |
| `LLM_TEMPERATURE` | 0.0 | Default sampling temperature |
| `LLM_MAX_TOKENS` | 1024 | Default max tokens per response |

### Environment Server

| Variable | Default | Description |
|---|---|---|
| `OPENENV_PORT` | 8000 | OpenEnv server port (exposed) |
| `MAX_CONCURRENT_ENVS` | 8 | Max parallel evaluation sessions |
| `ENABLE_WEB_INTERFACE` | true | Enable HF Spaces web UI |
| `WORKBOOKS_DIR` | workbooks | Workbook files directory |
| `SCENARIOS_DIR` | scenarios | Scenario JSON directory |

## Concurrent Sessions

Each evaluation session gets its own isolated workbook engine instance. Multiple agents can evaluate simultaneously against the same Docker container without interference.