---
title: Spreadsheet Environment
emoji: 📊
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
license: mit
app_port: 8000
base_path: /web
tags:
  - openenv
  - openenv-0.2.3
  - rl-environment
---

# Spreadsheet Gym

Exact workbook manipulation and reasoning over realistic spreadsheet tasks.

An OpenEnv RL environment where agents must read, understand, and edit real `.xlsx` workbooks to solve structured tasks — formula repair, cross-sheet lookups, ledger reconciliation, messy data extraction, and more. Designed to stress structured state tracking, cross-sheet reasoning, non-standard table layouts, and exact edit correctness — areas where frontier LLMs consistently struggle.

## Playground Quick Start

Use the Playground panel (right side) to interact with the environment. Type a Tool Name and Arguments Json, then click Step.

### Typical workflow

1. Click Reset to start a fresh session
2. Enter `list_tools` (args: `{}`) → discover all available tools and their parameters
3. Enter `list_scenarios` (args: `{}`) → see all 12 scenarios
4. Enter `load_scenario` (args: `{"scenario_id": "formula_repair_01"}`) → start a task
5. Enter `list_sheets` (args: `{}`) → see all sheets in the workbook
6. Enter `read_range` (args: `{"sheet": "Summary", "range": "A1:F10"}`) → read cell values
7. Enter `inspect_formula` (args: `{"sheet": "Summary", "cell": "C5"}`) → see the raw formula
8. Enter `write_cell` (args: `{"sheet": "Summary", "cell": "C5", "value": "=SUM(B2:B10)"}`) → fix a formula
9. Enter `validate_partial` (args: `{}`) → check how many hidden tests pass so far
10. Enter `submit_workbook` (args: `{}`) → submit for final evaluation (ends the task)
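The ten steps above can also be sanity-checked offline before you type them into the Playground. A minimal sketch (plain Python, no server required) that writes the sequence as `(tool_name, arguments_json)` pairs and verifies every arguments string is valid JSON:

```python
import json

# The typical workflow above, as (tool_name, arguments_json) pairs.
# Illustrative only: these are the same payloads you would type into
# the Playground or send through the SpreadsheetEnv client.
WORKFLOW = [
    ("list_tools", "{}"),
    ("list_scenarios", "{}"),
    ("load_scenario", '{"scenario_id": "formula_repair_01"}'),
    ("list_sheets", "{}"),
    ("read_range", '{"sheet": "Summary", "range": "A1:F10"}'),
    ("inspect_formula", '{"sheet": "Summary", "cell": "C5"}'),
    ("write_cell", '{"sheet": "Summary", "cell": "C5", "value": "=SUM(B2:B10)"}'),
    ("validate_partial", "{}"),
    ("submit_workbook", "{}"),
]

# Every arguments string must parse as a JSON object.
for tool, args in WORKFLOW:
    assert isinstance(json.loads(args), dict), tool
```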

## All tool commands (copy-paste ready)

### Discovery & session tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `list_tools` | `{}` | List every available tool with its parameters and types |
| `get_session_info` | `{}` | Current session ID, loaded scenario, step count, edit count, solve status |
| `list_scenarios` | `{}` | List all 12 scenarios with description, workbook name, and max steps |
| `load_scenario` | `{"scenario_id": "formula_repair_01"}` | Load a scenario and its workbook to begin working |
| `reset_scenario` | `{}` | Restore workbook to original state, keeping the scenario loaded |

### Reading tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `list_sheets` | `{}` | List all sheets with names, dimensions, and visibility |
| `read_range` | `{"sheet": "Summary", "range": "B2:D10"}` | Read a rectangular range of cells (formulas shown as strings) |
| `inspect_formula` | `{"sheet": "Summary", "cell": "C15"}` | Return the raw formula string from a cell |
| `list_named_targets` | `{}` | Show target areas and allowed output zones for the scenario |
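`read_range` and `inspect_formula` address cells in A1 notation. For reference, here is how such references decode into numeric (column, row) pairs; a pure-Python sketch (hypothetical helpers for illustration, not the server's parser; `openpyxl` ships equivalent utilities):

```python
import re

def parse_a1(cell: str):
    """Split an A1 reference like 'C15' into (col_index, row), both 1-based."""
    m = re.fullmatch(r"([A-Z]+)(\d+)", cell.upper())
    if not m:
        raise ValueError(f"not an A1 reference: {cell!r}")
    letters, row = m.groups()
    col = 0
    for ch in letters:                 # column letters are base-26 with A=1
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return col, int(row)

def parse_range(rng: str):
    """'A1:F10' -> ((1, 1), (6, 10)) as ((col, row), (col, row))."""
    start, _, end = rng.partition(":")
    return parse_a1(start), parse_a1(end or start)

assert parse_a1("A1") == (1, 1)
assert parse_a1("AA10") == (27, 10)
assert parse_range("A1:F10") == ((1, 1), (6, 10))
```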

### Writing tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `write_cell` | `{"sheet": "Summary", "cell": "C15", "value": "=SUM(B2:B10)"}` | Write a value or formula to a single cell |
| `write_range` | `{"sheet": "Summary", "start_cell": "A1", "data": "[[1, 2], [3, 4]]"}` | Write a 2D block of values starting from a cell |

Note: `write_range` takes `start_cell` (not `cell`). The `data` argument is a JSON string of a 2D array.
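Because `data` is JSON nested inside JSON, building it by hand is error-prone. A small sketch using `json.dumps` twice (`make_write_range_args` is a hypothetical helper for illustration, not part of the environment):

```python
import json

def make_write_range_args(sheet, start_cell, rows):
    """Build the arguments_json for write_range (hypothetical helper).

    `rows` is a plain 2D Python list; the inner json.dumps turns it
    into the JSON *string* that the data argument expects.
    """
    return json.dumps({
        "sheet": sheet,
        "start_cell": start_cell,
        "data": json.dumps(rows),  # double-encoded on purpose
    })

args = make_write_range_args("Summary", "A1", [[1, 2], [3, 4]])
decoded = json.loads(args)
assert decoded["data"] == "[[1, 2], [3, 4]]"  # a string, not a list
```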

### Validation & submission tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `validate_partial` | `{}` | Check partial progress — how many hidden tests pass/fail (no answers revealed) |
| `submit_workbook` | `{}` | Submit for final evaluation — returns pass rate and per-check results |

### History tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `get_edit_history` | `{}` | Full list of cell edits: sheet, cell, value, step number |

## Important notes

- All string parameters are required — no optional arguments on any tool
- `write_cell` values starting with `=` are treated as formulas (e.g. `"=VLOOKUP(A2,Sheet2!A:B,2,FALSE)"`)
- `write_range` data must be a JSON string: `"[[1, 2], [3, 4]]"`, not `[[1, 2], [3, 4]]`
- Writing outside target regions incurs a reward penalty
- Use `validate_partial` before `submit_workbook` to check progress without ending the task
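The leading-`=` convention can be checked client-side before writing. A minimal sketch of that rule (mirroring the documented behavior, not the server's code):

```python
def is_formula(value) -> bool:
    """True if write_cell would treat `value` as a formula.

    Mirrors the documented convention: string values beginning
    with '=' are formulas; everything else is a literal value.
    """
    return isinstance(value, str) and value.startswith("=")

assert is_formula("=SUM(B2:B10)")
assert is_formula('=VLOOKUP(A2,Sheet2!A:B,2,FALSE)')
assert not is_formula("1234")   # a literal string
assert not is_formula(1234)     # a literal number
```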

## Run locally

```bash
cd spreadsheet
pip install -e .

# Start the environment server
docker build -t openenv-spreadsheet -f Dockerfile .
docker run -d --name spreadsheet -p 8000:8000 openenv-spreadsheet

# Verify it's running
curl http://localhost:8000/health

# Open the playground in your browser
open http://localhost:8000/web/
```

## Hugging Face Space Deployment

This Space is built from the OpenEnv environment `spreadsheet`.

- Space URL: https://huggingface.co/spaces/huzzle-labs/spreadsheet
- OpenEnv pinned ref: 0.2.3
- Hub tag: `openenv`

## Connecting from Code

Connect using the `SpreadsheetEnv` client. Note that `env.step` is awaited, so the calls must run inside an async function:

```python
import asyncio

from spreadsheet import SpreadsheetAction, SpreadsheetEnv


async def main():
    with SpreadsheetEnv.from_env("huzzle-labs/spreadsheet") as env:
        obs = env.reset()
        obs = await env.step(SpreadsheetAction(
            tool_name="list_scenarios",
            arguments_json="{}"
        ))
        obs = await env.step(SpreadsheetAction(
            tool_name="load_scenario",
            arguments_json='{"scenario_id": "formula_repair_01"}'
        ))
        obs = await env.step(SpreadsheetAction(
            tool_name="read_range",
            arguments_json='{"sheet": "Summary", "range": "A1:F10"}'
        ))

asyncio.run(main())
```

Or connect directly to a running server:

```python
env = SpreadsheetEnv(base_url="https://huzzle-labs-spreadsheet.hf.space")
```

## What Is This Gym?

The Spreadsheet gym gives an LLM agent a real `.xlsx` workbook and a task description. The agent must use MCP tools to read sheets, understand the structure, write values or formulas, and submit the workbook for automated evaluation against hidden test checks. Every edit is tracked, and the agent must stay within target regions and step budgets.

Unlike typical code-generation or QA benchmarks, this gym requires:

- Structured state tracking — understanding multi-sheet workbook layouts with varying column structures
- Cross-sheet reasoning — performing lookups, aggregations, and reconciliations across sheets
- Exact edit correctness — writing precise formulas and values that pass deterministic hidden tests
- Strategic tool use — using `validate_partial` to check progress before committing with `submit_workbook`

## Task Families (12 Scenarios)

### Formula Repair (2 scenarios)

Fix broken formulas in multi-department workbooks. Diagnose incorrect references, cascading errors, and wrong aggregation functions.

| Scenario | Description | Max Steps |
|---|---|---|
| `formula_repair_01` | Fix broken formulas in a multi-department budget workbook | 50 |
| `formula_repair_02` | Fix cascading formula errors in a 5-year financial projection | 50 |

### Cross-Sheet Lookup (2 scenarios)

Aggregate data across multiple sheets using lookups and cross-references.

| Scenario | Description | Max Steps |
|---|---|---|
| `cross_sheet_lookup_01` | Aggregate product revenue by region/category across quarterly sheets | 50 |
| `cross_sheet_lookup_02` | Calculate employee bonuses by cross-referencing Employees and Bonus_Tiers | 50 |

### Conditional Aggregation (2 scenarios)

Apply tiered calculations with conditional logic and priority-based allocation.

| Scenario | Description | Max Steps |
|---|---|---|
| `conditional_aggregation_01` | Calculate tiered sales commissions for 15 salespeople | 50 |
| `conditional_aggregation_02` | Allocate a fixed budget across 20 requests with priority-based rates | 50 |

### Ledger Reconciliation (2 scenarios)

Match and reconcile transactions across bank statements and internal ledgers.

| Scenario | Description | Max Steps |
|---|---|---|
| `ledger_reconciliation_01` | Reconcile bank statement against internal ledger — find mismatches | 50 |
| `ledger_reconciliation_02` | Reconcile USD and EUR transaction sheets into a unified summary | 50 |

### Messy Table Extraction (1 scenario)

Extract and clean data from poorly formatted raw exports.

| Scenario | Description | Max Steps |
|---|---|---|
| `messy_table_extraction_01` | Extract/clean invoice data from messy export with mixed formats | 50 |

### Range Transformation (1 scenario)

Reshape and pivot data between long-format and wide-format layouts.

| Scenario | Description | Max Steps |
|---|---|---|
| `range_transformation_01` | Pivot long-format employee metrics into wide-format table | 50 |

### Schedule Grid Fill (1 scenario)

Fill structured grids respecting constraints and rules.

| Scenario | Description | Max Steps |
|---|---|---|
| `schedule_grid_fill_01` | Fill employee schedule grid for 12 employees × 7 days | 50 |

### Buggy Template Fix (1 scenario)

Debug template workbooks with multiple interacting formula errors.

| Scenario | Description | Max Steps |
|---|---|---|
| `buggy_template_fix_01` | Debug quarterly financial report template with broken Annual_Summary | 50 |

## Architecture

```
┌─────────────────────────────────────────┐
│           OpenEnv Server (:8000)        │
│  ┌────────────┐  ┌───────────────────┐  │
│  │  FastMCP   │──│ SpreadsheetEnv    │  │
│  │ (13 tools) │  │ (MCPEnvironment)  │  │
│  └────────────┘  └─────────┬─────────┘  │
│                            │            │
│              ┌─────────────┼──────────┐ │
│              │  Workbook   │ Scenario │ │
│              │  Engine     │  Loader  │ │
│              │ (openpyxl)  │          │ │
│              └─────────────┴──────────┘ │
└─────────────────────────────────────────┘
```

All state is in-memory per session. No database, no external APIs. The workbook engine manages `.xlsx` files via `openpyxl`, tracks edits, and evaluates hidden tests. Formula evaluation uses the `formulas` library.
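The per-session isolation can be pictured as an in-process dict keyed by session ID; a toy sketch (`WorkbookSession` and `SessionStore` are hypothetical names for illustration, not the server's classes):

```python
import uuid

class WorkbookSession:
    """Hypothetical stand-in for one in-memory workbook engine."""
    def __init__(self, scenario_id=None):
        self.scenario_id = scenario_id
        self.edits = []        # (sheet, cell, value) tuples
        self.step_count = 0

class SessionStore:
    """All state lives in this in-process dict; nothing is persisted."""
    def __init__(self):
        self._sessions = {}

    def create(self):
        sid = uuid.uuid4().hex
        self._sessions[sid] = WorkbookSession()
        return sid

    def get(self, sid):
        return self._sessions[sid]

store = SessionStore()
a, b = store.create(), store.create()
store.get(a).edits.append(("Summary", "C5", "=SUM(B2:B10)"))
assert store.get(b).edits == []   # sessions do not share state
```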

## MCP Tools (13 total)

### Session Management (4 tools)

| Tool | Description |
|---|---|
| `get_session_info` | Get current session metadata (scenario, step count, edit count, solved) |
| `list_scenarios` | List all available scenarios with description and max steps |
| `load_scenario` | Load a scenario and its workbook by ID |
| `reset_scenario` | Restore workbook to original state (scenario stays loaded) |

### Reading (4 tools)

| Tool | Description |
|---|---|
| `list_sheets` | List all sheets with names, dimensions, visibility |
| `read_range` | Read cells from a sheet in A1 notation (formulas as strings) |
| `inspect_formula` | Get raw formula string from a specific cell |
| `list_named_targets` | Show allowed output zones for the scenario |

### Writing (2 tools)

| Tool | Description |
|---|---|
| `write_cell` | Write a value or formula to a single cell |
| `write_range` | Write a 2D block of values starting from a cell |

### Validation & Submission (2 tools)

| Tool | Description |
|---|---|
| `validate_partial` | Check progress against hidden tests without revealing answers |
| `submit_workbook` | Submit for final evaluation (pass rate + per-check results) |

### History (1 tool)

| Tool | Description |
|---|---|
| `get_edit_history` | Full edit log with sheet, cell, value, step number |

## Reward System

This gym ships with two reward modes, selectable via `--reward-mode`:

### Custom Rewards — Episode-Level (`rewards/checks.py`)

The `SpreadsheetChecker` verifies ground truth from the episode trajectory and computes a weighted score:

| Component | Weight | Description |
|---|---|---|
| quality | 0.25 | F1 of expected vs used tools + success rate |
| efficiency | 0.15 | 1.0 - (actual_steps / max_steps) — fewer steps = higher |
| ground_truth | 0.60 | Hidden test pass rate from `submit_workbook` |
| penalty | variable | -0.5 (all calls succeed but 0% GT) or -0.2 (<30% GT) |

`total = 0.25 × quality + 0.15 × efficiency + 0.60 × ground_truth + penalty`

```python
from rewards.checks import SpreadsheetChecker

checker = SpreadsheetChecker()
checker.set_episode(episode)
reward = checker.compute_episode_reward()
# {'quality': 0.72, 'efficiency': 0.65, 'ground_truth': 0.80, ..., 'total': 0.68}
```

The base `RewardCalculator` (`rewards/base.py`) wraps this into the standard 3-component formula used across all gyms.
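The weighted total and penalty rules can be sketched numerically. This is illustrative arithmetic under the three weights and penalty thresholds in the table only; the checker's actual total may include further adjustments (its example output above reports 0.68 for the same components):

```python
def episode_reward(quality, efficiency, ground_truth, all_calls_succeeded=True):
    """Sketch of the documented weighting; the real logic is in rewards/checks.py."""
    if ground_truth == 0.0 and all_calls_succeeded:
        penalty = -0.5        # every call worked, yet no hidden test passed
    elif ground_truth < 0.30:
        penalty = -0.2        # under 30% ground truth
    else:
        penalty = 0.0
    return 0.25 * quality + 0.15 * efficiency + 0.60 * ground_truth + penalty

total = episode_reward(quality=0.72, efficiency=0.65, ground_truth=0.80)
assert abs(total - 0.7575) < 1e-9   # 0.18 + 0.0975 + 0.48, no penalty
```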

### OpenEnv Transforms — Per-Step (`rewards/transforms.py`)

The `SpreadsheetStepTransform` provides fine-grained per-step rewards for RL training (GRPO). Each tool call receives a reward based on its outcome:

| Tool | Success | Failure |
|---|---|---|
| `read_range` / `list_sheets` | 0.0 (neutral) | 0.0 |
| `inspect_formula` | +0.05 | 0.0 |
| `validate_partial` (improved) | +0.10 | +0.05 |
| `write_cell` / `write_range` (in target, after read) | +0.10 | -0.10 |
| `write_cell` / `write_range` (out of target) | -0.10 | -0.10 |
| `submit_workbook` (100% pass) | +0.50 | — |
| `submit_workbook` (>50% pass) | +0.20 | — |
| `submit_workbook` (<30% pass) | -0.10 | — |

```python
from rewards.transforms import SpreadsheetStepTransform

transform = SpreadsheetStepTransform()
scored_obs = transform(observation)
print(scored_obs.reward)  # e.g., +0.10 for a write in target after reading
```

The `OpenEnvRewardCalculator` (`rewards/base.py`) combines per-step rewards with ground truth into the same weighted formula, using sign-based quality scoring.
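The outcome table can be mirrored by a small lookup function. This is a sketch of the documented values only, not the real `SpreadsheetStepTransform`; cases the table leaves unspecified are marked as assumptions:

```python
def step_reward(tool, success=True, in_target=True, after_read=True,
                improved=False, pass_rate=None):
    """Per-step reward sketch following the documented outcome table."""
    if tool in ("read_range", "list_sheets"):
        return 0.0                          # reads are neutral either way
    if tool == "inspect_formula":
        return 0.05 if success else 0.0
    if tool == "validate_partial":
        return 0.10 if improved else 0.05
    if tool in ("write_cell", "write_range"):
        if not success or not in_target:
            return -0.10                    # failed or out-of-target writes
        return 0.10 if after_read else 0.0  # assumption: write-before-read case
    if tool == "submit_workbook" and pass_rate is not None:
        if pass_rate == 1.0:
            return 0.50
        if pass_rate > 0.5:
            return 0.20
        if pass_rate < 0.3:
            return -0.10
        return 0.0                          # assumption: 30-50% band unspecified
    return 0.0

assert step_reward("write_cell") == 0.10
assert step_reward("write_cell", in_target=False) == -0.10
assert step_reward("submit_workbook", pass_rate=1.0) == 0.50
```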

## Evaluation

The included `run_eval.py` runs an LLM agent against scenarios and scores results.

### Quick Start

```bash
cd spreadsheet
pip install -e .

# Build and run the environment
docker build -t openenv-spreadsheet -f Dockerfile .
docker run -d --name spreadsheet -p 8000:8000 openenv-spreadsheet

# Verify
curl http://localhost:8000/health

# Evaluate (single model, custom rewards)
python run_eval.py --model gpt-5.4 --save --trajectory

# Evaluate (multiple models, per-step rewards)
python run_eval.py --model gpt-5.4,claude-sonnet-4-6,claude-opus-4-6 \
  --parallel 3 --reward-mode openenv --save --trajectory

# Evaluate a specific scenario
python run_eval.py --model gpt-5.4 --scenario formula_repair_01

# Cleanup
docker stop spreadsheet && docker rm spreadsheet
```

### Output Paths

| Output | Path |
|---|---|
| Results markdown | `outputs/results/<run_id>.md` |
| Trajectory JSON | `outputs/trajectories/<run_id>/<model>.json` |

Results files append per-model sections so you can accumulate multiple model runs in one file.

### CLI Arguments

| Argument | Default | Description |
|---|---|---|
| `--model` | gpt-4o | LiteLLM model string (comma-separated for parallel) |
| `--scenario` | all | Run a specific scenario by ID |
| `--reward-mode` | custom | `custom` (episode-level) or `openenv` (per-step) |
| `--parallel` | 1 | Number of models to run in parallel |
| `--save` | off | Save results markdown |
| `--trajectory` | off | Save trajectory JSON |
| `--temperature` | 0.0 | LLM sampling temperature |
| `--max-tokens` | 1024 | Max tokens per LLM response |
| `--run-id` | auto | Run identifier for grouping outputs |
| `--verbose` | off | Enable debug logging |

## Project Structure

```
spreadsheet/
├── __init__.py                  # Package exports (env + rewards)
├── client.py                    # OpenEnv client integration
├── models.py                    # Action/Observation data models
├── openenv.yaml                 # OpenEnv AutoEnv manifest
├── pyproject.toml               # Dependencies (openenv-core v0.2.3)
├── Dockerfile                   # Root Dockerfile for HF Spaces
├── .dockerignore
├── run_eval.py                  # LLM evaluation runner
│
├── rewards/                     # Reward system (both modes)
│   ├── __init__.py
│   ├── base.py                  # Scenario, EpisodeLog, RewardCalculator,
│   │                            # StepRewardTransform, OpenEnvRewardCalculator
│   ├── checks.py                # SpreadsheetChecker (episode-level)
│   └── transforms.py            # SpreadsheetStepTransform (per-step)
│
├── scenarios/                   # Scenario definitions + JSON configs
│   ├── __init__.py
│   ├── definitions.py           # 12 Scenario objects (Python)
│   └── *.json                   # Scenario board configs
│
├── agent/                       # LLM agent runner
│   ├── __init__.py
│   ├── llm.py                   # LiteLLM wrapper
│   └── runner.py                # AgentRunner (gym-agnostic)
│
├── server/                      # OpenEnv environment server
│   ├── __init__.py
│   ├── app.py                   # FastAPI + FastMCP server
│   ├── spreadsheet_environment.py  # MCPEnvironment implementation
│   ├── workbook_engine.py       # Workbook engine (openpyxl)
│   ├── formula_utils.py         # Formula evaluation
│   ├── scenario_loader.py       # Scenario JSON loader
│   └── Dockerfile               # Server-only Dockerfile
│
├── workbooks/                   # Workbook files
│   ├── templates/               # Base workbook templates
│   ├── fixtures/                # Test fixture workbooks
│   └── hidden_tests/            # Hidden test check definitions
│
└── outputs/                     # Evaluation outputs (gitignored)
    ├── results/                 # Markdown result files
    └── trajectories/            # JSON trajectory files
```

## Configuration (.env)

Copy `.env.example` to `.env` and fill in your API keys:

```bash
cp .env.example .env
# Edit .env with your API keys
```

### LLM API Keys

| Variable | Required For | Description |
|---|---|---|
| `OPENAI_API_KEY` | gpt-4o, gpt-5.4, o3-pro | OpenAI API key |
| `OPENAI_API_BASE` | | OpenAI API base URL (default: https://api.openai.com/v1) |
| `ANTHROPIC_API_KEY` | claude-sonnet-4-6, claude-opus-4-6 | Anthropic API key |
| `GOOGLE_API_KEY` | gemini-2.5-pro | Google AI API key |

Only the key for your chosen `--model` provider is required. For local models via Ollama, no key is needed.

### LLM Defaults

| Variable | Default | Description |
|---|---|---|
| `LLM_MODEL` | gpt-4o | Default model when `--model` is not specified |
| `LLM_TEMPERATURE` | 0.0 | Default sampling temperature |
| `LLM_MAX_TOKENS` | 1024 | Default max tokens per response |

### Environment Server

| Variable | Default | Description |
|---|---|---|
| `OPENENV_PORT` | 8000 | OpenEnv server port (exposed) |
| `MAX_CONCURRENT_ENVS` | 8 | Max parallel evaluation sessions |
| `ENABLE_WEB_INTERFACE` | true | Enable HF Spaces web UI |
| `WORKBOOKS_DIR` | workbooks | Workbook files directory |
| `SCENARIOS_DIR` | scenarios | Scenario JSON directory |

## Concurrent Sessions

Each evaluation session gets its own isolated workbook engine instance. Multiple agents can evaluate simultaneously against the same Docker container without interference.