---
title: Spreadsheet Environment
emoji: π
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
license: mit
app_port: 8000
base_path: /web
tags:
  - openenv
  - openenv-0.2.3
  - rl-environment
---
# Spreadsheet Gym

Exact workbook manipulation and reasoning over realistic spreadsheet tasks.

An OpenEnv RL environment where agents must read, understand, and edit real `.xlsx` workbooks to solve structured tasks: formula repair, cross-sheet lookups, ledger reconciliation, messy data extraction, and more. It is designed to stress structured state tracking, cross-sheet reasoning, non-standard table layouts, and exact edit correctness – areas where frontier LLMs consistently struggle.
## Playground Quick Start

Use the Playground panel (right side) to interact with the environment. Type a Tool Name and Arguments Json, then click Step.

Typical workflow:

1. Click Reset to start a fresh session
2. Enter `list_tools(args: {})` – discover all available tools and their parameters
3. Enter `list_scenarios(args: {})` – see all 12 scenarios
4. Enter `load_scenario(args: {"scenario_id": "formula_repair_01"})` – start a task
5. Enter `list_sheets(args: {})` – see all sheets in the workbook
6. Enter `read_range(args: {"sheet": "Summary", "range": "A1:F10"})` – read cell values
7. Enter `inspect_formula(args: {"sheet": "Summary", "cell": "C5"})` – see the raw formula
8. Enter `write_cell(args: {"sheet": "Summary", "cell": "C5", "value": "=SUM(B2:B10)"})` – fix a formula
9. Enter `validate_partial(args: {})` – check how many hidden tests pass so far
10. Enter `submit_workbook(args: {})` – submit for final evaluation (ends the task)
## All tool commands (copy-paste ready)

### Discovery & session tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `list_tools` | `{}` | List every available tool with its parameters and types |
| `get_session_info` | `{}` | Current session ID, loaded scenario, step count, edit count, solve status |
| `list_scenarios` | `{}` | List all 12 scenarios with description, workbook name, and max steps |
| `load_scenario` | `{"scenario_id": "formula_repair_01"}` | Load a scenario and its workbook to begin working |
| `reset_scenario` | `{}` | Restore workbook to original state, keeping the scenario loaded |
### Reading tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `list_sheets` | `{}` | List all sheets with names, dimensions, and visibility |
| `read_range` | `{"sheet": "Summary", "range": "B2:D10"}` | Read a rectangular range of cells (formulas shown as strings) |
| `inspect_formula` | `{"sheet": "Summary", "cell": "C15"}` | Return the raw formula string from a cell |
| `list_named_targets` | `{}` | Show target areas and allowed output zones for the scenario |
### Writing tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `write_cell` | `{"sheet": "Summary", "cell": "C15", "value": "=SUM(B2:B10)"}` | Write a value or formula to a single cell |
| `write_range` | `{"sheet": "Summary", "start_cell": "A1", "data": "[[1, 2], [3, 4]]"}` | Write a 2D block of values starting from a cell |

Note: `write_range` takes `start_cell` (not `cell`). The `data` argument is a JSON string of a 2D array.
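Because `data` is a JSON string rather than a raw array, it is easiest to build it programmatically. A minimal sketch (the variable names here are illustrative, not part of the environment):

```python
import json

# The `data` argument must be a JSON *string* encoding a 2D array,
# so serialize the nested list before building the arguments.
rows = [[1, 2], [3, 4]]
args = {"sheet": "Summary", "start_cell": "A1", "data": json.dumps(rows)}

print(args["data"])  # [[1, 2], [3, 4]]  (as a string)
```

Round-tripping with `json.loads` is a quick sanity check that the payload is well-formed before sending it.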
### Validation & submission tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `validate_partial` | `{}` | Check partial progress – how many hidden tests pass/fail (no answers revealed) |
| `submit_workbook` | `{}` | Submit for final evaluation – returns pass rate and per-check results |

### History tools

| Tool Name | Arguments Json | Description |
|---|---|---|
| `get_edit_history` | `{}` | Full list of cell edits: sheet, cell, value, step number |
## Important notes

- All string parameters are required – no optional arguments on any tool
- `write_cell` values starting with `=` are treated as formulas (e.g. `"=VLOOKUP(A2,Sheet2!A:B,2,FALSE)"`)
- `write_range` data must be a JSON string: `"[[1, 2], [3, 4]]"`, not `[[1, 2], [3, 4]]`
- Writing outside target regions incurs a reward penalty
- Use `validate_partial` before `submit_workbook` to check progress without ending the task
## Run locally

```bash
cd spreadsheet
pip install -e .

# Build and start the environment server
docker build -t openenv-spreadsheet -f Dockerfile .
docker run -d --name spreadsheet -p 8000:8000 openenv-spreadsheet

# Verify it's running
curl http://localhost:8000/health

# Open the playground in your browser
open http://localhost:8000/web/
```
## Hugging Face Space Deployment

This Space is built from the OpenEnv environment `spreadsheet`.

- Space URL: https://huggingface.co/spaces/huzzle-labs/spreadsheet
- OpenEnv pinned ref: `0.2.3`
- Hub tag: `openenv`
## Connecting from Code

Connect using the `SpreadsheetEnv` client (the `await` calls assume you are inside an async context):

```python
from spreadsheet import SpreadsheetAction, SpreadsheetEnv

with SpreadsheetEnv.from_env("huzzle-labs/spreadsheet") as env:
    obs = env.reset()
    obs = await env.step(SpreadsheetAction(
        tool_name="list_scenarios",
        arguments_json="{}"
    ))
    obs = await env.step(SpreadsheetAction(
        tool_name="load_scenario",
        arguments_json='{"scenario_id": "formula_repair_01"}'
    ))
    obs = await env.step(SpreadsheetAction(
        tool_name="read_range",
        arguments_json='{"sheet": "Summary", "range": "A1:F10"}'
    ))
```

Or connect directly to a running server:

```python
env = SpreadsheetEnv(base_url="https://huzzle-labs-spreadsheet.hf.space")
```
## What Is This Gym?

The Spreadsheet gym gives an LLM agent a real `.xlsx` workbook and a task description. The agent must use MCP tools to read sheets, understand the structure, write values or formulas, and submit the workbook for automated evaluation against hidden test checks. Every edit is tracked, and the agent must stay within target regions and step budgets.

Unlike typical code-generation or QA benchmarks, this gym requires:

- Structured state tracking – understanding multi-sheet workbook layouts with varying column structures
- Cross-sheet reasoning – performing lookups, aggregations, and reconciliations across sheets
- Exact edit correctness – writing precise formulas and values that pass deterministic hidden tests
- Strategic tool use – using `validate_partial` to check progress before committing with `submit_workbook`
## Task Families (12 Scenarios)

### Formula Repair (2 scenarios)

Fix broken formulas in multi-department workbooks. Diagnose incorrect references, cascading errors, and wrong aggregation functions.

| Scenario | Description | Max Steps |
|---|---|---|
| `formula_repair_01` | Fix broken formulas in a multi-department budget workbook | 50 |
| `formula_repair_02` | Fix cascading formula errors in a 5-year financial projection | 50 |

### Cross-Sheet Lookup (2 scenarios)

Aggregate data across multiple sheets using lookups and cross-references.

| Scenario | Description | Max Steps |
|---|---|---|
| `cross_sheet_lookup_01` | Aggregate product revenue by region/category across quarterly sheets | 50 |
| `cross_sheet_lookup_02` | Calculate employee bonuses by cross-referencing Employees and Bonus_Tiers | 50 |

### Conditional Aggregation (2 scenarios)

Apply tiered calculations with conditional logic and priority-based allocation.

| Scenario | Description | Max Steps |
|---|---|---|
| `conditional_aggregation_01` | Calculate tiered sales commissions for 15 salespeople | 50 |
| `conditional_aggregation_02` | Allocate a fixed budget across 20 requests with priority-based rates | 50 |

### Ledger Reconciliation (2 scenarios)

Match and reconcile transactions across bank statements and internal ledgers.

| Scenario | Description | Max Steps |
|---|---|---|
| `ledger_reconciliation_01` | Reconcile bank statement against internal ledger – find mismatches | 50 |
| `ledger_reconciliation_02` | Reconcile USD and EUR transaction sheets into a unified summary | 50 |

### Messy Table Extraction (1 scenario)

Extract and clean data from poorly formatted raw exports.

| Scenario | Description | Max Steps |
|---|---|---|
| `messy_table_extraction_01` | Extract/clean invoice data from messy export with mixed formats | 50 |

### Range Transformation (1 scenario)

Reshape and pivot data between long-format and wide-format layouts.

| Scenario | Description | Max Steps |
|---|---|---|
| `range_transformation_01` | Pivot long-format employee metrics into wide-format table | 50 |

### Schedule Grid Fill (1 scenario)

Fill structured grids respecting constraints and rules.

| Scenario | Description | Max Steps |
|---|---|---|
| `schedule_grid_fill_01` | Fill employee schedule grid for 12 employees × 7 days | 50 |

### Buggy Template Fix (1 scenario)

Debug template workbooks with multiple interacting formula errors.

| Scenario | Description | Max Steps |
|---|---|---|
| `buggy_template_fix_01` | Debug quarterly financial report template with broken Annual_Summary | 50 |
## Architecture

```
┌──────────────────────────────────────────┐
│          OpenEnv Server (:8000)          │
│  ┌────────────┐   ┌──────────────────┐   │
│  │  FastMCP   │◄─►│  SpreadsheetEnv  │   │
│  │ (13 tools) │   │ (MCPEnvironment) │   │
│  └────────────┘   └────────┬─────────┘   │
│                            │             │
│        ┌────────────┬──────┴──────────┐  │
│        │  Workbook  │     Scenario    │  │
│        │  Engine    │     Loader      │  │
│        │ (openpyxl) │                 │  │
│        └────────────┴─────────────────┘  │
└──────────────────────────────────────────┘
```

All state is in-memory per session – no database, no external APIs. The workbook engine manages `.xlsx` files via openpyxl, tracks edits, and evaluates hidden tests. Formula evaluation uses the `formulas` library.
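For intuition, the kind of openpyxl operations the engine performs look roughly like this (a standalone sketch, not the engine's actual code):

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Create a workbook with two value cells and one formula cell.
wb = Workbook()
ws = wb.active
ws.title = "Summary"
ws["B2"], ws["B3"] = 10, 32
ws["C5"] = "=SUM(B2:B3)"  # strings starting with "=" are stored as formulas

path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
wb.save(path)

# By default, load_workbook returns formulas as raw strings, which is
# why read_range shows formulas rather than computed values.
wb2 = load_workbook(path)
print(wb2["Summary"]["C5"].value)  # =SUM(B2:B3)
```

Computing the formula's result requires a separate evaluation step, which is where a library like `formulas` comes in.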
## MCP Tools (13 total)

### Session Management (4 tools)

| Tool | Description |
|---|---|
| `get_session_info` | Get current session metadata (scenario, step count, edit count, solved) |
| `list_scenarios` | List all available scenarios with description and max steps |
| `load_scenario` | Load a scenario and its workbook by ID |
| `reset_scenario` | Restore workbook to original state (scenario stays loaded) |

### Reading (4 tools)

| Tool | Description |
|---|---|
| `list_sheets` | List all sheets with names, dimensions, visibility |
| `read_range` | Read cells from a sheet in A1 notation (formulas as strings) |
| `inspect_formula` | Get raw formula string from a specific cell |
| `list_named_targets` | Show allowed output zones for the scenario |

### Writing (2 tools)

| Tool | Description |
|---|---|
| `write_cell` | Write a value or formula to a single cell |
| `write_range` | Write a 2D block of values starting from a cell |

### Validation & Submission (2 tools)

| Tool | Description |
|---|---|
| `validate_partial` | Check progress against hidden tests without revealing answers |
| `submit_workbook` | Submit for final evaluation (pass rate + per-check results) |

### History (1 tool)

| Tool | Description |
|---|---|
| `get_edit_history` | Full edit log with sheet, cell, value, step number |
## Reward System

This gym ships with two reward modes, selectable via `--reward-mode`:

### Custom Rewards – Episode-Level (`rewards/checks.py`)

The `SpreadsheetChecker` verifies ground truth from the episode trajectory and computes a weighted score:

| Component | Weight | Description |
|---|---|---|
| `quality` | 0.25 | F1 of expected vs. used tools + success rate |
| `efficiency` | 0.15 | `1.0 - (actual_steps / max_steps)` – fewer steps = higher |
| `ground_truth` | 0.60 | Hidden test pass rate from `submit_workbook` |
| `penalty` | variable | -0.5 (all calls succeed but 0% GT) or -0.2 (<30% GT) |

```
total = 0.25 × quality + 0.15 × efficiency + 0.60 × ground_truth + penalty
```

```python
from rewards.checks import SpreadsheetChecker

checker = SpreadsheetChecker()
checker.set_episode(episode)
reward = checker.compute_episode_reward()
# {'quality': 0.72, 'efficiency': 0.65, 'ground_truth': 0.80, ..., 'total': 0.68}
```

The base `RewardCalculator` (`rewards/base.py`) wraps this into the standard 3-component formula used across all gyms.
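The weighted formula can be sketched as plain functions (illustrative only; the real logic lives in `SpreadsheetChecker` and `RewardCalculator`):

```python
def efficiency_score(actual_steps: int, max_steps: int) -> float:
    """Fewer steps => higher score, per the efficiency row above."""
    return 1.0 - (actual_steps / max_steps)

def episode_reward(quality: float, efficiency: float,
                   ground_truth: float, penalty: float = 0.0) -> float:
    """Combine components with the weights from the table above."""
    return 0.25 * quality + 0.15 * efficiency + 0.60 * ground_truth + penalty

# e.g. 20 of 50 steps used, 80% of hidden tests passing:
total = episode_reward(quality=0.72,
                       efficiency=efficiency_score(20, 50),
                       ground_truth=0.80)
print(round(total, 4))  # 0.75
```

The ground-truth term dominates at 0.60, so a correct final workbook matters far more than tool-use style or step count.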
### OpenEnv Transforms – Per-Step (`rewards/transforms.py`)

The `SpreadsheetStepTransform` provides fine-grained per-step rewards for RL training (GRPO). Each tool call receives a reward based on its outcome:

| Tool | Success | Failure |
|---|---|---|
| `read_range` / `list_sheets` | 0.0 (neutral) | 0.0 |
| `inspect_formula` | +0.05 | 0.0 |
| `validate_partial` (improved) | +0.10 | +0.05 |
| `write_cell` / `write_range` (in target, after read) | +0.10 | -0.10 |
| `write_cell` / `write_range` (out of target) | -0.10 | -0.10 |
| `submit_workbook` (100% pass) | +0.50 | n/a |
| `submit_workbook` (>50% pass) | +0.20 | n/a |
| `submit_workbook` (<30% pass) | -0.10 | n/a |

```python
from rewards.transforms import SpreadsheetStepTransform

transform = SpreadsheetStepTransform()
scored_obs = transform(observation)
print(scored_obs.reward)  # e.g., +0.10 for a write in target after reading
```

The `OpenEnvRewardCalculator` (`rewards/base.py`) combines per-step rewards with ground truth into the same weighted formula, using sign-based quality scoring.
## Evaluation

The included `run_eval.py` runs an LLM agent against scenarios and scores results.

### Quick Start

```bash
cd spreadsheet
pip install -e .

# Build and run the environment
docker build -t openenv-spreadsheet -f Dockerfile .
docker run -d --name spreadsheet -p 8000:8000 openenv-spreadsheet

# Verify
curl http://localhost:8000/health

# Evaluate (single model, custom rewards)
python run_eval.py --model gpt-5.4 --save --trajectory

# Evaluate (multiple models, per-step rewards)
python run_eval.py --model gpt-5.4,claude-sonnet-4-6,claude-opus-4-6 \
  --parallel 3 --reward-mode openenv --save --trajectory

# Evaluate a specific scenario
python run_eval.py --model gpt-5.4 --scenario formula_repair_01

# Cleanup
docker stop spreadsheet && docker rm spreadsheet
```
### Output Paths

| Output | Path |
|---|---|
| Results markdown | `outputs/results/<run_id>.md` |
| Trajectory JSON | `outputs/trajectories/<run_id>/<model>.json` |

Results files append per-model sections, so you can accumulate multiple model runs in one file.
### CLI Arguments

| Argument | Default | Description |
|---|---|---|
| `--model` | `gpt-4o` | LiteLLM model string (comma-separated for parallel) |
| `--scenario` | all | Run a specific scenario by ID |
| `--reward-mode` | `custom` | `custom` (episode-level) or `openenv` (per-step) |
| `--parallel` | `1` | Number of models to run in parallel |
| `--save` | off | Save results markdown |
| `--trajectory` | off | Save trajectory JSON |
| `--temperature` | `0.0` | LLM sampling temperature |
| `--max-tokens` | `1024` | Max tokens per LLM response |
| `--run-id` | auto | Run identifier for grouping outputs |
| `--verbose` | off | Enable debug logging |
## Project Structure

```
spreadsheet/
├── __init__.py                      # Package exports (env + rewards)
├── client.py                        # OpenEnv client integration
├── models.py                        # Action/Observation data models
├── openenv.yaml                     # OpenEnv AutoEnv manifest
├── pyproject.toml                   # Dependencies (openenv-core v0.2.3)
├── Dockerfile                       # Root Dockerfile for HF Spaces
├── .dockerignore
├── run_eval.py                      # LLM evaluation runner
│
├── rewards/                         # Reward system (both modes)
│   ├── __init__.py
│   ├── base.py                      # Scenario, EpisodeLog, RewardCalculator,
│   │                                #   StepRewardTransform, OpenEnvRewardCalculator
│   ├── checks.py                    # SpreadsheetChecker (episode-level)
│   └── transforms.py                # SpreadsheetStepTransform (per-step)
│
├── scenarios/                       # Scenario definitions + JSON configs
│   ├── __init__.py
│   ├── definitions.py               # 12 Scenario objects (Python)
│   └── *.json                       # Scenario board configs
│
├── agent/                           # LLM agent runner
│   ├── __init__.py
│   ├── llm.py                       # LiteLLM wrapper
│   └── runner.py                    # AgentRunner (gym-agnostic)
│
├── server/                          # OpenEnv environment server
│   ├── __init__.py
│   ├── app.py                       # FastAPI + FastMCP server
│   ├── spreadsheet_environment.py   # MCPEnvironment implementation
│   ├── workbook_engine.py           # Workbook engine (openpyxl)
│   ├── formula_utils.py             # Formula evaluation
│   ├── scenario_loader.py           # Scenario JSON loader
│   └── Dockerfile                   # Server-only Dockerfile
│
├── workbooks/                       # Workbook files
│   ├── templates/                   # Base workbook templates
│   ├── fixtures/                    # Test fixture workbooks
│   └── hidden_tests/                # Hidden test check definitions
│
└── outputs/                         # Evaluation outputs (gitignored)
    ├── results/                     # Markdown result files
    └── trajectories/                # JSON trajectory files
```
## Configuration (`.env`)

Copy `.env.example` to `.env` and fill in your API keys:

```bash
cp .env.example .env
# Edit .env with your API keys
```
### LLM API Keys

| Variable | Required For | Description |
|---|---|---|
| `OPENAI_API_KEY` | `gpt-4o`, `gpt-5.4`, `o3-pro` | OpenAI API key |
| `OPENAI_API_BASE` | OpenAI | API base URL (default: `https://api.openai.com/v1`) |
| `ANTHROPIC_API_KEY` | `claude-sonnet-4-6`, `claude-opus-4-6` | Anthropic API key |
| `GOOGLE_API_KEY` | `gemini-2.5-pro` | Google AI API key |

Only the key for your chosen `--model` provider is required. For local models via Ollama, no key is needed.
### LLM Defaults

| Variable | Default | Description |
|---|---|---|
| `LLM_MODEL` | `gpt-4o` | Default model when `--model` is not specified |
| `LLM_TEMPERATURE` | `0.0` | Default sampling temperature |
| `LLM_MAX_TOKENS` | `1024` | Default max tokens per response |
### Environment Server

| Variable | Default | Description |
|---|---|---|
| `OPENENV_PORT` | `8000` | OpenEnv server port (exposed) |
| `MAX_CONCURRENT_ENVS` | `8` | Max parallel evaluation sessions |
| `ENABLE_WEB_INTERFACE` | `true` | Enable HF Spaces web UI |
| `WORKBOOKS_DIR` | `workbooks` | Workbook files directory |
| `SCENARIOS_DIR` | `scenarios` | Scenario JSON directory |
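Putting the tables together, a minimal `.env` might look like this (values are placeholders; set only the key your chosen provider needs):

```
# Provider key for your chosen --model (only one is required)
OPENAI_API_KEY=...

# LLM defaults
LLM_MODEL=gpt-4o
LLM_TEMPERATURE=0.0
LLM_MAX_TOKENS=1024

# Environment server
OPENENV_PORT=8000
MAX_CONCURRENT_ENVS=8
ENABLE_WEB_INTERFACE=true
```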
## Concurrent Sessions

Each evaluation session gets its own isolated workbook engine instance, so multiple agents can evaluate simultaneously against the same Docker container without interference.