---
title: "Spreadsheet Environment"
emoji: 📊
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
license: mit
app_port: 8000
base_path: /web
tags:
- openenv
- openenv-0.2.3
- rl-environment
---
# Spreadsheet Gym
**Exact workbook manipulation and reasoning over realistic spreadsheet tasks.**
An OpenEnv RL environment where agents must read, understand, and edit real `.xlsx` workbooks to solve structured tasks: formula repair, cross-sheet lookups, ledger reconciliation, messy data extraction, and more. Designed to stress structured state tracking, cross-sheet reasoning, non-standard table layouts, and exact edit correctness, all areas where frontier LLMs consistently struggle.
## Playground Quick Start
Use the **Playground** panel (right side) to interact with the environment. Type a **Tool Name** and **Arguments Json**, then click **Step**.
### Typical workflow
1. Click **Reset** to start a fresh session
2. Enter `list_tools` (args: `{}`) → discover all available tools and their parameters
3. Enter `list_scenarios` (args: `{}`) → see all 12 scenarios
4. Enter `load_scenario` (args: `{"scenario_id": "formula_repair_01"}`) → start a task
5. Enter `list_sheets` (args: `{}`) → see all sheets in the workbook
6. Enter `read_range` (args: `{"sheet": "Summary", "range": "A1:F10"}`) → read cell values
7. Enter `inspect_formula` (args: `{"sheet": "Summary", "cell": "C5"}`) → see the raw formula
8. Enter `write_cell` (args: `{"sheet": "Summary", "cell": "C5", "value": "=SUM(B2:B10)"}`) → fix a formula
9. Enter `validate_partial` (args: `{}`) → check how many hidden tests pass so far
10. Enter `submit_workbook` (args: `{}`) → submit for final evaluation (ends the task)
### All tool commands (copy-paste ready)
#### Discovery & session tools
| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `list_tools` | `{}` | List every available tool with its parameters and types |
| `get_session_info` | `{}` | Current session ID, loaded scenario, step count, edit count, solve status |
| `list_scenarios` | `{}` | List all 12 scenarios with description, workbook name, and max steps |
| `load_scenario` | `{"scenario_id": "formula_repair_01"}` | Load a scenario and its workbook to begin working |
| `reset_scenario` | `{}` | Restore workbook to original state, keeping the scenario loaded |
#### Reading tools
| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `list_sheets` | `{}` | List all sheets with names, dimensions, and visibility |
| `read_range` | `{"sheet": "Summary", "range": "B2:D10"}` | Read a rectangular range of cells (formulas shown as strings) |
| `inspect_formula` | `{"sheet": "Summary", "cell": "C15"}` | Return the raw formula string from a cell |
| `list_named_targets` | `{}` | Show target areas and allowed output zones for the scenario |
#### Writing tools
| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `write_cell` | `{"sheet": "Summary", "cell": "C15", "value": "=SUM(B2:B10)"}` | Write a value or formula to a single cell |
| `write_range` | `{"sheet": "Summary", "start_cell": "A1", "data": "[[1, 2], [3, 4]]"}` | Write a 2D block of values starting from a cell |
> **Note:** `write_range` takes `start_cell` (not `cell`). The `data` argument is a JSON string of a 2D array.
#### Validation & submission tools
| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `validate_partial` | `{}` | Check partial progress: how many hidden tests pass/fail (no answers revealed) |
| `submit_workbook` | `{}` | Submit for final evaluation: returns pass rate and per-check results |
#### History tools
| Tool Name | Arguments Json | Description |
|-----------|---------------|-------------|
| `get_edit_history` | `{}` | Full list of cell edits: sheet, cell, value, step number |
### Important notes
- All parameters are required strings — no tool has optional arguments
- `write_cell` values starting with `=` are treated as formulas (e.g. `"=VLOOKUP(A2,Sheet2!A:B,2,FALSE)"`)
- `write_range` data must be a JSON string: `"[[1, 2], [3, 4]]"` not `[[1, 2], [3, 4]]`
- Writing outside target regions incurs a reward penalty
- Use `validate_partial` before `submit_workbook` to check progress without ending the task
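As a sketch of the double encoding the notes above describe, the `write_range` Arguments Json can be built with Python's `json` module:

```python
import json

# The 2D block is serialized twice: once into the `data` argument
# (which must itself be a JSON string), and once more when the whole
# arguments object becomes the Arguments Json field.
data_block = [[1, 2], [3, 4]]
arguments = {
    "sheet": "Summary",
    "start_cell": "A1",
    "data": json.dumps(data_block),  # the string "[[1, 2], [3, 4]]"
}
arguments_json = json.dumps(arguments)
```

Passing the raw array instead of the serialized string is the most common mistake with this tool.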
### Run locally
```bash
cd spreadsheet
pip install -e .
# Start the environment server
docker build -t openenv-spreadsheet -f Dockerfile .
docker run -d --name spreadsheet -p 8000:8000 openenv-spreadsheet
# Verify it's running
curl http://localhost:8000/health
# Open the playground in your browser
open http://localhost:8000/web/
```
## Hugging Face Space Deployment
This Space is built from the OpenEnv environment `spreadsheet`.
- **Space URL**: `https://huggingface.co/spaces/huzzle-labs/spreadsheet`
- **OpenEnv pinned ref**: `0.2.3`
- **Hub tag**: `openenv`
### Connecting from Code
Connect using the `SpreadsheetEnv` client:
```python
from spreadsheet import SpreadsheetAction, SpreadsheetEnv

with SpreadsheetEnv.from_env("huzzle-labs/spreadsheet") as env:
    obs = env.reset()
    obs = await env.step(SpreadsheetAction(
        tool_name="list_scenarios",
        arguments_json="{}"
    ))
    obs = await env.step(SpreadsheetAction(
        tool_name="load_scenario",
        arguments_json='{"scenario_id": "formula_repair_01"}'
    ))
    obs = await env.step(SpreadsheetAction(
        tool_name="read_range",
        arguments_json='{"sheet": "Summary", "range": "A1:F10"}'
    ))
```
Or connect directly to a running server:
```python
env = SpreadsheetEnv(base_url="https://huzzle-labs-spreadsheet.hf.space")
```
## What Is This Gym?
The Spreadsheet gym gives an LLM agent a real `.xlsx` workbook and a task description. The agent must use MCP tools to read sheets, understand the structure, write values or formulas, and submit the workbook for automated evaluation against hidden test checks. Every edit is tracked, and the agent must stay within target regions and step budgets.
Unlike typical code-generation or QA benchmarks, this gym requires:
- **Structured state tracking:** understanding multi-sheet workbook layouts with varying column structures
- **Cross-sheet reasoning:** performing lookups, aggregations, and reconciliations across sheets
- **Exact edit correctness:** writing precise formulas and values that pass deterministic hidden tests
- **Strategic tool use:** using `validate_partial` to check progress before committing with `submit_workbook`
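The validate-before-submit pattern can be captured in a tiny decision helper (an illustrative sketch; the function name and thresholds are assumptions, not part of the gym):

```python
def should_submit(passed: int, total: int, steps_left: int) -> bool:
    """Decide whether to call submit_workbook now.

    Illustrative policy only: submit when every hidden test passes,
    or when the step budget is about to run out (submitting a partial
    solution beats ending the episode with no submission at all).
    """
    if total > 0 and passed == total:
        return True
    return steps_left <= 1
```

An agent would call `validate_partial`, feed its pass/fail counts into a check like this, and otherwise keep editing.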
## Task Families (12 Scenarios)
### Formula Repair (2 scenarios)
Fix broken formulas in multi-department workbooks. Diagnose incorrect references, cascading errors, and wrong aggregation functions.
| Scenario | Description | Max Steps |
|---|---|---|
| `formula_repair_01` | Fix broken formulas in a multi-department budget workbook | 50 |
| `formula_repair_02` | Fix cascading formula errors in a 5-year financial projection | 50 |
### Cross-Sheet Lookup (2 scenarios)
Aggregate data across multiple sheets using lookups and cross-references.
| Scenario | Description | Max Steps |
|---|---|---|
| `cross_sheet_lookup_01` | Aggregate product revenue by region/category across quarterly sheets | 50 |
| `cross_sheet_lookup_02` | Calculate employee bonuses by cross-referencing Employees and Bonus_Tiers | 50 |
### Conditional Aggregation (2 scenarios)
Apply tiered calculations with conditional logic and priority-based allocation.
| Scenario | Description | Max Steps |
|---|---|---|
| `conditional_aggregation_01` | Calculate tiered sales commissions for 15 salespeople | 50 |
| `conditional_aggregation_02` | Allocate a fixed budget across 20 requests with priority-based rates | 50 |
### Ledger Reconciliation (2 scenarios)
Match and reconcile transactions across bank statements and internal ledgers.
| Scenario | Description | Max Steps |
|---|---|---|
| `ledger_reconciliation_01` | Reconcile bank statement against internal ledger and find mismatches | 50 |
| `ledger_reconciliation_02` | Reconcile USD and EUR transaction sheets into a unified summary | 50 |
### Messy Table Extraction (1 scenario)
Extract and clean data from poorly formatted raw exports.
| Scenario | Description | Max Steps |
|---|---|---|
| `messy_table_extraction_01` | Extract/clean invoice data from messy export with mixed formats | 50 |
### Range Transformation (1 scenario)
Reshape and pivot data between long-format and wide-format layouts.
| Scenario | Description | Max Steps |
|---|---|---|
| `range_transformation_01` | Pivot long-format employee metrics into wide-format table | 50 |
### Schedule Grid Fill (1 scenario)
Fill structured grids respecting constraints and rules.
| Scenario | Description | Max Steps |
|---|---|---|
| `schedule_grid_fill_01` | Fill employee schedule grid for 12 employees × 7 days | 50 |
### Buggy Template Fix (1 scenario)
Debug template workbooks with multiple interacting formula errors.
| Scenario | Description | Max Steps |
|---|---|---|
| `buggy_template_fix_01` | Debug quarterly financial report template with broken Annual_Summary | 50 |
## Architecture
```
┌────────────────────────────────────────────┐
│           OpenEnv Server (:8000)           │
│  ┌────────────┐   ┌───────────────────┐    │
│  │  FastMCP   │───│  SpreadsheetEnv   │    │
│  │ (13 tools) │   │ (MCPEnvironment)  │    │
│  └────────────┘   └─────────┬─────────┘    │
│                             │              │
│             ┌───────────────┼──────────┐   │
│             │  Workbook     │ Scenario │   │
│             │  Engine       │ Loader   │   │
│             │  (openpyxl)   │          │   │
│             └───────────────┴──────────┘   │
└────────────────────────────────────────────┘
```
All state is in-memory per session. No database, no external APIs. The workbook engine manages `.xlsx` files via openpyxl, tracks edits, and evaluates hidden tests. Formula evaluation uses the `formulas` library.
## MCP Tools (13 total)
### Session Management (4 tools)
| Tool | Description |
|------|-------------|
| `get_session_info` | Get current session metadata (scenario, step count, edit count, solved) |
| `list_scenarios` | List all available scenarios with description and max steps |
| `load_scenario` | Load a scenario and its workbook by ID |
| `reset_scenario` | Restore workbook to original state (scenario stays loaded) |
### Reading (4 tools)
| Tool | Description |
|------|-------------|
| `list_sheets` | List all sheets with names, dimensions, visibility |
| `read_range` | Read cells from a sheet in A1 notation (formulas as strings) |
| `inspect_formula` | Get raw formula string from a specific cell |
| `list_named_targets` | Show allowed output zones for the scenario |
### Writing (2 tools)
| Tool | Description |
|------|-------------|
| `write_cell` | Write a value or formula to a single cell |
| `write_range` | Write a 2D block of values starting from a cell |
### Validation & Submission (2 tools)
| Tool | Description |
|------|-------------|
| `validate_partial` | Check progress against hidden tests without revealing answers |
| `submit_workbook` | Submit for final evaluation (pass rate + per-check results) |
### History (1 tool)
| Tool | Description |
|------|-------------|
| `get_edit_history` | Full edit log with sheet, cell, value, step number |
## Reward System
This gym ships with **two** reward modes, selectable via `--reward-mode`:
### Custom Rewards: Episode-Level (`rewards/checks.py`)
The `SpreadsheetChecker` verifies ground truth from the episode trajectory and computes a weighted score:
| Component | Weight | Description |
|---|---|---|
| `quality` | 0.25 | F1 of expected vs used tools + success rate |
| `efficiency` | 0.15 | `1.0 - (actual_steps / max_steps)`: fewer steps yield a higher score |
| `ground_truth` | 0.60 | Hidden test pass rate from `submit_workbook` |
| `penalty` | variable | -0.5 (all calls succeed but 0% GT) or -0.2 (<30% GT) |
```
total = 0.25 × quality + 0.15 × efficiency + 0.60 × ground_truth + penalty
```
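A minimal sketch of that formula, with the penalty rules from the component table (illustrative only, not the shipped `SpreadsheetChecker`):

```python
def episode_reward(quality: float, efficiency: float,
                   ground_truth: float, all_calls_succeeded: bool) -> float:
    """Weighted episode score per the formula above (illustrative sketch)."""
    penalty = 0.0
    if all_calls_succeeded and ground_truth == 0.0:
        penalty = -0.5  # every tool call succeeded, yet no hidden test passed
    elif ground_truth < 0.30:
        penalty = -0.2  # low ground-truth pass rate
    return 0.25 * quality + 0.15 * efficiency + 0.60 * ground_truth + penalty
```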
```python
from rewards.checks import SpreadsheetChecker
checker = SpreadsheetChecker()
checker.set_episode(episode)
reward = checker.compute_episode_reward()
# {'quality': 0.72, 'efficiency': 0.65, 'ground_truth': 0.80, ..., 'total': 0.68}
```
The base `RewardCalculator` (`rewards/base.py`) wraps this into the standard 3-component formula used across all gyms.
### OpenEnv Transforms: Per-Step (`rewards/transforms.py`)
The `SpreadsheetStepTransform` provides fine-grained per-step rewards for RL training (GRPO). Each tool call receives a reward based on its outcome:
| Tool | Success | Failure |
|---|---|---|
| `read_range` / `list_sheets` | 0.0 (neutral) | 0.0 |
| `inspect_formula` | +0.05 | 0.0 |
| `validate_partial` (improved) | +0.10 | +0.05 |
| `write_cell` / `write_range` (in target, after read) | +0.10 | -0.10 |
| `write_cell` / `write_range` (out of target) | -0.10 | -0.10 |
| `submit_workbook` (100% pass) | +0.50 | n/a |
| `submit_workbook` (>50% pass) | +0.20 | n/a |
| `submit_workbook` (<30% pass) | -0.10 | n/a |
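The table above can be read as a reward function. The sketch below is illustrative, not the actual `SpreadsheetStepTransform`; in particular the 30-50% submit band is not covered by the table, so the neutral value there is an assumption:

```python
def step_reward(tool: str, ok: bool, *, in_target: bool = True,
                after_read: bool = True, improved: bool = False,
                pass_rate: float = 0.0) -> float:
    """Per-step reward table above as a function (illustrative sketch)."""
    if tool in ("read_range", "list_sheets"):
        return 0.0  # reading is neutral either way
    if tool == "inspect_formula":
        return 0.05 if ok else 0.0
    if tool == "validate_partial":
        return 0.10 if improved else 0.05
    if tool in ("write_cell", "write_range"):
        if not in_target:
            return -0.10  # out-of-target writes are always penalized
        return 0.10 if (ok and after_read) else -0.10
    if tool == "submit_workbook":
        if pass_rate >= 1.0:
            return 0.50
        if pass_rate > 0.5:
            return 0.20
        if pass_rate < 0.3:
            return -0.10
        return 0.0  # 30-50% band not in the table; neutral is an assumption
    return 0.0
```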
```python
from rewards.transforms import SpreadsheetStepTransform
transform = SpreadsheetStepTransform()
scored_obs = transform(observation)
print(scored_obs.reward) # e.g., +0.10 for a write in target after reading
```
The `OpenEnvRewardCalculator` (`rewards/base.py`) combines per-step rewards with ground truth into the same weighted formula, using sign-based quality scoring.
## Evaluation
The included `run_eval.py` runs an LLM agent against scenarios and scores results.
### Quick Start
```bash
cd spreadsheet
pip install -e .
# Build and run the environment
docker build -t openenv-spreadsheet -f Dockerfile .
docker run -d --name spreadsheet -p 8000:8000 openenv-spreadsheet
# Verify
curl http://localhost:8000/health
# Evaluate (single model, custom rewards)
python run_eval.py --model gpt-5.4 --save --trajectory
# Evaluate (multiple models, per-step rewards)
python run_eval.py --model gpt-5.4,claude-sonnet-4-6,claude-opus-4-6 \
--parallel 3 --reward-mode openenv --save --trajectory
# Evaluate a specific scenario
python run_eval.py --model gpt-5.4 --scenario formula_repair_01
# Cleanup
docker stop spreadsheet && docker rm spreadsheet
```
### Output Paths
| Output | Path |
|---|---|
| Results markdown | `outputs/results/<run_id>.md` |
| Trajectory JSON | `outputs/trajectories/<run_id>/<model>.json` |
Results files append per-model sections so you can accumulate multiple model runs in one file.
### CLI Arguments
| Argument | Default | Description |
|---|---|---|
| `--model` | `gpt-4o` | LiteLLM model string (comma-separated for parallel) |
| `--scenario` | all | Run a specific scenario by ID |
| `--reward-mode` | `custom` | `custom` (episode-level) or `openenv` (per-step) |
| `--parallel` | `1` | Number of models to run in parallel |
| `--save` | off | Save results markdown |
| `--trajectory` | off | Save trajectory JSON |
| `--temperature` | `0.0` | LLM sampling temperature |
| `--max-tokens` | `1024` | Max tokens per LLM response |
| `--run-id` | auto | Run identifier for grouping outputs |
| `--verbose` | off | Enable debug logging |
## Project Structure
```
spreadsheet/
├── __init__.py                 # Package exports (env + rewards)
├── client.py                   # OpenEnv client integration
├── models.py                   # Action/Observation data models
├── openenv.yaml                # OpenEnv AutoEnv manifest
├── pyproject.toml              # Dependencies (openenv-core v0.2.3)
├── Dockerfile                  # Root Dockerfile for HF Spaces
├── .dockerignore
├── run_eval.py                 # LLM evaluation runner
│
├── rewards/                    # Reward system (both modes)
│   ├── __init__.py
│   ├── base.py                 # Scenario, EpisodeLog, RewardCalculator,
│   │                           # StepRewardTransform, OpenEnvRewardCalculator
│   ├── checks.py               # SpreadsheetChecker (episode-level)
│   └── transforms.py           # SpreadsheetStepTransform (per-step)
│
├── scenarios/                  # Scenario definitions + JSON configs
│   ├── __init__.py
│   ├── definitions.py          # 12 Scenario objects (Python)
│   └── *.json                  # Scenario board configs
│
├── agent/                      # LLM agent runner
│   ├── __init__.py
│   ├── llm.py                  # LiteLLM wrapper
│   └── runner.py               # AgentRunner (gym-agnostic)
│
├── server/                     # OpenEnv environment server
│   ├── __init__.py
│   ├── app.py                  # FastAPI + FastMCP server
│   ├── spreadsheet_environment.py  # MCPEnvironment implementation
│   ├── workbook_engine.py      # Workbook engine (openpyxl)
│   ├── formula_utils.py        # Formula evaluation
│   ├── scenario_loader.py      # Scenario JSON loader
│   └── Dockerfile              # Server-only Dockerfile
│
├── workbooks/                  # Workbook files
│   ├── templates/              # Base workbook templates
│   ├── fixtures/               # Test fixture workbooks
│   └── hidden_tests/           # Hidden test check definitions
│
└── outputs/                    # Evaluation outputs (gitignored)
    ├── results/                # Markdown result files
    └── trajectories/           # JSON trajectory files
```
## Configuration (.env)
Copy `.env.example` to `.env` and fill in your API keys:
```bash
cp .env.example .env
# Edit .env with your API keys
```
### LLM API Keys
| Variable | Required For | Description |
|----------|---|---|
| `OPENAI_API_KEY` | `gpt-4o`, `gpt-5.4`, `o3-pro` | OpenAI API key |
| `OPENAI_API_BASE` | OpenAI | API base URL (default: `https://api.openai.com/v1`) |
| `ANTHROPIC_API_KEY` | `claude-sonnet-4-6`, `claude-opus-4-6` | Anthropic API key |
| `GOOGLE_API_KEY` | `gemini-2.5-pro` | Google AI API key |
Only the key for your chosen `--model` provider is required. For local models via Ollama, no key is needed.
### LLM Defaults
| Variable | Default | Description |
|----------|---------|-------------|
| `LLM_MODEL` | `gpt-4o` | Default model when `--model` is not specified |
| `LLM_TEMPERATURE` | `0.0` | Default sampling temperature |
| `LLM_MAX_TOKENS` | `1024` | Default max tokens per response |
### Environment Server
| Variable | Default | Description |
|----------|---------|-------------|
| `OPENENV_PORT` | `8000` | OpenEnv server port (exposed) |
| `MAX_CONCURRENT_ENVS` | `8` | Max parallel evaluation sessions |
| `ENABLE_WEB_INTERFACE` | `true` | Enable HF Spaces web UI |
| `WORKBOOKS_DIR` | `workbooks` | Workbook files directory |
| `SCENARIOS_DIR` | `scenarios` | Scenario JSON directory |
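Putting these variables together, a `.env` might look like the following (placeholder values; set only the key your chosen `--model` provider needs):

```bash
# LLM provider keys (only one is required)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=...
GOOGLE_API_KEY=...

# LLM defaults
LLM_MODEL=gpt-4o
LLM_TEMPERATURE=0.0
LLM_MAX_TOKENS=1024

# Environment server
OPENENV_PORT=8000
MAX_CONCURRENT_ENVS=8
ENABLE_WEB_INTERFACE=true
WORKBOOKS_DIR=workbooks
SCENARIOS_DIR=scenarios
```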
## Concurrent Sessions
Each evaluation session gets its own isolated workbook engine instance. Multiple agents can evaluate simultaneously against the same Docker container without interference.