kdemon1011 committed on
Commit
fded8f2
·
verified ·
1 Parent(s): 6b4e5a8

Upload folder using huggingface_hub

.dockerignore ADDED
@@ -0,0 +1,37 @@
+ # Secrets — NEVER include in Docker image
+ .env
+ .env.local
+ .env.production
+
+ # Python artifacts
+ __pycache__/
+ *.pyc
+ *.pyo
+ *.egg-info/
+ .venv/
+
+ # Git
+ .git/
+ .gitignore
+
+ # Test / lint caches
+ .pytest_cache/
+ .mypy_cache/
+ .ruff_cache/
+
+ # Evaluation outputs (internal, not part of the env)
+ outputs/
+ trajectories/
+ results/
+ comparison.md
+ *.md.bak
+ generate_scenarios.py
+
+ # IDE
+ .cursor/
+ .vscode/
+ .idea/
+
+ # OS files
+ .DS_Store
+ Thumbs.db
.env.example ADDED
@@ -0,0 +1,28 @@
+ # ── Environment Server Configuration ──
+ OPENENV_PORT=8000
+ MAX_CONCURRENT_ENVS=8
+ ENABLE_WEB_INTERFACE=true
+ WORKBOOKS_DIR=workbooks
+ SCENARIOS_DIR=scenarios
+
+ # ── LLM Configuration (used by run_eval.py) ──
+ LLM_MODEL=gpt-4o
+ LLM_TEMPERATURE=0.0
+ LLM_MAX_TOKENS=1024
+
+ # ── API Keys ──
+ # Only the key for your chosen --model provider is required.
+
+ # OpenAI (for gpt-4o, gpt-5.4, o3-pro, etc.)
+ OPENAI_API_KEY=
+ OPENAI_API_BASE=https://api.openai.com/v1
+
+ # Anthropic (for claude-sonnet-4-6, claude-opus-4-6, etc.)
+ ANTHROPIC_API_KEY=
+
+ # Google (for gemini-2.5-pro, etc.)
+ GOOGLE_API_KEY=
+
+ # For local models via Ollama — no key needed, just run:
+ #   ollama serve && ollama pull llama3
+ # Then use: --model ollama/llama3
Dockerfile CHANGED
@@ -32,6 +32,14 @@ ENV PYTHONPATH="/app/env:$PYTHONPATH"
  ENV ENABLE_WEB_INTERFACE=true
  ENV WORKBOOKS_DIR=/app/env/workbooks
  ENV SCENARIOS_DIR=/app/env/scenarios
+ ENV SPACE_ID=huzzle-labs/spreadsheet
+
+ RUN python -c "\
+ import re, pathlib;\
+ src = pathlib.Path('/app/env/README.md').read_text();\
+ clean = re.sub(r'^---\n.*?\n---\n', '', src, count=1, flags=re.DOTALL);\
+ pathlib.Path('/app/env/.README_web.md').write_text(clean)"
+ ENV ENV_README_PATH=/app/env/.README_web.md

  EXPOSE 8000

README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: "Spreadsheet Environment Server"
  emoji: 📊
  colorFrom: green
  colorTo: blue
@@ -10,74 +10,468 @@ app_port: 8000
  base_path: /web
  tags:
  - openenv
  - rl-environment
  ---

- # Spreadsheet Environment

- Exact workbook manipulation and reasoning over realistic spreadsheet tasks. This gym targets weaknesses in structured state tracking, cross-sheet reasoning, non-standard table layouts, and exact edit correctness.

- ## Quick Start

  ```bash
- cd spreadsheet && docker build -t openenv-spreadsheet -f server/Dockerfile .
  docker run -d --name spreadsheet -p 8000:8000 openenv-spreadsheet
  curl http://localhost:8000/health
  ```

  ```python
- from spreadsheet import SpreadsheetEnv

- with SpreadsheetEnv(base_url="http://localhost:8000") as env:
-     result = env.reset()
-     # Use MCP tools: list_sheets, read_range, write_cell, submit_workbook, etc.
  ```

- ## Project Structure

  ```
- spreadsheet/
- ├── __init__.py
- ├── client.py
- ├── models.py
- ├── openenv.yaml
- ├── pyproject.toml
- ├── README.md
- ├── .env
- ├── .dockerignore
- ├── uv.lock
- ├── server/
- │   ├── __init__.py
- │   ├── app.py
- │   ├── spreadsheet_environment.py
- │   ├── workbook_engine.py
- │   ├── formula_utils.py
- │   └── scenario_loader.py
- ├── workbooks/
- │   ├── templates/
- │   ├── fixtures/
- │   └── hidden_tests/
- ├── scenarios/
- └── server/Dockerfile
  ```

  ## Reward System

- Both reward modes use a unified scoring formula:

  ```
  total = 0.25 × quality + 0.15 × efficiency + 0.60 × ground_truth + penalty
  ```

- - **Quality (0.25)** — Custom mode: F1 of expected vs used tools + success rate. OpenEnv mode: fraction of non-neutral steps that were productive (sign-based).
- - **Efficiency (0.15)** — `1.0 - (actual_steps / max_steps)`. Fewer steps = higher score.
- - **Ground Truth (0.60)** — Outcome checks verified against submit_workbook hidden test results (pass rate of cell/formula checks).
- - **Penalty** — Graduated: -0.5 (all calls succeed, 0% ground truth) or -0.2 (<30% ground truth).

- See [Reward System](../docs/reward-system.md) for full details.

- ## Deployment

  ```bash
- openenv push . --private --repo-id huzzle-labs/spreadsheet
  ```
  ---
+ title: "Spreadsheet Environment"
  emoji: 📊
  colorFrom: green
  colorTo: blue

  base_path: /web
  tags:
  - openenv
+ - openenv-0.2.3
  - rl-environment
  ---

+ # Spreadsheet Gym
+
+ **Exact workbook manipulation and reasoning over realistic spreadsheet tasks.**
+
+ An OpenEnv RL environment where agents must read, understand, and edit real `.xlsx` workbooks to solve structured tasks — formula repair, cross-sheet lookups, ledger reconciliation, messy data extraction, and more. Designed to stress structured state tracking, cross-sheet reasoning, non-standard table layouts, and exact edit correctness — areas where frontier LLMs consistently struggle.
+
+ ## Playground Quick Start
+
+ Use the **Playground** panel (right side) to interact with the environment. Type a **Tool Name** and **Arguments Json**, then click **Step**.
+
+ ### Typical workflow
+
+ 1. Click **Reset** to start a fresh session
+ 2. Enter `list_tools` (args: `{}`) → discover all available tools and their parameters
+ 3. Enter `list_scenarios` (args: `{}`) → see all 12 scenarios
+ 4. Enter `load_scenario` (args: `{"scenario_id": "formula_repair_01"}`) → start a task
+ 5. Enter `list_sheets` (args: `{}`) → see all sheets in the workbook
+ 6. Enter `read_range` (args: `{"sheet": "Summary", "range": "A1:F10"}`) → read cell values
+ 7. Enter `inspect_formula` (args: `{"sheet": "Summary", "cell": "C5"}`) → see the raw formula
+ 8. Enter `write_cell` (args: `{"sheet": "Summary", "cell": "C5", "value": "=SUM(B2:B10)"}`) → fix a formula
+ 9. Enter `validate_partial` (args: `{}`) → check how many hidden tests pass so far
+ 10. Enter `submit_workbook` (args: `{}`) → submit for final evaluation (ends the task)
+
+ ### All tool commands (copy-paste ready)
+
+ #### Discovery & session tools
+
+ | Tool Name | Arguments Json | Description |
+ |-----------|---------------|-------------|
+ | `list_tools` | `{}` | List every available tool with its parameters and types |
+ | `get_session_info` | `{}` | Current session ID, loaded scenario, step count, edit count, solve status |
+ | `list_scenarios` | `{}` | List all 12 scenarios with description, workbook name, and max steps |
+ | `load_scenario` | `{"scenario_id": "formula_repair_01"}` | Load a scenario and its workbook to begin working |
+ | `reset_scenario` | `{}` | Restore workbook to original state, keeping the scenario loaded |
+
+ #### Reading tools
+
+ | Tool Name | Arguments Json | Description |
+ |-----------|---------------|-------------|
+ | `list_sheets` | `{}` | List all sheets with names, dimensions, and visibility |
+ | `read_range` | `{"sheet": "Summary", "range": "B2:D10"}` | Read a rectangular range of cells (formulas shown as strings) |
+ | `inspect_formula` | `{"sheet": "Summary", "cell": "C15"}` | Return the raw formula string from a cell |
+ | `list_named_targets` | `{}` | Show target areas and allowed output zones for the scenario |
+
+ #### Writing tools
+
+ | Tool Name | Arguments Json | Description |
+ |-----------|---------------|-------------|
+ | `write_cell` | `{"sheet": "Summary", "cell": "C15", "value": "=SUM(B2:B10)"}` | Write a value or formula to a single cell |
+ | `write_range` | `{"sheet": "Summary", "start_cell": "A1", "data": "[[1, 2], [3, 4]]"}` | Write a 2D block of values starting from a cell |
+
+ > **Note:** `write_range` takes `start_cell` (not `cell`). The `data` argument is a JSON string of a 2D array.
+
+ #### Validation & submission tools
+
+ | Tool Name | Arguments Json | Description |
+ |-----------|---------------|-------------|
+ | `validate_partial` | `{}` | Check partial progress — how many hidden tests pass/fail (no answers revealed) |
+ | `submit_workbook` | `{}` | Submit for final evaluation — returns pass rate and per-check results |
+
+ #### History tools
+
+ | Tool Name | Arguments Json | Description |
+ |-----------|---------------|-------------|
+ | `get_edit_history` | `{}` | Full list of cell edits: sheet, cell, value, step number |
+
+ ### Important notes
+
+ - All string parameters are required — no optional arguments on any tool
+ - `write_cell` values starting with `=` are treated as formulas (e.g. `"=VLOOKUP(A2,Sheet2!A:B,2,FALSE)"`)
+ - `write_range` data must be a JSON string: `"[[1, 2], [3, 4]]"` not `[[1, 2], [3, 4]]`
+ - Writing outside target regions incurs a reward penalty
+ - Use `validate_partial` before `submit_workbook` to check progress without ending the task
+
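As a concrete illustration of the double serialization the notes above describe: the `data` field of `write_range` must itself be a JSON string inside the outer `Arguments Json`. A minimal sketch with plain `json.dumps` (tool and field names taken from the tables above):

```python
import json

# Build arguments for a write_range call. The "data" field must itself be
# a JSON string of a 2D array, so it is serialized twice overall.
data = [[1, 2], [3, 4]]
arguments = {
    "sheet": "Summary",
    "start_cell": "A1",
    "data": json.dumps(data),  # inner serialization: "[[1, 2], [3, 4]]"
}
arguments_json = json.dumps(arguments)  # outer serialization for the tool call
print(arguments_json)
```

Pasting the printed string into the **Arguments Json** field would therefore carry the inner array as a quoted string, which is what the tool expects.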
+ ### Run locally

  ```bash
+ cd spreadsheet
+ pip install -e .
+
+ # Start the environment server
+ docker build -t openenv-spreadsheet -f Dockerfile .
  docker run -d --name spreadsheet -p 8000:8000 openenv-spreadsheet
+
+ # Verify it's running
  curl http://localhost:8000/health
+
+ # Open the playground in your browser
+ open http://localhost:8000/web/
  ```

+ ## Hugging Face Space Deployment
+
+ This Space is built from the OpenEnv environment `spreadsheet`.
+
+ - **Space URL**: `https://huggingface.co/spaces/huzzle-labs/spreadsheet`
+ - **OpenEnv pinned ref**: `0.2.3`
+ - **Hub tag**: `openenv`
+
+ ### Connecting from Code
+
+ Connect using the `SpreadsheetEnv` client (`env.step` is awaited, so the calls run inside an async function):
+
  ```python
+ import asyncio
+
+ from spreadsheet import SpreadsheetAction, SpreadsheetEnv
+
+ async def main():
+     with SpreadsheetEnv.from_env("huzzle-labs/spreadsheet") as env:
+         obs = env.reset()
+         obs = await env.step(SpreadsheetAction(
+             tool_name="list_scenarios",
+             arguments_json="{}"
+         ))
+         obs = await env.step(SpreadsheetAction(
+             tool_name="load_scenario",
+             arguments_json='{"scenario_id": "formula_repair_01"}'
+         ))
+         obs = await env.step(SpreadsheetAction(
+             tool_name="read_range",
+             arguments_json='{"sheet": "Summary", "range": "A1:F10"}'
+         ))
+
+ asyncio.run(main())
  ```

+ Or connect directly to a running server:

+ ```python
+ env = SpreadsheetEnv(base_url="https://huzzle-labs-spreadsheet.hf.space")
  ```
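The tools in this environment address cells in A1 notation (`"C5"`, `"A1:F10"`). A small helper for converting such references to numeric coordinates can be handy when generating arguments programmatically; this function is illustrative and not part of the `spreadsheet` package:

```python
import re

def parse_cell(ref: str) -> tuple[int, int]:
    """Convert an A1-style cell reference to (row, column), both 1-based."""
    m = re.fullmatch(r"([A-Z]+)(\d+)", ref.upper())
    if not m:
        raise ValueError(f"not an A1 reference: {ref!r}")
    letters, digits = m.groups()
    col = 0
    for ch in letters:  # column letters are base-26 with A=1 ... Z=26
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return int(digits), col

print(parse_cell("C5"))    # (5, 3)
print(parse_cell("AA10"))  # (10, 27)
```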
+ ## What Is This Gym?
+
+ The Spreadsheet gym gives an LLM agent a real `.xlsx` workbook and a task description. The agent must use MCP tools to read sheets, understand the structure, write values or formulas, and submit the workbook for automated evaluation against hidden test checks. Every edit is tracked, and the agent must stay within target regions and step budgets.
+
+ Unlike typical code-generation or QA benchmarks, this gym requires:
+
+ - **Structured state tracking** — understanding multi-sheet workbook layouts with varying column structures
+ - **Cross-sheet reasoning** — performing lookups, aggregations, and reconciliations across sheets
+ - **Exact edit correctness** — writing precise formulas and values that pass deterministic hidden tests
+ - **Strategic tool use** — using `validate_partial` to check progress before committing with `submit_workbook`
+
+ ## Task Families (12 Scenarios)
+
+ ### Formula Repair (2 scenarios)
+ Fix broken formulas in multi-department workbooks. Diagnose incorrect references, cascading errors, and wrong aggregation functions.
+
+ | Scenario | Description | Max Steps |
+ |---|---|---|
+ | `formula_repair_01` | Fix broken formulas in a multi-department budget workbook | 50 |
+ | `formula_repair_02` | Fix cascading formula errors in a 5-year financial projection | 50 |
+
+ ### Cross-Sheet Lookup (2 scenarios)
+ Aggregate data across multiple sheets using lookups and cross-references.
+
+ | Scenario | Description | Max Steps |
+ |---|---|---|
+ | `cross_sheet_lookup_01` | Aggregate product revenue by region/category across quarterly sheets | 50 |
+ | `cross_sheet_lookup_02` | Calculate employee bonuses by cross-referencing Employees and Bonus_Tiers | 50 |
+
+ ### Conditional Aggregation (2 scenarios)
+ Apply tiered calculations with conditional logic and priority-based allocation.
+
+ | Scenario | Description | Max Steps |
+ |---|---|---|
+ | `conditional_aggregation_01` | Calculate tiered sales commissions for 15 salespeople | 50 |
+ | `conditional_aggregation_02` | Allocate a fixed budget across 20 requests with priority-based rates | 50 |
+
+ ### Ledger Reconciliation (2 scenarios)
+ Match and reconcile transactions across bank statements and internal ledgers.
+
+ | Scenario | Description | Max Steps |
+ |---|---|---|
+ | `ledger_reconciliation_01` | Reconcile bank statement against internal ledger — find mismatches | 50 |
+ | `ledger_reconciliation_02` | Reconcile USD and EUR transaction sheets into a unified summary | 50 |
+
+ ### Messy Table Extraction (1 scenario)
+ Extract and clean data from poorly formatted raw exports.
+
+ | Scenario | Description | Max Steps |
+ |---|---|---|
+ | `messy_table_extraction_01` | Extract/clean invoice data from messy export with mixed formats | 50 |
+
+ ### Range Transformation (1 scenario)
+ Reshape and pivot data between long-format and wide-format layouts.
+
+ | Scenario | Description | Max Steps |
+ |---|---|---|
+ | `range_transformation_01` | Pivot long-format employee metrics into wide-format table | 50 |
+
+ ### Schedule Grid Fill (1 scenario)
+ Fill structured grids respecting constraints and rules.
+
+ | Scenario | Description | Max Steps |
+ |---|---|---|
+ | `schedule_grid_fill_01` | Fill employee schedule grid for 12 employees × 7 days | 50 |
+
+ ### Buggy Template Fix (1 scenario)
+ Debug template workbooks with multiple interacting formula errors.
+
+ | Scenario | Description | Max Steps |
+ |---|---|---|
+ | `buggy_template_fix_01` | Debug quarterly financial report template with broken Annual_Summary | 50 |
+
+ ## Architecture
+
+ ```
+ ┌──────────────────────────────────────────┐
+ │          OpenEnv Server (:8000)          │
+ │ ┌────────────┐  ┌───────────────────┐    │
+ │ │  FastMCP   │──│  SpreadsheetEnv   │    │
+ │ │ (13 tools) │  │ (MCPEnvironment)  │    │
+ │ └────────────┘  └────────┬──────────┘    │
+ │                          │               │
+ │        ┌─────────────────┼──────────┐    │
+ │        │ Workbook        │ Scenario │    │
+ │        │ Engine          │ Loader   │    │
+ │        │ (openpyxl)      │          │    │
+ │        └─────────────────┴──────────┘    │
+ └──────────────────────────────────────────┘
  ```

+ All state is in-memory per session. No database, no external APIs. The workbook engine manages `.xlsx` files via openpyxl, tracks edits, and evaluates hidden tests. Formula evaluation uses the `formulas` library.
+
+ ## MCP Tools (13 total)
+
+ ### Session Management (4 tools)
+
+ | Tool | Description |
+ |------|-------------|
+ | `get_session_info` | Get current session metadata (scenario, step count, edit count, solved) |
+ | `list_scenarios` | List all available scenarios with description and max steps |
+ | `load_scenario` | Load a scenario and its workbook by ID |
+ | `reset_scenario` | Restore workbook to original state (scenario stays loaded) |
+
+ ### Reading (4 tools)
+
+ | Tool | Description |
+ |------|-------------|
+ | `list_sheets` | List all sheets with names, dimensions, visibility |
+ | `read_range` | Read cells from a sheet in A1 notation (formulas as strings) |
+ | `inspect_formula` | Get raw formula string from a specific cell |
+ | `list_named_targets` | Show allowed output zones for the scenario |
+
+ ### Writing (2 tools)
+
+ | Tool | Description |
+ |------|-------------|
+ | `write_cell` | Write a value or formula to a single cell |
+ | `write_range` | Write a 2D block of values starting from a cell |
+
+ ### Validation & Submission (2 tools)
+
+ | Tool | Description |
+ |------|-------------|
+ | `validate_partial` | Check progress against hidden tests without revealing answers |
+ | `submit_workbook` | Submit for final evaluation (pass rate + per-check results) |
+
+ ### History (1 tool)
+
+ | Tool | Description |
+ |------|-------------|
+ | `get_edit_history` | Full edit log with sheet, cell, value, step number |
+
  ## Reward System

+ This gym ships with **two** reward modes, selectable via `--reward-mode`:
+
+ ### Custom Rewards — Episode-Level (`rewards/checks.py`)
+
+ The `SpreadsheetChecker` verifies ground truth from the episode trajectory and computes a weighted score:
+
+ | Component | Weight | Description |
+ |---|---|---|
+ | `quality` | 0.25 | F1 of expected vs used tools + success rate |
+ | `efficiency` | 0.15 | `1.0 - (actual_steps / max_steps)` — fewer steps = higher |
+ | `ground_truth` | 0.60 | Hidden test pass rate from `submit_workbook` |
+ | `penalty` | variable | -0.5 (all calls succeed but 0% GT) or -0.2 (<30% GT) |

  ```
  total = 0.25 × quality + 0.15 × efficiency + 0.60 × ground_truth + penalty
  ```

+ ```python
+ from rewards.checks import SpreadsheetChecker
+
+ checker = SpreadsheetChecker()
+ checker.set_episode(episode)
+ reward = checker.compute_episode_reward()
+ # {'quality': 0.72, 'efficiency': 0.65, 'ground_truth': 0.80, ..., 'total': 0.68}
+ ```
+
+ The base `RewardCalculator` (`rewards/base.py`) wraps this into the standard 3-component formula used across all gyms.

+ ### OpenEnv Transforms — Per-Step (`rewards/transforms.py`)

+ The `SpreadsheetStepTransform` provides fine-grained per-step rewards for RL training (GRPO). Each tool call receives a reward based on its outcome:
+
+ | Tool | Success | Failure |
+ |---|---|---|
+ | `read_range` / `list_sheets` | 0.0 (neutral) | 0.0 |
+ | `inspect_formula` | +0.05 | 0.0 |
+ | `validate_partial` (improved) | +0.10 | +0.05 |
+ | `write_cell` / `write_range` (in target, after read) | +0.10 | -0.10 |
+ | `write_cell` / `write_range` (out of target) | -0.10 | -0.10 |
+ | `submit_workbook` (100% pass) | +0.50 | — |
+ | `submit_workbook` (>50% pass) | +0.20 | — |
+ | `submit_workbook` (<30% pass) | -0.10 | — |
+
+ ```python
+ from rewards.transforms import SpreadsheetStepTransform
+
+ transform = SpreadsheetStepTransform()
+ scored_obs = transform(observation)
+ print(scored_obs.reward)  # e.g., +0.10 for a write in target after reading
+ ```
+
+ The `OpenEnvRewardCalculator` (`rewards/base.py`) combines per-step rewards with ground truth into the same weighted formula, using sign-based quality scoring.
+
+
333
+ ## Evaluation
334
+
335
+ The included `run_eval.py` runs an LLM agent against scenarios and scores results.
336
+
337
+ ### Quick Start
338
 
339
  ```bash
340
+ cd spreadsheet
341
+ pip install -e .
342
+
343
+ # Build and run the environment
344
+ docker build -t openenv-spreadsheet -f Dockerfile .
345
+ docker run -d --name spreadsheet -p 8000:8000 openenv-spreadsheet
346
+
347
+ # Verify
348
+ curl http://localhost:8000/health
349
+
350
+ # Evaluate (single model, custom rewards)
351
+ python run_eval.py --model gpt-5.4 --save --trajectory
352
+
353
+ # Evaluate (multiple models, per-step rewards)
354
+ python run_eval.py --model gpt-5.4,claude-sonnet-4-6,claude-opus-4-6 \
355
+ --parallel 3 --reward-mode openenv --save --trajectory
356
+
357
+ # Evaluate a specific scenario
358
+ python run_eval.py --model gpt-5.4 --scenario formula_repair_01
359
+
360
+ # Cleanup
361
+ docker stop spreadsheet && docker rm spreadsheet
362
  ```
363
+
364
+ ### Output Paths
365
+
366
+ | Output | Path |
367
+ |---|---|
368
+ | Results markdown | `outputs/results/<run_id>.md` |
369
+ | Trajectory JSON | `outputs/trajectories/<run_id>/<model>.json` |
370
+
371
+ Results files append per-model sections so you can accumulate multiple model runs in one file.
372
+
373
+ ### CLI Arguments
374
+
375
+ | Argument | Default | Description |
376
+ |---|---|---|
377
+ | `--model` | `gpt-4o` | LiteLLM model string (comma-separated for parallel) |
378
+ | `--scenario` | all | Run a specific scenario by ID |
379
+ | `--reward-mode` | `custom` | `custom` (episode-level) or `openenv` (per-step) |
380
+ | `--parallel` | `1` | Number of models to run in parallel |
381
+ | `--save` | off | Save results markdown |
382
+ | `--trajectory` | off | Save trajectory JSON |
383
+ | `--temperature` | `0.0` | LLM sampling temperature |
384
+ | `--max-tokens` | `1024` | Max tokens per LLM response |
385
+ | `--run-id` | auto | Run identifier for grouping outputs |
386
+ | `--verbose` | off | Enable debug logging |
387
+
388
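The CLI surface above maps onto a standard `argparse` setup; a sketch of how these flags (including the comma-separated `--model` convention) could be declared and parsed, not the actual `run_eval.py` parser:

```python
import argparse

# Declare the flags from the CLI table; hyphenated names become attributes
# with underscores (e.g. --reward-mode -> args.reward_mode).
parser = argparse.ArgumentParser(description="Run LLM agents against spreadsheet scenarios")
parser.add_argument("--model", default="gpt-4o", help="LiteLLM model string (comma-separated for parallel)")
parser.add_argument("--scenario", default=None, help="Run a specific scenario by ID")
parser.add_argument("--reward-mode", choices=["custom", "openenv"], default="custom")
parser.add_argument("--parallel", type=int, default=1)
parser.add_argument("--save", action="store_true")
parser.add_argument("--trajectory", action="store_true")
parser.add_argument("--temperature", type=float, default=0.0)
parser.add_argument("--max-tokens", type=int, default=1024)

args = parser.parse_args(["--model", "gpt-5.4,claude-sonnet-4-6", "--parallel", "2"])
models = args.model.split(",")  # one entry per model to run
print(models)  # ['gpt-5.4', 'claude-sonnet-4-6']
```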
+ ## Project Structure
+
+ ```
+ spreadsheet/
+ ├── __init__.py                   # Package exports (env + rewards)
+ ├── client.py                     # OpenEnv client integration
+ ├── models.py                     # Action/Observation data models
+ ├── openenv.yaml                  # OpenEnv AutoEnv manifest
+ ├── pyproject.toml                # Dependencies (openenv-core v0.2.3)
+ ├── Dockerfile                    # Root Dockerfile for HF Spaces
+ ├── .dockerignore
+ ├── run_eval.py                   # LLM evaluation runner
+ │
+ ├── rewards/                      # Reward system (both modes)
+ │   ├── __init__.py
+ │   ├── base.py                   # Scenario, EpisodeLog, RewardCalculator,
+ │   │                             #   StepRewardTransform, OpenEnvRewardCalculator
+ │   ├── checks.py                 # SpreadsheetChecker (episode-level)
+ │   └── transforms.py             # SpreadsheetStepTransform (per-step)
+ │
+ ├── scenarios/                    # Scenario definitions + JSON configs
+ │   ├── __init__.py
+ │   ├── definitions.py            # 12 Scenario objects (Python)
+ │   └── *.json                    # Scenario board configs
+ │
+ ├── agent/                        # LLM agent runner
+ │   ├── __init__.py
+ │   ├── llm.py                    # LiteLLM wrapper
+ │   └── runner.py                 # AgentRunner (gym-agnostic)
+ │
+ ├── server/                       # OpenEnv environment server
+ │   ├── __init__.py
+ │   ├── app.py                    # FastAPI + FastMCP server
+ │   ├── spreadsheet_environment.py # MCPEnvironment implementation
+ │   ├── workbook_engine.py        # Workbook engine (openpyxl)
+ │   ├── formula_utils.py          # Formula evaluation
+ │   ├── scenario_loader.py        # Scenario JSON loader
+ │   └── Dockerfile                # Server-only Dockerfile
+ │
+ ├── workbooks/                    # Workbook files
+ │   ├── templates/                # Base workbook templates
+ │   ├── fixtures/                 # Test fixture workbooks
+ │   └── hidden_tests/             # Hidden test check definitions
+ │
+ └── outputs/                      # Evaluation outputs (gitignored)
+     ├── results/                  # Markdown result files
+     └── trajectories/             # JSON trajectory files
+ ```
+
+ ## Configuration (.env)
+
+ Copy `.env.example` to `.env` and fill in your API keys:
+
+ ```bash
+ cp .env.example .env
+ # Edit .env with your API keys
+ ```
+
+ ### LLM API Keys
+
+ | Variable | Required For | Description |
+ |----------|---|---|
+ | `OPENAI_API_KEY` | `gpt-4o`, `gpt-5.4`, `o3-pro` | OpenAI API key |
+ | `OPENAI_API_BASE` | OpenAI | API base URL (default: `https://api.openai.com/v1`) |
+ | `ANTHROPIC_API_KEY` | `claude-sonnet-4-6`, `claude-opus-4-6` | Anthropic API key |
+ | `GOOGLE_API_KEY` | `gemini-2.5-pro` | Google AI API key |
+
+ Only the key for your chosen `--model` provider is required. For local models via Ollama, no key is needed.
+
+ ### LLM Defaults
+
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `LLM_MODEL` | `gpt-4o` | Default model when `--model` is not specified |
+ | `LLM_TEMPERATURE` | `0.0` | Default sampling temperature |
+ | `LLM_MAX_TOKENS` | `1024` | Default max tokens per response |
+
+ ### Environment Server
+
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `OPENENV_PORT` | `8000` | OpenEnv server port (exposed) |
+ | `MAX_CONCURRENT_ENVS` | `8` | Max parallel evaluation sessions |
+ | `ENABLE_WEB_INTERFACE` | `true` | Enable HF Spaces web UI |
+ | `WORKBOOKS_DIR` | `workbooks` | Workbook files directory |
+ | `SCENARIOS_DIR` | `scenarios` | Scenario JSON directory |
+
+ ## Concurrent Sessions
+
+ Each evaluation session gets its own isolated workbook engine instance. Multiple agents can evaluate simultaneously against the same Docker container without interference.
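Server-side, settings like the table above are typically read from the process environment with typed defaults. A sketch using stdlib `os.environ`; the actual loader in `server/app.py` is not shown in this commit:

```python
import os

def int_env(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    return int(os.environ.get(name, str(default)))

# Defaults mirror the Environment Server table above.
port = int_env("OPENENV_PORT", 8000)
max_envs = int_env("MAX_CONCURRENT_ENVS", 8)
web_ui = os.environ.get("ENABLE_WEB_INTERFACE", "true").lower() == "true"
```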
__init__.py CHANGED
@@ -1,11 +1,26 @@
  """Spreadsheet Environment."""

  from .client import SpreadsheetEnv
- from .models import SpreadsheetAction, SpreadsheetObservation, SpreadsheetState
+ from .models import (
+     SpreadsheetAction,
+     SpreadsheetObservation,
+     SpreadsheetState,
+     CallToolAction,
+     CallToolObservation,
+     ListToolsAction,
+     ListToolsObservation,
+ )
+ from .rewards import SpreadsheetChecker, SpreadsheetStepTransform

  __all__ = [
+     "SpreadsheetEnv",
      "SpreadsheetAction",
      "SpreadsheetObservation",
      "SpreadsheetState",
-     "SpreadsheetEnv",
+     "CallToolAction",
+     "CallToolObservation",
+     "ListToolsAction",
+     "ListToolsObservation",
+     "SpreadsheetChecker",
+     "SpreadsheetStepTransform",
  ]
agent/__init__.py ADDED
@@ -0,0 +1,4 @@
+ from .runner import AgentRunner
+ from .llm import LLMClient
+
+ __all__ = ["AgentRunner", "LLMClient"]
agent/llm.py ADDED
@@ -0,0 +1,114 @@
+ """
+ LLM abstraction layer using LiteLLM.
+
+ Supports any model LiteLLM supports — switch with a single string:
+ - OpenAI: "gpt-4o", "gpt-5.4", "o3-pro"
+ - Anthropic: "claude-opus-4-6", "claude-sonnet-4-6"
+ - Local: "ollama/llama3", "ollama/mistral"
+ - And 100+ more providers
+
+ API keys are read from environment variables (loaded from root .env):
+ OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.
+
+ Usage:
+     from agent.llm import LLMClient
+
+     llm = LLMClient(model="gpt-4o")
+     response = llm.chat(
+         messages=[{"role": "user", "content": "Hello"}],
+         tools=[...],
+     )
+ """
+
+ import json
+ import logging
+ from typing import Any, Dict, List, Optional
+
+ import litellm
+
+ logger = logging.getLogger(__name__)
+
+
+ class LLMClient:
+     """
+     Thin wrapper around LiteLLM for consistent tool-calling across providers.
+
+     The same code works whether you're hitting GPT-4o, Claude, or a local
+     Ollama model — LiteLLM handles the translation.
+     """
+
+     _REASONING_MODELS = {"o3-pro", "o3-mini", "o3", "o1", "o1-mini", "o1-pro", "gpt-5"}
+
+     def __init__(
+         self,
+         model: str,
+         temperature: float = 0.0,
+         max_tokens: int = 1024,
+     ):
+         self.model = model
+
+         if model in self._REASONING_MODELS:
+             self.temperature = 1.0
+             self.max_tokens = max(max_tokens, 4096)
+             if temperature != 1.0:
+                 logger.info(f"Model {model} requires temperature=1.0, overriding from {temperature}")
+         else:
+             self.temperature = temperature
+             self.max_tokens = max_tokens
+
+     def chat(
+         self,
+         messages: List[Dict[str, Any]],
+         tools: Optional[List[Dict[str, Any]]] = None,
+     ) -> Any:
+         """
+         Send messages to the LLM and get a response.
+
+         Args:
+             messages: Conversation history in OpenAI format
+             tools: Optional list of tools in OpenAI function-calling format
+
+         Returns:
+             LiteLLM ModelResponse (same shape as OpenAI ChatCompletion).
+         """
+         kwargs: Dict[str, Any] = {
+             "model": self.model,
+             "messages": messages,
+             "temperature": self.temperature,
+             "max_tokens": self.max_tokens,
+         }
+
+         if tools:
+             kwargs["tools"] = tools
+             kwargs["tool_choice"] = "auto"
+
+         logger.debug(f"LLM request: model={self.model}, messages={len(messages)}, tools={len(tools or [])}")
+         response = litellm.completion(**kwargs)
+         logger.debug(f"LLM response: finish_reason={response.choices[0].finish_reason}")
+
+         return response
+
+     @staticmethod
+     def extract_tool_calls(response) -> List[Dict[str, Any]]:
+         """Extract tool calls from an LLM response."""
+         choice = response.choices[0]
+         if not choice.message.tool_calls:
+             return []
+
+         calls = []
+         for tc in choice.message.tool_calls:
+             args = tc.function.arguments
+             if isinstance(args, str):
+                 args = json.loads(args)
+             calls.append({
+                 "id": tc.id,
+                 "name": tc.function.name,
+                 "arguments": args,
+             })
+         return calls
+
+     @staticmethod
+     def get_text_response(response) -> Optional[str]:
+         """Extract plain text content from an LLM response (if any)."""
+         choice = response.choices[0]
+         return choice.message.content
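A quick way to see what `extract_tool_calls` produces, using a stand-in for the provider response. The attribute shapes mirror OpenAI/LiteLLM response objects; the mock and the inlined extraction loop are illustrative:

```python
import json
from types import SimpleNamespace

# Stand-in for a LiteLLM ModelResponse carrying one tool call whose
# arguments arrive as a JSON-encoded string, as providers return them.
tool_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(
        name="write_cell",
        arguments=json.dumps({"sheet": "Summary", "cell": "C5", "value": "=SUM(B2:B10)"}),
    ),
)
message = SimpleNamespace(tool_calls=[tool_call], content=None)
response = SimpleNamespace(choices=[SimpleNamespace(message=message)])

# Same logic as LLMClient.extract_tool_calls: decode string arguments to a dict.
calls = []
for tc in response.choices[0].message.tool_calls:
    args = tc.function.arguments
    if isinstance(args, str):
        args = json.loads(args)
    calls.append({"id": tc.id, "name": tc.function.name, "arguments": args})

print(calls[0]["name"])               # write_cell
print(calls[0]["arguments"]["cell"])  # C5
```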
agent/runner.py ADDED
@@ -0,0 +1,282 @@
+ """
+ Gym-agnostic Agent Runner — connects an LLM to any OpenEnv environment.
+
+ This module is the CORE of the evaluation platform. It:
+ 1. Receives a pre-connected OpenEnv client (from AutoEnv discovery)
+ 2. Discovers tools via list_tools()
+ 3. Gives the LLM a scenario prompt + available tools
+ 4. Loops: LLM reasons → agent calls env.step() → observation → LLM reasons again
+ 5. Collects an EpisodeLog with timestamps for reward calculation + trajectory logging
+
+ Usage:
+     from openenv import AutoEnv
+     env = AutoEnv.from_env("spreadsheet", base_url="http://localhost:8000")
+     runner = AgentRunner(model="gpt-4o", env_client=env)
+     episode, breakdown = runner.run_scenario(scenario, checker)
+ """
+
+ import json
+ import logging
+ import time
+ from datetime import datetime, timezone, timedelta
+ from typing import Any, Dict, List, Tuple
+
+ from openenv.core.mcp_client import MCPToolClient
+ from openenv.core.env_server.mcp_types import CallToolAction, CallToolObservation, Tool
+
+ from ..rewards.base import (
+     EpisodeLog,
+     RewardBreakdown,
+     RewardCalculator,
+     Scenario,
+     OpenEnvRewardCalculator,
+ )
+ from .llm import LLMClient
+
+ logger = logging.getLogger(__name__)
+
+ IST = timezone(timedelta(hours=5, minutes=30))
+
+
+ SYSTEM_PROMPT = """\
+ You are an AI agent interacting with an environment through tools.
+
+ Your job:
+ 1. Read the task description carefully.
+ 2. Use the available tools to complete the task.
+ 3. Call tools one at a time. Wait for each result before deciding the next step.
+ 4. When the task is complete, respond with a plain text summary of what you did.
+    Do NOT call any more tools after you're done.
+
+ Rules:
+ - Only use tools that are listed as available.
+ - Provide all required arguments for each tool call.
+ - If a tool call fails, read the error and decide how to recover.
+ - Be efficient — complete the task in as few steps as possible.
+ - When you're done, clearly state what you accomplished.
+ """
+
+
+ def mcp_tools_to_openai(tools: List[Tool]) -> List[Dict[str, Any]]:
+     """Convert OpenEnv MCP tool definitions to OpenAI function-calling format."""
+     openai_tools = []
+     for tool in tools:
+         schema = tool.input_schema or {"type": "object", "properties": {}}
+         if "type" not in schema:
+             schema["type"] = "object"
+         if "properties" not in schema:
+             schema["properties"] = {}
+
+         openai_tools.append({
+             "type": "function",
+             "function": {
+                 "name": tool.name,
+                 "description": tool.description or "",
+                 "parameters": schema,
+             },
+         })
+     return openai_tools
+
+
+ def _observation_to_str(step_result) -> str:
+     """Convert an OpenEnv step result to a string the LLM can read."""
+     obs = step_result.observation
+     if isinstance(obs, CallToolObservation):
+         if obs.error:
+             return json.dumps({"error": obs.error.message}, indent=2)
+         result = obs.result
+         if hasattr(result, "data"):
+             result = result.data
+         elif isinstance(result, dict) and "data" in result:
+             result = result["data"]
+         try:
+             return json.dumps(result, indent=2, default=str)
+         except (TypeError, ValueError):
+             return str(result)
+     if hasattr(obs, "metadata") and obs.metadata:
+         return json.dumps(obs.metadata, indent=2, default=str)
+     return str(obs)
+
+
+ class AgentRunner:
+     """
+     Gym-agnostic agent that connects an LLM to any OpenEnv environment.
+
+     Reward modes:
+     - "custom" (default): Episode-level reward via RewardCalculator
+     - "openenv": Per-step reward via Transform + ground truth
+     """
+
+     def __init__(
+         self,
+         model: str,
+         env_client: MCPToolClient,
+         temperature: float = 0.0,
+         max_tokens: int = 1024,
+         reward_mode: str = "custom",
+         transform=None,
+     ):
+         self.llm = LLMClient(
+             model=model,
+             temperature=temperature,
+             max_tokens=max_tokens,
+         )
+         self.env_client = env_client
+         self.reward_mode = reward_mode
+         self.transform = transform
+
+         self.calculator = RewardCalculator()
+
+         if reward_mode == "openenv":
+             self.openenv_calculator = OpenEnvRewardCalculator()
+
+     def run_scenario(
+         self,
+         scenario: Scenario,
+         checker: Any,
+     ) -> Tuple[EpisodeLog, RewardBreakdown]:
+         """Run a single scenario through the LLM agent."""
+         return self._execute(scenario, checker, self.env_client)
+
+     def _execute(
+         self,
+         scenario: Scenario,
+         checker: Any,
+         env: MCPToolClient,
+     ) -> Tuple[EpisodeLog, RewardBreakdown]:
+
+         env.reset()
+
+         session_id = None
+         try:
+             session_result = env.step(
+                 CallToolAction(tool_name="get_session_info", arguments={})
+             )
+             obs = session_result.observation
+             if isinstance(obs, CallToolObservation) and obs.result:
+                 result_data = obs.result
+                 if hasattr(result_data, "data"):
+                     result_data = result_data.data
+                 elif isinstance(result_data, dict) and "data" in result_data:
+                     result_data = result_data["data"]
+                 if isinstance(result_data, dict):
+                     session_id = result_data.get("session_id")
+                 elif isinstance(result_data, str):
+                     try:
+                         parsed = json.loads(result_data)
+                         session_id = parsed.get("session_id")
+                     except (ValueError, TypeError):
+                         pass
+         except Exception as e:
+             logger.warning(f"Could not get session_id: {e}")
+
+         if session_id and hasattr(checker, "set_session"):
+             checker.set_session(session_id)
+             logger.info(f"Session-scoped checker -> {session_id}")
+
+         if self.transform and hasattr(self.transform, "set_scenario"):
+             self.transform.set_scenario(scenario)
+
+         all_tools = env.list_tools(use_cache=False)
+         tools = [t for t in all_tools if t.name != "get_session_info"]
+         openai_tools = mcp_tools_to_openai(tools)
+         tool_names = [t.name for t in tools]
+         logger.info(f"Discovered {len(tools)} agent tools: {tool_names}")
+
+         messages = [
+             {"role": "system", "content": SYSTEM_PROMPT},
+             {"role": "user", "content": scenario.prompt},
+         ]
+
+         episode = EpisodeLog()
+         step_rewards = []
+         final_answer = None
+
+         for step_num in range(1, scenario.max_steps + 1):
+             logger.info(f"Step {step_num}/{scenario.max_steps}")
+
+             response = self.llm.chat(messages, tools=openai_tools)
+             tool_calls = LLMClient.extract_tool_calls(response)
+
+             if not tool_calls:
+                 final_answer = LLMClient.get_text_response(response)
+                 logger.info(f"Agent done. Final answer: {(final_answer or '')[:100]}...")
+                 break
+
+             messages.append(response.choices[0].message.model_dump())
+
+             for tc in tool_calls:
+                 tool_name = tc["name"]
+                 arguments = tc["arguments"]
+                 call_id = tc["id"]
+
+                 logger.info(f"  Tool: {tool_name}({json.dumps(arguments, default=str)[:100]})")
+
+                 step_ts = datetime.now(IST).isoformat()
+                 step_start = time.time()
+                 error_msg = None
+                 try:
+                     step_result = env.step(
+                         CallToolAction(tool_name=tool_name, arguments=arguments)
+                     )
+                     obs = step_result.observation
+                     is_error = (
+                         isinstance(obs, CallToolObservation)
+                         and obs.error is not None
+                     )
+                     result_str = _observation_to_str(step_result)
+                     if is_error and isinstance(obs, CallToolObservation):
+                         error_msg = obs.error.message
+                 except Exception as exc:
+                     is_error = True
+                     error_msg = str(exc)
+                     result_str = json.dumps({"error": error_msg})
+                     obs = None
+
+                 step_elapsed = time.time() - step_start
+
+                 if self.reward_mode == "openenv" and self.transform and obs is not None:
+                     transformed = self.transform(obs)
+                     step_rewards.append(
+                         transformed.reward if transformed.reward is not None else 0.0
+                     )
+
+                 episode.add_step(
+                     tool_name=tool_name,
+                     arguments=arguments,
+                     success=not is_error,
+                     result=result_str,
+                     error=error_msg,
+                     timestamp=step_ts,
+                     elapsed=step_elapsed,
+                 )
+
+                 logger.info(f"  -> success={not is_error} ({step_elapsed:.2f}s)")
+
+                 messages.append({
+                     "role": "tool",
+                     "tool_call_id": call_id,
+                     "content": result_str,
+                 })
+
+         if hasattr(checker, "set_episode"):
+             checker.set_episode(episode)
+
+         outcome_results = checker.check_all(scenario.outcome_checks)
+
+         if self.reward_mode == "openenv":
+             breakdown = self.openenv_calculator.calculate(
+                 step_rewards=step_rewards,
+                 outcome_results=outcome_results,
+                 max_steps=scenario.max_steps,
+                 actual_steps=len(episode.steps),
+             )
+         else:
+             breakdown = self.calculator.calculate(
+                 episode=episode,
+                 scenario=scenario,
+                 outcome_results=outcome_results,
+             )
+
+         return episode, breakdown
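The message bookkeeping the loop performs for one tool round can be sketched with plain dicts in OpenAI chat format (an illustration, not part of the commit; `call_1` and the tool arguments are made up):

```python
# One round of the agent loop, as plain chat messages.
messages = [
    {"role": "system", "content": "You are an AI agent..."},
    {"role": "user", "content": "Set A1 to 42."},
]

# 1. The assistant turn carries the tool call (the shape model_dump() yields).
assistant_msg = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "write_cell", "arguments": '{"cell": "A1", "value": 42}'},
    }],
}
messages.append(assistant_msg)

# 2. The env result is appended as a "tool" message with the matching id,
#    so the next chat() call sees the full round.
messages.append({"role": "tool", "tool_call_id": "call_1", "content": '{"ok": true}'})

assert messages[-1]["tool_call_id"] == messages[-2]["tool_calls"][0]["id"]
```

If the `tool_call_id` does not match an id in the preceding assistant message, most chat APIs reject the request, which is why the runner appends the assistant message before executing any of its tool calls.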
pyproject.toml CHANGED
@@ -8,7 +8,7 @@ version = "0.1.0"
  description = "Spreadsheet gym — exact workbook manipulation and reasoning over realistic spreadsheet tasks"
  requires-python = ">=3.11"
  dependencies = [
-     "openenv-core @ git+https://github.com/meta-pytorch/OpenEnv.git@v0.2.1",
+     "openenv-core @ git+https://github.com/meta-pytorch/OpenEnv.git@v0.2.3",
      "fastapi>=0.115.0",
      "pydantic>=2.0.0",
      "uvicorn[standard]>=0.24.0",
@@ -17,6 +17,9 @@ dependencies = [
      "openpyxl>=3.1.0",
      "pandas>=2.0.0",
      "formulas>=1.2.0",
+     "litellm>=1.0.0",
+     "pyyaml>=6.0.0",
+     "python-dotenv>=1.0.0",
  ]

  [project.optional-dependencies]
@@ -30,8 +33,21 @@ server = "spreadsheet.server.app:main"

  [tool.setuptools]
  include-package-data = true
- packages = ["spreadsheet", "spreadsheet.server"]
- package-dir = { "spreadsheet" = ".", "spreadsheet.server" = "server" }
+ packages = [
+     "spreadsheet",
+     "spreadsheet.server",
+     "spreadsheet.rewards",
+     "spreadsheet.scenarios",
+     "spreadsheet.agent",
+ ]
+
+ [tool.setuptools.package-dir]
+ spreadsheet = "."
+ "spreadsheet.server" = "server"
+ "spreadsheet.rewards" = "rewards"
+ "spreadsheet.scenarios" = "scenarios"
+ "spreadsheet.agent" = "agent"

  [tool.setuptools.package-data]
  spreadsheet = ["openenv.yaml"]
+ "spreadsheet.scenarios" = ["*.json"]
rewards/__init__.py ADDED
@@ -0,0 +1,23 @@
+ from .checks import SpreadsheetChecker
+ from .transforms import SpreadsheetStepTransform
+ from .base import (
+     Scenario,
+     EpisodeLog,
+     StepLog,
+     RewardBreakdown,
+     RewardCalculator,
+     StepRewardTransform,
+     OpenEnvRewardCalculator,
+ )
+
+ __all__ = [
+     "SpreadsheetChecker",
+     "SpreadsheetStepTransform",
+     "Scenario",
+     "EpisodeLog",
+     "StepLog",
+     "RewardBreakdown",
+     "RewardCalculator",
+     "StepRewardTransform",
+     "OpenEnvRewardCalculator",
+ ]
rewards/base.py ADDED
@@ -0,0 +1,313 @@
+ """
+ Base reward infrastructure — data classes, calculators, and transforms.
+
+ Merged from the shared repo-level modules into a self-contained file:
+ - Episode-level: RewardCalculator (custom mode)
+ - Per-step: StepRewardTransform + OpenEnvRewardCalculator (openenv mode)
+
+ Scoring formula (both modes):
+     total = 0.25 * quality/structural + 0.15 * efficiency + 0.60 * ground_truth + penalty
+
+ Usage:
+     from rewards.base import RewardCalculator, Scenario, EpisodeLog
+     calculator = RewardCalculator()
+     breakdown = calculator.calculate(episode, scenario, outcome_results)
+ """
+
+ from dataclasses import dataclass, field
+ from typing import Any, Dict, List, Optional, Set
+
+ from openenv.core.env_server.interfaces import Transform
+ from openenv.core.env_server.mcp_types import CallToolObservation
+ from openenv.core.env_server.types import Observation
+
+
+ # ── Data Classes ──
+
+
+ @dataclass
+ class StepLog:
+     """Record of a single tool call made by the agent."""
+
+     tool_name: str
+     arguments: Dict[str, Any]
+     success: bool
+     result: Any = None
+     error: Optional[str] = None
+     timestamp: Optional[str] = None
+     elapsed: float = 0.0
+
+
+ @dataclass
+ class EpisodeLog:
+     """Record of all tool calls in one episode."""
+
+     steps: List[StepLog] = field(default_factory=list)
+
+     def add_step(
+         self,
+         tool_name: str,
+         arguments: Dict[str, Any],
+         success: bool,
+         result: Any = None,
+         error: Optional[str] = None,
+         timestamp: Optional[str] = None,
+         elapsed: float = 0.0,
+     ) -> None:
+         self.steps.append(
+             StepLog(
+                 tool_name=tool_name,
+                 arguments=arguments,
+                 success=success,
+                 result=result,
+                 error=error,
+                 timestamp=timestamp,
+                 elapsed=elapsed,
+             )
+         )
+
+     @property
+     def tools_used(self) -> List[str]:
+         return [s.tool_name for s in self.steps]
+
+     @property
+     def tools_used_set(self) -> Set[str]:
+         return set(self.tools_used)
+
+
+ @dataclass
+ class Scenario:
+     """Definition of a task for the agent."""
+
+     id: str
+     prompt: str
+     expected_tools: List[str]
+     max_steps: int
+     outcome_checks: List[Dict[str, Any]]
+
+
+ @dataclass
+ class RewardBreakdown:
+     """Detailed reward breakdown — useful for debugging and logging."""
+
+     structural: float = 0.0
+     ground_truth: float = 0.0
+     efficiency: float = 0.0
+     penalty: float = 0.0
+     total: float = 0.0
+     details: Dict[str, Any] = field(default_factory=dict)
+
+     def summary(self) -> str:
+         mode = self.details.get("reward_mode", "custom")
+         qual_label = "Quality" if mode == "openenv" else "Structural"
+         lines = [
+             f"  {qual_label + ':':14s}{self.structural:.2f} (weight 0.25)",
+             f"  Efficiency:   {self.efficiency:.2f} (weight 0.15)",
+             f"  Ground Truth: {self.ground_truth:.2f} (weight 0.60)",
+         ]
+         if self.penalty < 0:
+             lines.append(f"  Penalty:      {self.penalty:.2f} (hallucination)")
+         lines.append("  ────────────────────────")
+         lines.append(f"  TOTAL:        {self.total:.2f}")
+         return "\n".join(lines)
+
+
+ # ── Episode-Level Reward Calculator (custom mode) ──
+
+
+ class RewardCalculator:
+     """
+     Computes episode-level reward from logs + scenario + verification results.
+
+     Weights: structural (0.25), ground_truth (0.60), efficiency (0.15).
+     """
+
+     def __init__(
+         self,
+         w_structural: float = 0.25,
+         w_ground_truth: float = 0.60,
+         w_efficiency: float = 0.15,
+     ):
+         self.w_structural = w_structural
+         self.w_ground_truth = w_ground_truth
+         self.w_efficiency = w_efficiency
+
+     def calculate(
+         self,
+         episode: EpisodeLog,
+         scenario: Scenario,
+         outcome_results: List[float],
+     ) -> RewardBreakdown:
+         breakdown = RewardBreakdown()
+
+         breakdown.structural = self._structural_score(episode, scenario)
+         breakdown.ground_truth = self._ground_truth_score(outcome_results)
+         breakdown.efficiency = self._efficiency_score(episode, scenario)
+         breakdown.penalty = self._hallucination_penalty(episode, outcome_results)
+
+         breakdown.total = (
+             self.w_structural * breakdown.structural
+             + self.w_ground_truth * breakdown.ground_truth
+             + self.w_efficiency * breakdown.efficiency
+             + breakdown.penalty
+         )
+         breakdown.total = max(-1.0, min(1.0, breakdown.total))
+
+         breakdown.details = {
+             "tools_expected": scenario.expected_tools,
+             "tools_used": episode.tools_used,
+             "outcome_checks_score_sum": sum(outcome_results),
+             "outcome_checks_total": len(outcome_results),
+             "outcome_checks_avg": sum(outcome_results) / len(outcome_results) if outcome_results else 0.0,
+             "steps_taken": len(episode.steps),
+             "max_steps": scenario.max_steps,
+         }
+
+         return breakdown
+
+     def _structural_score(self, episode: EpisodeLog, scenario: Scenario) -> float:
+         if not episode.steps:
+             return 0.0
+
+         expected = set(scenario.expected_tools)
+         used = episode.tools_used_set
+
+         intersection = expected & used
+         precision = len(intersection) / len(used) if used else 0.0
+         recall = len(intersection) / len(expected) if expected else 0.0
+         f1 = (
+             2 * precision * recall / (precision + recall)
+             if (precision + recall) > 0
+             else 0.0
+         )
+
+         success_rate = sum(1 for s in episode.steps if s.success) / len(episode.steps)
+
+         unexpected_calls = sum(
+             1 for s in episode.steps if s.tool_name not in expected
+         )
+         unexpected_ratio = unexpected_calls / len(episode.steps)
+
+         return max(0.0, 0.6 * f1 + 0.4 * success_rate - unexpected_ratio * 0.3)
+
+     def _ground_truth_score(self, outcome_results: List[float]) -> float:
+         if not outcome_results:
+             return 0.0
+         return sum(outcome_results) / len(outcome_results)
+
+     def _efficiency_score(self, episode: EpisodeLog, scenario: Scenario) -> float:
+         if not episode.steps:
+             return 0.0
+         return max(0.0, 1.0 - len(episode.steps) / scenario.max_steps)
+
+     def _hallucination_penalty(
+         self, episode: EpisodeLog, outcome_results: List[float]
+     ) -> float:
+         if not episode.steps or not outcome_results:
+             return 0.0
+
+         all_calls_succeeded = all(s.success for s in episode.steps)
+         pass_rate = sum(outcome_results) / len(outcome_results)
+
+         if all_calls_succeeded and pass_rate == 0.0:
+             return -0.5
+         if all_calls_succeeded and pass_rate < 0.3:
+             return -0.2
+
+         return 0.0
+
+
+ # ── Per-Step Reward Transform (openenv mode) ──
+
+
+ class StepRewardTransform(Transform):
+     """
+     Gym-agnostic per-step reward transform.
+
+     Sets observation.reward based on tool call success/failure.
+     Subclass for gym-specific logic (see transforms.py).
+     """
+
+     def __call__(self, observation: Observation) -> Observation:
+         reward = self._compute_reward(observation)
+         observation.reward = reward
+         return observation
+
+     def _compute_reward(self, observation: Observation) -> float:
+         if isinstance(observation, CallToolObservation):
+             if observation.error is not None:
+                 return -0.5
+             return 1.0
+         return 0.0
+
+
+ class OpenEnvRewardCalculator:
+     """
+     Combines per-step transform rewards with ground truth verification.
+
+     Used as the alternative to RewardCalculator when --reward-mode openenv.
+
+     Quality is sign-based: only the sign of per-step rewards matters
+     (positive = productive, negative = harmful, zero = neutral).
+     """
+
+     def __init__(
+         self,
+         w_quality: float = 0.25,
+         w_efficiency: float = 0.15,
+         w_ground_truth: float = 0.60,
+     ):
+         self.w_quality = w_quality
+         self.w_efficiency = w_efficiency
+         self.w_ground_truth = w_ground_truth
+
+     def calculate(
+         self,
+         step_rewards: List[float],
+         outcome_results: List[bool],
+         max_steps: int = 0,
+         actual_steps: int = 0,
+     ) -> RewardBreakdown:
+         productive = sum(1 for r in step_rewards if r > 0)
+         harmful = sum(1 for r in step_rewards if r < 0)
+         active = productive + harmful
+         quality = productive / active if active > 0 else 0.0
+
+         if max_steps > 0 and actual_steps > 0:
+             efficiency = max(0.0, 1.0 - actual_steps / max_steps)
+         else:
+             efficiency = 0.0
+
+         gt_score = sum(outcome_results) / len(outcome_results) if outcome_results else 0.0
+
+         penalty = 0.0
+         if step_rewards and outcome_results:
+             all_positive = all(r > 0 for r in step_rewards)
+             if all_positive and gt_score == 0.0:
+                 penalty = -0.5
+             elif all_positive and gt_score < 0.3:
+                 penalty = -0.2
+
+         total = (
+             self.w_quality * quality
+             + self.w_efficiency * efficiency
+             + self.w_ground_truth * gt_score
+             + penalty
+         )
+         total = max(-1.0, min(1.0, total))
+
+         return RewardBreakdown(
+             structural=quality,
+             ground_truth=gt_score,
+             efficiency=efficiency,
+             penalty=penalty,
+             total=total,
+             details={
+                 "reward_mode": "openenv",
+                 "productive_steps": productive,
+                 "harmful_steps": harmful,
+                 "neutral_steps": len(step_rewards) - active,
+                 "actual_steps": actual_steps,
+                 "max_steps": max_steps,
+             },
+         )
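As a sanity check on the openenv-mode formula (`0.25 * quality + 0.15 * efficiency + 0.60 * ground_truth + penalty`), here is a standalone arithmetic walk-through with hypothetical numbers, reimplementing the weighted sum rather than importing the module:

```python
# Hypothetical episode: 4 steps, mixed per-step rewards, 3 of 4 checks pass.
step_rewards = [0.05, 0.05, -0.10, 0.50]
outcome_results = [True, True, False, True]
max_steps, actual_steps = 10, 4

productive = sum(1 for r in step_rewards if r > 0)      # 3 positive-sign steps
harmful = sum(1 for r in step_rewards if r < 0)         # 1 negative-sign step
quality = productive / (productive + harmful)           # 3/4 = 0.75 (sign-based)
efficiency = max(0.0, 1.0 - actual_steps / max_steps)   # 1 - 4/10 = 0.6
gt_score = sum(outcome_results) / len(outcome_results)  # 3/4 = 0.75
penalty = 0.0  # not all steps positive, so no hallucination penalty applies

total = 0.25 * quality + 0.15 * efficiency + 0.60 * gt_score + penalty
print(round(total, 4))  # 0.1875 + 0.09 + 0.45 = 0.7275
```

Note how the magnitudes of the per-step rewards never enter the quality term; only their signs do, as the class docstring states.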
rewards/checks.py ADDED
@@ -0,0 +1,230 @@
+ """
+ Spreadsheet outcome checks — ground truth from the episode trajectory.
+
+ Ground truth is reconstructed from the episode log:
+ - submit_workbook tool call: pass_rate, per-check results
+ - validate_partial calls: intermediate progress
+ - write_cell/write_range calls: edit targets and counts
+
+ Check types:
+ - hidden_test_pass_rate : fraction of hidden checks that passed on final submit
+
+ Layer 2 (custom mode) uses RewardCalculator from rewards/base.py with
+ structural + ground_truth + efficiency. This checker provides outcome_results.
+
+ Additionally, compute_custom_reward() provides the 6-component detailed
+ breakdown specified in the spreadsheet gym prompt:
+     hidden_test_pass_rate (0.40)
+     formula_correctness   (0.20)
+     edit_efficiency       (0.10)
+     invalid_edit_penalty  (0.10)
+     structural_integrity  (0.10)
+     debugging_quality     (0.10)
+ """
+
+ from __future__ import annotations
+
+ import json
+ from typing import Any, Dict, List, Optional
+
+
+ WRITE_TOOLS = frozenset({"write_cell", "write_range"})
+ READ_TOOLS = frozenset({"read_range", "read_cell"})
+
+
+ def _parse_result(result: Any) -> dict:
+     if isinstance(result, dict):
+         return result
+     if isinstance(result, str):
+         try:
+             return json.loads(result)
+         except (json.JSONDecodeError, TypeError):
+             return {}
+     if hasattr(result, "data") and isinstance(result.data, dict):
+         return result.data
+     return {}
+
+
+ class SpreadsheetChecker:
+     """Verifies outcomes from the episode log + scenario outcome_checks."""
+
+     def __init__(self, api_url: str | None = None, session_id: str | None = None):
+         self._api_url = api_url
+         self._session_id = session_id
+         self._steps: List[dict] = []
+
+     def set_episode(self, episode) -> None:
+         """Populate from the episode trajectory (called by run_eval before check_all)."""
+         self._steps = []
+         for step in episode.steps:
+             self._steps.append({
+                 "tool_name": step.tool_name,
+                 "arguments": step.arguments or {},
+                 "success": step.success,
+                 "result": _parse_result(step.result),
+             })
+
+     def set_session(self, session_id: str) -> None:
+         self._session_id = session_id
+
+     def check_all(self, checks: List[Dict[str, Any]]) -> List[bool]:
+         return [self._run_check(c) for c in checks]
+
+     def _run_check(self, check: Dict[str, Any]) -> bool:
+         check_type = check.get("type", "")
+         if check_type == "hidden_test_pass_rate":
+             return self._check_hidden_test_pass_rate(check)
+         return False
+
+     def _check_hidden_test_pass_rate(self, check: dict) -> bool:
+         """Check whether submit_workbook achieved the minimum pass rate."""
+         min_rate = check.get("min_pass_rate", 0.5)
+         submit_result = self._get_last_submit_result()
+         if not submit_result:
+             return False
+         return submit_result.get("pass_rate", 0) >= min_rate
+
+     def _get_last_submit_result(self) -> dict:
+         """Extract the result from the last submit_workbook call."""
+         for s in reversed(self._steps):
+             if s["tool_name"] == "submit_workbook" and s["success"]:
+                 return s.get("result", {})
+         return {}
+
+     # ── 6-component custom reward (Layer 2 detailed breakdown) ──
+
+     def compute_custom_reward(self) -> Dict[str, float]:
+         """
+         6-component detailed custom reward.
+
+         Returns a dict with component scores (each 0.0–1.0) and the weighted total.
+         """
+         submit = self._get_last_submit_result()
+
+         hidden = self._score_hidden_test_pass_rate(submit)
+         formula = self._score_formula_correctness(submit)
+         efficiency = self._score_edit_efficiency()
+         invalid = self._score_invalid_edit_penalty()
+         integrity = self._score_structural_integrity()
+         debugging = self._score_debugging_quality()
+
+         total = (
+             0.40 * hidden
+             + 0.20 * formula
+             + 0.10 * efficiency
+             + 0.10 * (1.0 - invalid)
+             + 0.10 * integrity
+             + 0.10 * debugging
+         )
+
+         return {
+             "hidden_test_pass_rate": round(hidden, 4),
+             "formula_correctness": round(formula, 4),
+             "edit_efficiency": round(efficiency, 4),
+             "invalid_edit_penalty": round(invalid, 4),
+             "structural_integrity": round(integrity, 4),
+             "debugging_quality": round(debugging, 4),
+             "total": round(max(0.0, min(1.0, total)), 4),
+         }
+
+     def _score_hidden_test_pass_rate(self, submit: dict) -> float:
+         """Fraction of hidden checks that passed."""
+         return submit.get("pass_rate", 0.0) if submit else 0.0
+
+     def _score_formula_correctness(self, submit: dict) -> float:
+         """Among formula-type checks, what fraction passed?"""
+         if not submit:
+             return 0.0
+         details = submit.get("details", [])
+         if not details:
+             return self._score_hidden_test_pass_rate(submit)
+
+         formula_checks = [d for d in details if d.get("check_type") == "expected_formula"]
+         if not formula_checks:
+             return 1.0
+         passed = sum(1 for d in formula_checks if d.get("passed"))
+         return passed / len(formula_checks)
+
+     def _score_edit_efficiency(self) -> float:
+         """Ratio of minimum plausible edits to actual write steps. Fewer steps = higher score."""
+         write_steps = [s for s in self._steps if s["tool_name"] in WRITE_TOOLS and s["success"]]
+         if not write_steps:
+             return 0.0
+         unique_targets = set()
+         for s in write_steps:
+             args = s["arguments"]
+             sheet = args.get("sheet", "")
+             cell = args.get("cell", args.get("start_cell", ""))
+             unique_targets.add(f"{sheet}:{cell}")
+         min_edits = len(unique_targets)
+         actual_edits = len(write_steps)
+         return min(min_edits / actual_edits, 1.0)
+
+     def _score_invalid_edit_penalty(self) -> float:
+         """Fraction of writes that targeted non-output areas (0.0 = no invalid edits)."""
+         write_steps = [s for s in self._steps if s["tool_name"] in WRITE_TOOLS and s["success"]]
+         if not write_steps:
+             return 0.0
+         invalid = sum(
+             1 for s in write_steps
+             if isinstance(s.get("result"), dict) and s["result"].get("outside_target")
+         )
+         return invalid / len(write_steps)
+
+     def _score_structural_integrity(self) -> float:
+         """
+         Did the agent preserve existing correct data?
+
+         We check whether the final submit had any checks with 'overwrite_detected'
+         failures. If no destructive overwrites were detected, score = 1.0.
+         """
+         submit = self._get_last_submit_result()
+         if not submit:
+             return 0.5
+         details = submit.get("details", [])
+         if not details:
+             return 1.0
+         total = len(details)
+         overwrites = sum(1 for d in details if d.get("overwrite_detected"))
+         return 1.0 - (overwrites / total)
+
+     def _score_debugging_quality(self) -> float:
+         """Evidence of reading before writing and inspecting formulas before modifying."""
+         if not self._steps:
+             return 0.0
+
+         score = 0.0
+         components = 0
+
+         read_before_write = self._has_read_before_write_pattern()
+         components += 1
+         score += 1.0 if read_before_write else 0.0
+
+         inspect_count = sum(1 for s in self._steps if s["tool_name"] == "inspect_formula")
+         if inspect_count > 0:
+             components += 1
+             score += 1.0
+
+         validate_count = sum(1 for s in self._steps if s["tool_name"] == "validate_partial")
+         if validate_count > 0:
+             components += 1
+             score += 1.0
+
+         list_sheets = any(s["tool_name"] == "list_sheets" for s in self._steps)
+         if list_sheets:
+             components += 1
+             score += 1.0
+
+         return score / max(components, 1)
+
+     def _has_read_before_write_pattern(self) -> bool:
+         """Check if at least one write was preceded by a read within 4 steps."""
+         for i, s in enumerate(self._steps):
+             if s["tool_name"] not in WRITE_TOOLS:
+                 continue
+             lookback = self._steps[max(0, i - 4):i]
+             if any(p["tool_name"] in READ_TOOLS for p in lookback):
+                 return True
+         return False
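The lookback used by `_has_read_before_write_pattern` is simple enough to demonstrate on tool names alone. A self-contained sketch of the same sliding-window idea (toy trajectories, not taken from the commit):

```python
WRITE_TOOLS = {"write_cell", "write_range"}
READ_TOOLS = {"read_range", "read_cell"}

def has_read_before_write(steps, window=4):
    # A write counts as "informed" if any of the preceding `window`
    # steps was a read; one such write is enough for the whole episode.
    for i, tool in enumerate(steps):
        if tool not in WRITE_TOOLS:
            continue
        if any(p in READ_TOOLS for p in steps[max(0, i - window):i]):
            return True
    return False

print(has_read_before_write(["list_sheets", "read_range", "write_cell"]))  # True
print(has_read_before_write(["list_sheets", "write_cell"]))                # False
```

Because one informed write satisfies the whole check, an agent that reads once and then writes blindly ten times still earns this debugging-quality component.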
rewards/transforms.py ADDED
@@ -0,0 +1,191 @@
+ """
+ Spreadsheet per-step reward transform (Layer 3).
+
+ Used when: --reward-mode openenv --gym spreadsheet
+
+ Scoring by tool:
+
+ read_range / read_cell:
+     Successful read → +0.02 (neutral, slightly positive)
+
+ write_cell / write_range:
+     Write to target region → +0.05
+     Write preceded by recent read (≤4 steps) → +0.05 (on top of base)
+     Write outside target region → -0.10
+     Repeated write to same cell (≥3) → -0.05
+
+ inspect_formula:
+     Successful → +0.05
+
+ validate_partial:
+     Called successfully → +0.05
+     Shows improvement over last validate → +0.10
+
+ submit_workbook:
+     All checks pass (pass_rate == 1.0) → +0.50
+     Partial pass (pass_rate > 0.5) → +0.20
+     Mostly failing (pass_rate < 0.3) → -0.10
+     No prior validate_partial call → -0.05 (unsupported submission)
+
+ list_sheets / list_scenarios / get_session_info / load_scenario /
+ get_edit_history / reset_scenario / list_named_targets:
+     Successful → 0.0 (neutral)
+
+ Error on any non-neutral tool → -0.05
+ """
+
+ from __future__ import annotations
+
+ import json
+ from typing import Any
+
+ from openenv.core.env_server.mcp_types import CallToolObservation
+ from openenv.core.env_server.types import Observation
+
+ from .base import StepRewardTransform
+
+ WRITE_TOOLS = frozenset({"write_cell", "write_range"})
+ READ_TOOLS = frozenset({"read_range", "read_cell"})
+ NEUTRAL_TOOLS = frozenset({
+     "list_sheets", "list_scenarios", "get_session_info",
+     "load_scenario", "get_edit_history", "reset_scenario",
+     "list_named_targets", "list_tools",
+ })
+
+
+ def _extract_result(observation) -> Any:
+     result = getattr(observation, "result", None)
+     if hasattr(result, "data"):
+         return result.data
+     if isinstance(result, dict) and "data" in result:
+         return result["data"]
+     if isinstance(result, str):
+         try:
+             return json.loads(result)
+         except (json.JSONDecodeError, TypeError):
+             return result
+     return result
+
+
+ class SpreadsheetStepTransform(StepRewardTransform):
+     """Per-step reward for Spreadsheet gym (Layer 3, trajectory-aware)."""
+
+     def __init__(self, scenario: dict | None = None):
+         super().__init__()
+         self._scenario = scenario or {}
+         self._recent_tools: list[str] = []
+         self._write_counts: dict[str, int] = {}
+         self._last_validate_passed: int = 0
+         self._has_validated: bool = False
+
+     def set_scenario(self, scenario: Any) -> None:
+         """Set scenario context (called by runner at start of each scenario)."""
+         if hasattr(scenario, "id"):
+             self._scenario = {"id": scenario.id}
+         elif isinstance(scenario, dict):
+             self._scenario = scenario
+         self._recent_tools = []
+         self._write_counts = {}
+         self._last_validate_passed = 0
+         self._has_validated = False
+
+     def _compute_reward(self, observation: Observation) -> float:
+         if not isinstance(observation, CallToolObservation):
+             return 0.0
+
+         tool_name = getattr(observation, "tool_name", "") or ""
+         result = _extract_result(observation)
+         if not isinstance(result, dict):
+             result = {}
+
+         has_error = (
+             observation.error is not None
+             or (isinstance(result, dict) and result.get("error"))
+         )
+
+         if tool_name in NEUTRAL_TOOLS:
+             self._recent_tools.append(tool_name)
+             return 0.0
+
+         if has_error:
+             self._recent_tools.append(tool_name)
+             return -0.05
+
+         reward = self._score_tool(tool_name, result)
+         self._recent_tools.append(tool_name)
+         return reward
+
+     def _score_tool(self, tool_name: str, result: dict) -> float:
+         if tool_name in READ_TOOLS:
+             return 0.02
+
+         if tool_name == "inspect_formula":
+             return 0.05
+
+         if tool_name in WRITE_TOOLS:
+             return self._score_write(tool_name, result)
+
+         if tool_name == "validate_partial":
+             return self._score_validate(result)
+
+         if tool_name == "submit_workbook":
+             return self._score_submit(result)
+
+         return 0.0
+
+     def _score_write(self, tool_name: str, result: dict) -> float:
+         outside_target = result.get("outside_target", False)
+         if outside_target:
+             return -0.10
+
+         reward = 0.05
+
+         lookback = self._recent_tools[-4:]
+         if any(t in READ_TOOLS for t in lookback):
+             reward += 0.05
+
+         cell_key = f"{result.get('sheet', '')}:{result.get('cell', result.get('start_cell', ''))}"
+         self._write_counts[cell_key] = self._write_counts.get(cell_key, 0) + 1
+         if self._write_counts[cell_key] >= 3:
+             reward -= 0.05
+
+         return reward
+
+     def _score_validate(self, result: dict) -> float:
+         self._has_validated = True
+         new_passed = result.get("passed", 0)
+         if new_passed > self._last_validate_passed:
+             self._last_validate_passed = new_passed
+             return 0.10
+         self._last_validate_passed = new_passed
+         return 0.05
+
+     def _score_submit(self, result: dict) -> float:
+         pass_rate = result.get("pass_rate", 0)
+         reward = 0.0
+
+         if pass_rate == 1.0:
+             reward = 0.50
+         elif pass_rate > 0.5:
+             reward = 0.20
+         elif pass_rate < 0.3:
+             reward = -0.10
+
+         if not self._has_validated:
+             reward -= 0.05
+
+         return reward
+
+
+ def transform(trajectory: list, scenario: dict) -> list:
+     """
+     Apply per-step rewards to trajectory (used by run_eval transform_factory).
+
+     Returns trajectory with each step's reward populated.
+     """
+     t = SpreadsheetStepTransform(scenario=scenario)
+     for step in trajectory:
+         if hasattr(step, "observation"):
+             obs = step.observation
+             step.reward = t._compute_reward(obs)
+     return trajectory
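The `submit_workbook` thresholds in the docstring can be checked with a standalone sketch of `_score_submit` (thresholds copied from the module above; `has_validated` mirrors the unsupported-submission penalty):

```python
# Sketch of the submit_workbook scoring: pass_rate buckets, plus a -0.05
# penalty when the agent never called validate_partial before submitting.
def score_submit(pass_rate: float, has_validated: bool) -> float:
    reward = 0.0
    if pass_rate == 1.0:
        reward = 0.50
    elif pass_rate > 0.5:
        reward = 0.20
    elif pass_rate < 0.3:
        reward = -0.10
    if not has_validated:
        reward -= 0.05
    return reward

print(score_submit(1.0, True))               # 0.5
print(score_submit(0.6, True))               # 0.2
print(round(score_submit(0.1, False), 2))    # -0.15
```

Note the gap: a pass_rate in [0.3, 0.5] earns neither bonus nor penalty.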
run_eval.py ADDED
@@ -0,0 +1,820 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Evaluation Runner β€” run an LLM agent against Spreadsheet gym scenarios.
4
+
5
+ Single-gym version of the repo-level run_eval.py, tailored for the
6
+ spreadsheet environment. No --gym flag needed.
7
+
8
+ Usage:
9
+ # Single model
10
+ python run_eval.py --model gpt-5.4 --save --trajectory
11
+
12
+ # Multiple models in parallel
13
+ python run_eval.py --model gpt-5.4,claude-sonnet-4-6,claude-opus-4-6 --parallel 3 --save --trajectory
14
+
15
+ # Specific scenario
16
+ python run_eval.py --model gpt-5.4 --scenario formula_repair_01
17
+
18
+ # OpenEnV per-step reward mode
19
+ python run_eval.py --model gpt-5.4 --reward-mode openenv --save --trajectory
20
+
21
+ Prerequisites:
22
+ 1. pip install -e .
23
+ 2. docker build -t openenv-spreadsheet -f server/Dockerfile .
24
+ 3. docker run -d --name spreadsheet -p 8000:8000 openenv-spreadsheet
25
+ """
26
+
27
+ import argparse
28
+ import json
29
+ import logging
30
+ import os
31
+ import sys
32
+ import time
33
+ from concurrent.futures import ThreadPoolExecutor, as_completed
34
+ from datetime import datetime, timezone, timedelta
35
+ from typing import Any, Dict, List
36
+
37
+ IST = timezone(timedelta(hours=5, minutes=30))
38
+
39
+ from dotenv import load_dotenv
40
+
41
+ load_dotenv(os.path.join(os.path.dirname(__file__), ".env"))
42
+
43
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
44
+
45
+ from openenv import AutoEnv
46
+
47
+ from agent.runner import AgentRunner
48
+ from rewards.base import RewardBreakdown
49
+ from rewards.checks import SpreadsheetChecker
50
+ from rewards.transforms import SpreadsheetStepTransform
51
+ from scenarios.definitions import SPREADSHEET_SCENARIOS
52
+
53
+ logger = logging.getLogger(__name__)
54
+
55
+ GYM_NAME = "spreadsheet"
56
+ OUTPUT_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "outputs")
57
+
58
+
59
+ def _resolve_base_url() -> str:
60
+ import importlib.resources
61
+ import yaml
62
+
63
+ try:
64
+ ref = importlib.resources.files(GYM_NAME).joinpath("openenv.yaml")
65
+ with importlib.resources.as_file(ref) as f:
66
+ manifest = yaml.safe_load(f.read_text())
67
+ port = manifest.get("port", 8000)
68
+ return f"http://localhost:{port}"
69
+ except Exception:
70
+ logger.warning("Could not read openenv.yaml, defaulting to port 8000")
71
+ return "http://localhost:8000"
72
+
73
+
74
+ def _fetch_gym_metadata(base_url: str) -> dict | None:
75
+ import httpx
76
+
77
+ try:
78
+ resp = httpx.get(f"{base_url}/metadata", timeout=5.0)
79
+ resp.raise_for_status()
80
+ data = resp.json()
81
+ data.pop("readme_content", None)
82
+ return data
83
+ except Exception as e:
84
+ logger.debug(f"Failed to fetch /metadata from {base_url}: {e}")
85
+ return None
86
+
87
+
88
+ def divider(text: str = ""):
89
+ print(f"\n{'=' * 70}")
90
+ if text:
91
+ print(f" {text}")
92
+ print(f"{'=' * 70}")
93
+
94
+
95
+ def print_breakdown(breakdown: RewardBreakdown):
96
+ print(breakdown.summary())
97
+ print()
98
+ print(f" Details: {breakdown.details}")
99
+
100
+
101
+ def save_results_to_markdown(
102
+ results: List[Dict[str, Any]],
103
+ model: str,
104
+ output_path: str,
105
+ total_elapsed: float,
106
+ temperature: float,
107
+ run_id: str = "",
108
+ reward_mode: str = "custom",
109
+ gym_version: str = "unknown",
110
+ ):
111
+ os.makedirs(os.path.dirname(output_path), exist_ok=True)
112
+
113
+ timestamp = datetime.now(IST).strftime("%Y-%m-%d %H:%M:%S")
114
+ is_new_file = not os.path.exists(output_path)
115
+
116
+ with open(output_path, "a") as f:
117
+ if is_new_file:
118
+ f.write(f"# Spreadsheet Gym β€” Evaluation Results\n\n")
119
+ f.write(f"**Run ID**: `{run_id}` \n")
120
+ f.write(f"**Gym Version**: `{gym_version}`\n\n")
121
+ f.write(f"Evaluation results for the **spreadsheet** gym across different LLM models.\n\n")
122
+ if reward_mode == "openenv":
123
+ f.write(f"**Reward Mode**: `openenv` β€” per-step rewards from `rewards/transforms.py` + ground truth\n\n")
124
+ f.write(f"Each model is evaluated on the same set of scenarios. ")
125
+ f.write(f"Rewards are computed using OpenEnv transforms:\n")
126
+ f.write(f"- **Quality** (0.25) β€” fraction of productive steps\n")
127
+ f.write(f"- **Ground Truth** (0.60) β€” episode outcome checks\n")
128
+ f.write(f"- **Efficiency** (0.15) β€” step budget usage\n")
129
+ f.write(f"- **Hallucination Penalty** β€” tools say success but ground truth disagrees\n\n")
130
+ else:
131
+ f.write(f"**Reward Mode**: `custom` β€” episode-level rewards from `rewards/base.py`\n\n")
132
+ f.write(f"Each model is evaluated on the same set of scenarios. ")
133
+ f.write(f"Rewards are computed by `rewards/base.py` using:\n")
134
+ f.write(f"- **Structural** (0.25) β€” right tools called, no errors\n")
135
+ f.write(f"- **Ground Truth** (0.60) β€” episode outcome checks\n")
136
+ f.write(f"- **Efficiency** (0.15) β€” solved in reasonable steps\n")
137
+ f.write(f"- **Hallucination Penalty** β€” tools say success but ground truth disagrees\n\n")
138
+ f.write(f"Trajectories: `outputs/trajectories/{run_id}/`\n\n")
139
+ f.write(f"---\n\n")
140
+
141
+ safe_model = model.replace("/", "_").replace(":", "_")
142
+ f.write(f"## Model: `{model}`\n\n")
143
+ f.write(f"- **Date**: {timestamp}\n")
144
+ f.write(f"- **Temperature**: {temperature}\n")
145
+ f.write(f"- **Reward Mode**: {reward_mode}\n")
146
+ f.write(f"- **Total Time**: {total_elapsed:.1f}s\n")
147
+ f.write(f"- **Trajectory**: `outputs/trajectories/{run_id}/{safe_model}.json`\n\n")
148
+
149
+ if reward_mode == "openenv":
150
+ f.write(f"| Scenario | Quality | Ground Truth | Penalty | **Total** | Steps | Time |\n")
151
+ f.write(f"|---|:---:|:---:|:---:|:---:|:---:|:---:|\n")
152
+ else:
153
+ f.write(f"| Scenario | Structural | Ground Truth | Efficiency | Penalty | **Total** | Steps | Time |\n")
154
+ f.write(f"|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n")
155
+
156
+ total_reward = 0.0
157
+ for r in results:
158
+ bd = r.get("breakdown")
159
+ if bd:
160
+ if reward_mode == "openenv":
161
+ f.write(
162
+ f"| {r['scenario']} "
163
+ f"| {bd.structural:.2f} "
164
+ f"| {bd.ground_truth:.2f} "
165
+ f"| {bd.penalty:.2f} "
166
+ f"| **{bd.total:.2f}** "
167
+ f"| {r['steps']} "
168
+ f"| {r['elapsed']:.1f}s |\n"
169
+ )
170
+ else:
171
+ f.write(
172
+ f"| {r['scenario']} "
173
+ f"| {bd.structural:.2f} "
174
+ f"| {bd.ground_truth:.2f} "
175
+ f"| {bd.efficiency:.2f} "
176
+ f"| {bd.penalty:.2f} "
177
+ f"| **{bd.total:.2f}** "
178
+ f"| {r['steps']} "
179
+ f"| {r['elapsed']:.1f}s |\n"
180
+ )
181
+ total_reward += bd.total
182
+ else:
183
+ cols = "| β€” | β€” | β€” " if reward_mode == "openenv" else "| β€” | β€” | β€” | β€” "
184
+ f.write(
185
+ f"| {r['scenario']} "
186
+ f"{cols}"
187
+ f"| **ERROR** "
188
+ f"| {r['steps']} "
189
+ f"| {r['elapsed']:.1f}s |\n"
190
+ )
191
+
192
+ avg = total_reward / len(results) if results else 0.0
193
+ f.write(f"\n**Average Reward: {avg:.2f}**\n\n")
194
+ f.write(f"---\n\n")
195
+
196
+ logger.info(f"Results saved to {output_path}")
197
+
198
+
199
+ def save_trajectory(
200
+ results: List[Dict[str, Any]],
201
+ scenarios: list,
202
+ model: str,
203
+ temperature: float,
204
+ total_elapsed: float,
205
+ run_id: str = "",
206
+ reward_mode: str = "custom",
207
+ gym_version: str = "unknown",
208
+ ):
209
+ run_ts = datetime.now(IST).isoformat()
210
+
211
+ safe_model = model.replace("/", "_").replace(":", "_")
212
+ filename = f"{safe_model}.json"
213
+
214
+ traj_dir = os.path.join(OUTPUT_DIR, "trajectories", run_id)
215
+ os.makedirs(traj_dir, exist_ok=True)
216
+ filepath = os.path.join(traj_dir, filename)
217
+
218
+ trajectory = {
219
+ "run_id": run_id or "untagged",
220
+ "model": model,
221
+ "gym": GYM_NAME,
222
+ "gym_version": gym_version,
223
+ "timestamp": run_ts,
224
+ "temperature": temperature,
225
+ "reward_mode": reward_mode,
226
+ "total_elapsed_s": round(total_elapsed, 2),
227
+ "total_scenarios": len(results),
228
+ "scenarios": [],
229
+ }
230
+
231
+ for r, scenario in zip(results, scenarios):
232
+ scenario_entry = {
233
+ "scenario_id": scenario.id,
234
+ "prompt": scenario.prompt,
235
+ "expected_tools": scenario.expected_tools,
236
+ "max_steps": scenario.max_steps,
237
+ "elapsed_s": round(r["elapsed"], 2),
238
+ }
239
+
240
+ episode = r.get("episode")
241
+ if episode:
242
+ steps = []
243
+ for i, step in enumerate(episode.steps, 1):
244
+ result_data = step.result
245
+ if isinstance(result_data, str):
246
+ try:
247
+ result_data = json.loads(result_data)
248
+ except (json.JSONDecodeError, TypeError):
249
+ pass
250
+
251
+ steps.append({
252
+ "step": i,
253
+ "timestamp": step.timestamp,
254
+ "tool_name": step.tool_name,
255
+ "arguments": step.arguments,
256
+ "success": step.success,
257
+ "result": result_data,
258
+ "error": step.error,
259
+ "elapsed_s": round(step.elapsed, 3),
260
+ })
261
+ scenario_entry["steps"] = steps
262
+ scenario_entry["total_steps"] = len(steps)
263
+ else:
264
+ scenario_entry["steps"] = []
265
+ scenario_entry["total_steps"] = 0
266
+ scenario_entry["error"] = r.get("error", "Unknown error")
267
+
268
+ outcome_results = r.get("outcome_results", [])
269
+ checks = []
270
+ for check_def, passed in zip(scenario.outcome_checks, outcome_results):
271
+ checks.append({
272
+ "check": check_def,
273
+ "passed": passed,
274
+ })
275
+ scenario_entry["outcome_checks"] = checks
276
+
277
+ bd = r.get("breakdown")
278
+ if bd:
279
+ scenario_entry["reward"] = {
280
+ "structural": round(bd.structural, 4),
281
+ "ground_truth": round(bd.ground_truth, 4),
282
+ "efficiency": round(bd.efficiency, 4),
283
+ "penalty": round(bd.penalty, 4),
284
+ "total": round(bd.total, 4),
285
+ }
286
+ else:
287
+ scenario_entry["reward"] = None
288
+
289
+ trajectory["scenarios"].append(scenario_entry)
290
+
291
+ totals = [s["reward"]["total"] for s in trajectory["scenarios"] if s.get("reward")]
292
+ trajectory["avg_reward"] = round(sum(totals) / len(totals), 4) if totals else 0.0
293
+
294
+ with open(filepath, "w") as f:
295
+ json.dump(trajectory, f, indent=2, default=str)
296
+
297
+ print(f"\n Trajectory saved: {filepath}")
298
+ logger.info(f"Trajectory saved to {filepath}")
299
+ return filepath
300
+
301
+
302
+ # ── Model Workers ──
303
+
304
+ def _run_single_model(
305
+ model: str,
306
+ base_url: str,
307
+ scenarios: list,
308
+ temperature: float,
309
+ max_tokens: int,
310
+ reward_mode: str,
311
+ run_id: str,
312
+ save: bool,
313
+ trajectory: bool,
314
+ verbose: bool,
315
+ gym_version: str = "unknown",
316
+ ) -> Dict[str, Any]:
317
+ model_start = time.time()
318
+ model_results = []
319
+
320
+ def _connect():
321
+ client = AutoEnv.from_env(GYM_NAME, base_url=base_url)
322
+ client.__enter__()
323
+ xform = SpreadsheetStepTransform() if reward_mode == "openenv" else None
324
+ rnr = AgentRunner(
325
+ model=model,
326
+ env_client=client,
327
+ temperature=temperature,
328
+ max_tokens=max_tokens,
329
+ reward_mode=reward_mode,
330
+ transform=xform,
331
+ )
332
+ return client, rnr
333
+
334
+ env_client, runner = _connect()
335
+ checker = SpreadsheetChecker()
336
+
337
+ WS_RETRY_ERRORS = ("ConnectionClosed", "ConnectionClosedOK", "ConnectionClosedError", "sent 1000")
338
+ MAX_WS_RETRIES = 3
339
+
340
+ try:
341
+ for i, scenario in enumerate(scenarios, 1):
342
+ print(f"\n [{model}] Scenario {i}/{len(scenarios)}: {scenario.id}")
343
+
344
+ start = time.time()
345
+ last_error = None
346
+ for attempt in range(MAX_WS_RETRIES + 1):
347
+ try:
348
+ if attempt > 0:
349
+ logger.info(f"[{model}] Reconnecting (attempt {attempt + 1}) for {scenario.id}")
350
+ print(f" [{model}] Reconnecting WebSocket (attempt {attempt + 1})...")
351
+ try:
352
+ env_client.__exit__(None, None, None)
353
+ except Exception:
354
+ pass
355
+ time.sleep(2 * attempt)
356
+ env_client, runner = _connect()
357
+
358
+ episode, breakdown = runner.run_scenario(scenario, checker)
359
+ elapsed = time.time() - start
360
+
361
+ if hasattr(checker, "set_episode"):
362
+ checker.set_episode(episode)
363
+
364
+ outcome_results = checker.check_all(scenario.outcome_checks)
365
+
366
+ model_results.append({
367
+ "scenario": scenario.id,
368
+ "total_reward": breakdown.total,
369
+ "breakdown": breakdown,
370
+ "steps": len(episode.steps),
371
+ "elapsed": elapsed,
372
+ "episode": episode,
373
+ "outcome_results": outcome_results,
374
+ })
375
+
376
+ print(f" [{model}] {scenario.id}: {breakdown.total:.2f} ({len(episode.steps)} steps, {elapsed:.1f}s)")
377
+ last_error = None
378
+ break
379
+
380
+ except Exception as e:
381
+ last_error = e
382
+ is_ws_error = any(tok in type(e).__name__ or tok in str(e) for tok in WS_RETRY_ERRORS)
383
+ if is_ws_error and attempt < MAX_WS_RETRIES:
384
+ logger.warning(f"[{model}] WebSocket error on {scenario.id}: {e}")
385
+ continue
386
+ raise
387
+
388
+ if last_error is not None:
389
+ elapsed = time.time() - start
390
+ logger.exception(f"[{model}] Scenario {scenario.id} failed")
391
+ model_results.append({
392
+ "scenario": scenario.id,
393
+ "total_reward": 0.0,
394
+ "breakdown": None,
395
+ "steps": 0,
396
+ "elapsed": elapsed,
397
+ "error": str(last_error),
398
+ })
399
+ print(f" [{model}] {scenario.id}: ERROR - {last_error}")
400
+
401
+ finally:
402
+ try:
403
+ env_client.__exit__(None, None, None)
404
+ except Exception:
405
+ pass
406
+
407
+ model_elapsed = time.time() - model_start
408
+
409
+ if save:
410
+ output_path = os.path.join(OUTPUT_DIR, "results", f"{run_id}.md")
411
+ save_results_to_markdown(
412
+ results=model_results,
413
+ model=model,
414
+ output_path=output_path,
415
+ total_elapsed=model_elapsed,
416
+ temperature=temperature,
417
+ run_id=run_id,
418
+ reward_mode=reward_mode,
419
+ gym_version=gym_version,
420
+ )
421
+
422
+ if trajectory:
423
+ save_trajectory(
424
+ results=model_results,
425
+ scenarios=scenarios,
426
+ model=model,
427
+ temperature=temperature,
428
+ total_elapsed=model_elapsed,
429
+ run_id=run_id,
430
+ reward_mode=reward_mode,
431
+ gym_version=gym_version,
432
+ )
433
+
434
+ return {
435
+ "model": model,
436
+ "results": model_results,
437
+ "elapsed": model_elapsed,
438
+ }
439
+
440
+
441
+ def _run_single_model_detailed(
442
+ model: str,
443
+ base_url: str,
444
+ scenarios: list,
445
+ temperature: float,
446
+ max_tokens: int,
447
+ reward_mode: str,
448
+ run_id: str,
449
+ save: bool,
450
+ trajectory: bool,
451
+ gym_version: str = "unknown",
452
+ ) -> Dict[str, Any]:
453
+ model_start = time.time()
454
+ results = []
455
+
456
+ env_client = AutoEnv.from_env(GYM_NAME, base_url=base_url)
457
+ env_client.__enter__()
458
+
459
+ checker = SpreadsheetChecker()
460
+
461
+ transform = SpreadsheetStepTransform() if reward_mode == "openenv" else None
462
+
463
+ runner = AgentRunner(
464
+ model=model,
465
+ env_client=env_client,
466
+ temperature=temperature,
467
+ max_tokens=max_tokens,
468
+ reward_mode=reward_mode,
469
+ transform=transform,
470
+ )
471
+
472
+ try:
473
+ for i, scenario in enumerate(scenarios, 1):
474
+ divider(f"Scenario {i}/{len(scenarios)}: {scenario.id}")
475
+ print(f" Prompt: {scenario.prompt[:120]}...")
476
+ print(f" Expected tools: {scenario.expected_tools}")
477
+ print(f" Max steps: {scenario.max_steps}")
478
+ print()
479
+
480
+ start = time.time()
481
+ try:
482
+ episode, breakdown = runner.run_scenario(scenario, checker)
483
+ elapsed = time.time() - start
484
+
485
+ print()
486
+ print(" -- Agent Actions --")
487
+ for step in episode.steps:
488
+ status = "OK" if step.success else "FAIL"
489
+ args_str = _short_json(step.arguments)
490
+ print(f" [{status}] {step.tool_name}({args_str})")
491
+ print(f" Steps taken: {len(episode.steps)}")
492
+
493
+ if hasattr(checker, "set_episode"):
494
+ checker.set_episode(episode)
495
+
496
+ print()
497
+ print(" -- Ground Truth Verification --")
498
+ outcome_results = checker.check_all(scenario.outcome_checks)
499
+ for check, score in zip(scenario.outcome_checks, outcome_results):
500
+ status = "PASS" if score else "FAIL"
501
+ label = _check_label(check)
502
+ print(f" [{status}] {check['type']}: {label}")
503
+
504
+ print()
505
+ print(" -- Reward Breakdown --")
506
+ print_breakdown(breakdown)
507
+ print(f"\n Completed in {elapsed:.1f}s")
508
+
509
+ results.append({
510
+ "scenario": scenario.id,
511
+ "total_reward": breakdown.total,
512
+ "breakdown": breakdown,
513
+ "steps": len(episode.steps),
514
+ "elapsed": elapsed,
515
+ "episode": episode,
516
+ "outcome_results": outcome_results,
517
+ })
518
+
519
+ except Exception as e:
520
+ elapsed = time.time() - start
521
+ print(f"\n ERROR: {e}")
522
+ logger.exception(f"Scenario {scenario.id} failed")
523
+ results.append({
524
+ "scenario": scenario.id,
525
+ "total_reward": 0.0,
526
+ "breakdown": None,
527
+ "steps": 0,
528
+ "elapsed": elapsed,
529
+ "error": str(e),
530
+ })
531
+
532
+ finally:
533
+ env_client.__exit__(None, None, None)
534
+ logger.info("AutoEnv client disconnected.")
535
+
536
+ model_elapsed = time.time() - model_start
537
+
538
+ if save:
539
+ output_path = os.path.join(OUTPUT_DIR, "results", f"{run_id}.md")
540
+ save_results_to_markdown(
541
+ results=results,
542
+ model=model,
543
+ output_path=output_path,
544
+ total_elapsed=model_elapsed,
545
+ temperature=temperature,
546
+ run_id=run_id,
547
+ reward_mode=reward_mode,
548
+ gym_version=gym_version,
549
+ )
550
+ print(f"\n Results saved: {output_path}")
551
+
552
+ if trajectory:
553
+ save_trajectory(
554
+ results=results,
555
+ scenarios=scenarios,
556
+ model=model,
557
+ temperature=temperature,
558
+ total_elapsed=model_elapsed,
559
+ run_id=run_id,
560
+ reward_mode=reward_mode,
561
+ gym_version=gym_version,
562
+ )
563
+
564
+ return {
565
+ "model": model,
566
+ "results": results,
567
+ "elapsed": model_elapsed,
568
+ }
569
+
570
+
571
+ def _check_label(check: dict) -> str:
572
+ for key in ("min_score", "min_pct", "max_hits"):
573
+ if key in check and key != "type":
574
+ return str(check[key])
575
+ return check.get("type", "?")
576
+
577
+
578
+ def _short_json(obj, max_len=80):
579
+ s = json.dumps(obj, default=str)
580
+ return s if len(s) <= max_len else s[:max_len] + "..."
581
+
582
+
583
+ def main():
584
+ parser = argparse.ArgumentParser(
585
+ description="Evaluate an LLM agent against Spreadsheet gym scenarios.",
586
+ formatter_class=argparse.RawDescriptionHelpFormatter,
587
+ epilog="""
588
+ Examples:
589
+ python run_eval.py --model gpt-5.4 --save --trajectory
590
+ python run_eval.py --model gpt-5.4,claude-sonnet-4-6 --parallel 2 --reward-mode openenv
591
+ python run_eval.py --model gpt-5.4 --scenario formula_repair_01
592
+ """,
593
+ )
594
+ parser.add_argument(
595
+ "--model",
596
+ default=os.getenv("LLM_MODEL", "gpt-4o"),
597
+ help="LiteLLM model string, or comma-separated for parallel mode "
598
+ "(e.g., 'gpt-5.4' or 'gpt-5.4,claude-sonnet-4-6')",
599
+ )
600
+ parser.add_argument(
601
+ "--scenario",
602
+ default=None,
603
+ help="Run a specific scenario by ID (default: run all 12)",
604
+ )
605
+ parser.add_argument(
606
+ "--temperature",
607
+ type=float,
608
+ default=float(os.getenv("LLM_TEMPERATURE", "0.0")),
609
+ help="LLM sampling temperature (default: 0.0)",
610
+ )
611
+ parser.add_argument(
612
+ "--max-tokens",
613
+ type=int,
614
+ default=int(os.getenv("LLM_MAX_TOKENS", "1024")),
615
+ help="Max tokens per LLM response (default: 1024)",
616
+ )
617
+ parser.add_argument(
618
+ "--save",
619
+ action="store_true",
620
+ help="Save results to outputs/results/<run_id>.md",
621
+ )
622
+ parser.add_argument(
623
+ "--trajectory",
624
+ action="store_true",
625
+ help="Save detailed trajectory JSON to outputs/trajectories/<run_id>/",
626
+ )
627
+ parser.add_argument(
628
+ "--run-id",
629
+ default=None,
630
+ help="Run identifier (default: auto-generated as run_YYYYMMDD_HHMM)",
631
+ )
632
+ parser.add_argument(
633
+ "--reward-mode",
634
+ default="custom",
635
+ choices=["custom", "openenv"],
636
+ help="Reward mode: 'custom' (episode-level) or 'openenv' (per-step). Default: custom",
637
+ )
638
+ parser.add_argument(
639
+ "--parallel",
640
+ type=int,
641
+ default=1,
642
+ help="Number of models to evaluate in parallel (default: 1 = sequential)",
643
+ )
644
+ parser.add_argument(
645
+ "--verbose", "-v",
646
+ action="store_true",
647
+ help="Enable debug logging",
648
+ )
649
+
650
+ args = parser.parse_args()
651
+
652
+ models = [m.strip() for m in args.model.split(",") if m.strip()]
653
+
654
+ if args.run_id:
655
+ run_id = args.run_id
656
+ else:
657
+ run_id = f"run_{datetime.now(IST).strftime('%Y%m%d_%H%M')}"
658
+
659
+ log_level = logging.DEBUG if args.verbose else logging.INFO
660
+ logging.basicConfig(
661
+ level=log_level,
662
+ format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
663
+ datefmt="%H:%M:%S",
664
+ )
665
+
666
+ base_url = _resolve_base_url()
667
+
668
+ scenarios = SPREADSHEET_SCENARIOS
669
+ if args.scenario:
670
+ scenarios = [s for s in scenarios if s.id == args.scenario]
671
+ if not scenarios:
672
+ available = [s.id for s in SPREADSHEET_SCENARIOS]
673
+ print(f"Error: Scenario '{args.scenario}' not found. Available: {available}")
674
+ sys.exit(1)
675
+
676
+ divider("AutoEnv Discovery")
677
+ print(f" Discovering gym '{GYM_NAME}' via AutoEnv...")
678
+ env_info = AutoEnv.get_env_info(GYM_NAME)
679
+ print(f" Found: {env_info['name']} (package: {env_info['package']}, v{env_info['version']})")
680
+ print(f" Base URL: {base_url} (auto-derived from openenv.yaml)")
681
+
682
+ gym_metadata = _fetch_gym_metadata(base_url)
683
+ if gym_metadata:
684
+ print(f"\n -- Environment Metadata (GET {base_url}/metadata) --")
685
+ print(f" Name: {gym_metadata.get('name', 'N/A')}")
686
+ print(f" Version: {gym_metadata.get('version', 'N/A')}")
687
+ print(f" Description: {gym_metadata.get('description', 'N/A')}")
688
+ else:
689
+ print(f"\n Warning: Could not fetch /metadata from {base_url} (server may not be running)")
690
+
691
+ is_parallel = args.parallel > 1 and len(models) > 1
692
+ mode_str = f"Parallel ({args.parallel} workers)" if is_parallel else "Sequential"
693
+ gym_version = gym_metadata.get("version", "unknown") if gym_metadata else "unknown"
694
+
695
+ divider("LLM Evaluation Run")
696
+ print(f" Gym: {GYM_NAME} (v{gym_version})")
697
+ print(f" Models: {', '.join(models)}")
698
+ print(f" Run ID: {run_id}")
699
+ print(f" Mode: {mode_str}")
700
+ print(f" Base URL: {base_url}")
701
+ print(f" Scenarios: {len(scenarios)} of {len(SPREADSHEET_SCENARIOS)}")
702
+ print(f" Temperature: {args.temperature}")
703
+ print(f" Reward Mode: {args.reward_mode}")
704
+     print(f" Output Dir: {OUTPUT_DIR}")
+
+     total_start = time.time()
+     all_model_results = []
+
+     if is_parallel:
+         divider(f"Parallel Evaluation ({len(models)} models, {args.parallel} workers)")
+
+         max_workers = min(args.parallel, len(models))
+         with ThreadPoolExecutor(max_workers=max_workers) as executor:
+             futures = {}
+             for idx, model in enumerate(models):
+                 if idx > 0:
+                     time.sleep(3)
+                 future = executor.submit(
+                     _run_single_model,
+                     model=model,
+                     base_url=base_url,
+                     scenarios=scenarios,
+                     temperature=args.temperature,
+                     max_tokens=args.max_tokens,
+                     reward_mode=args.reward_mode,
+                     run_id=run_id,
+                     save=args.save,
+                     trajectory=args.trajectory,
+                     verbose=args.verbose,
+                     gym_version=gym_version,
+                 )
+                 futures[future] = model
+
+             for future in as_completed(futures):
+                 model = futures[future]
+                 try:
+                     result = future.result()
+                     all_model_results.append(result)
+                     print(f"\n {model} completed in {result['elapsed']:.1f}s")
+                 except Exception as e:
+                     print(f"\n {model} FAILED: {e}")
+                     logger.exception(f"Model {model} failed")
+                     all_model_results.append({
+                         "model": model,
+                         "results": [],
+                         "elapsed": 0.0,
+                         "error": str(e),
+                     })
+     else:
+         for model in models:
+             if len(models) > 1:
+                 divider(f"Model: {model}")
+
+             if len(models) == 1:
+                 result = _run_single_model_detailed(
+                     model=model,
+                     base_url=base_url,
+                     scenarios=scenarios,
+                     temperature=args.temperature,
+                     max_tokens=args.max_tokens,
+                     reward_mode=args.reward_mode,
+                     run_id=run_id,
+                     save=args.save,
+                     trajectory=args.trajectory,
+                     gym_version=gym_version,
+                 )
+             else:
+                 result = _run_single_model(
+                     model=model,
+                     base_url=base_url,
+                     scenarios=scenarios,
+                     temperature=args.temperature,
+                     max_tokens=args.max_tokens,
+                     reward_mode=args.reward_mode,
+                     run_id=run_id,
+                     save=args.save,
+                     trajectory=args.trajectory,
+                     verbose=args.verbose,
+                     gym_version=gym_version,
+                 )
+             all_model_results.append(result)
+
+     total_elapsed = time.time() - total_start
+     divider("Evaluation Summary")
+
+     for mr in all_model_results:
+         model = mr["model"]
+         results = mr.get("results", [])
+         model_elapsed = mr.get("elapsed", 0.0)
+
+         if not results:
+             print(f"\n Model: {model} -- FAILED ({mr.get('error', 'unknown')})")
+             continue
+
+         total_reward = sum(r["total_reward"] for r in results)
+         avg_reward = total_reward / len(results) if results else 0.0
+
+         print(f"\n Model: {model}")
+         print(f" Time: {model_elapsed:.1f}s")
+         print(f" {'Scenario':<35} {'Reward':>8} {'Steps':>6} {'Time':>6}")
+         print(f" {'-' * 35} {'-' * 8} {'-' * 6} {'-' * 6}")
+
+         for r in results:
+             reward_str = f"{r['total_reward']:.2f}" if r.get("breakdown") else "ERROR"
+             print(f" {r['scenario']:<35} {reward_str:>8} {r['steps']:>6} {r['elapsed']:>5.1f}s")
+
+         print(f" {'-' * 35} {'-' * 8} {'-' * 6} {'-' * 6}")
+         print(f" {'AVERAGE':<35} {avg_reward:>8.2f}")
+
+     if len(models) > 1:
+         print(f"\n Total time (all models): {total_elapsed:.1f}s")
+         if is_parallel:
+             seq_time = sum(mr.get("elapsed", 0.0) for mr in all_model_results)
+             speedup = seq_time / total_elapsed if total_elapsed > 0 else 1.0
+             print(f" Sequential equivalent: {seq_time:.1f}s")
+             print(f" Speedup: {speedup:.1f}x")
+
+
+ if __name__ == "__main__":
+     main()
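The parallel branch above fans each model out to a worker thread and collects results as they finish. A minimal self-contained sketch of that submit/`as_completed` pattern (the stub worker and model names are illustrative, standing in for `_run_single_model`):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_model(model: str) -> dict:
    # Stub worker; the real _run_single_model drives the env for every scenario.
    start = time.time()
    return {"model": model, "results": [{"total_reward": 1.0}], "elapsed": time.time() - start}

models = ["model-a", "model-b", "model-c"]
all_model_results = []
with ThreadPoolExecutor(max_workers=min(2, len(models))) as executor:
    # Map each future back to its model so failures can be attributed.
    futures = {executor.submit(run_model, m): m for m in models}
    for future in as_completed(futures):
        model = futures[future]
        try:
            all_model_results.append(future.result())
        except Exception as e:
            # A crashed worker is recorded, not fatal to the whole batch.
            all_model_results.append({"model": model, "results": [], "error": str(e)})

print(sorted(r["model"] for r in all_model_results))
```

Keeping the `futures[future] = model` mapping is what lets the summary report a per-model failure instead of aborting the whole run.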
scenarios/__init__.py ADDED
@@ -0,0 +1,3 @@
+ from .definitions import SPREADSHEET_SCENARIOS
+
+ __all__ = ["SPREADSHEET_SCENARIOS"]
scenarios/definitions.py ADDED
@@ -0,0 +1,69 @@
+ """Scenario loader for the Spreadsheet gym (used by run_eval.py)."""
+
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+
+ from ..rewards.base import Scenario
+
+ _PKG_ROOT = Path(__file__).resolve().parent.parent
+ _SCENARIOS_DIR = _PKG_ROOT / "scenarios"
+ _HIDDEN_TESTS_DIR = _PKG_ROOT / "workbooks" / "hidden_tests"
+
+ _BASE_TOOLS = ["load_scenario", "list_sheets", "read_range", "submit_workbook"]
+
+ _CATEGORY_TOOLS: dict[str, list[str]] = {
+     "formula_repair": _BASE_TOOLS + ["inspect_formula", "write_cell", "validate_partial"],
+     "cross_sheet_lookup": _BASE_TOOLS + ["write_cell", "write_range"],
+     "messy_table_extraction": _BASE_TOOLS + ["write_range"],
+     "schedule_grid_fill": _BASE_TOOLS + ["write_cell", "write_range", "validate_partial"],
+     "ledger_reconciliation": _BASE_TOOLS + ["write_range"],
+     "range_transformation": _BASE_TOOLS + ["write_range"],
+     "conditional_aggregation": _BASE_TOOLS + ["write_range"],
+     "buggy_template_fix": _BASE_TOOLS + ["inspect_formula", "write_cell", "write_range", "validate_partial"],
+ }
+
+ _DEFAULT_TOOLS = _BASE_TOOLS + ["write_cell", "write_range", "validate_partial"]
+
+
+ def _load_scenarios_from_json() -> list[Scenario]:
+     if not _SCENARIOS_DIR.is_dir():
+         return []
+
+     scenarios = []
+     for f in sorted(_SCENARIOS_DIR.glob("*.json")):
+         data = json.loads(f.read_text(encoding="utf-8"))
+         sid = data.get("id", f.stem)
+         prompt = data.get("instructions", data.get("description", ""))
+         if not prompt.lower().startswith("load scenario"):
+             prompt = f"Load scenario '{sid}'. {prompt}"
+
+         category = data.get("category", "")
+         expected_tools = _CATEGORY_TOOLS.get(category, _DEFAULT_TOOLS)
+
+         outcome_checks = []
+         hidden_test_path = _HIDDEN_TESTS_DIR / f"{sid}.json"
+         if hidden_test_path.is_file():
+             ht = json.loads(hidden_test_path.read_text(encoding="utf-8"))
+             checks = ht.get("checks", [])
+             outcome_checks.append({
+                 "type": "hidden_test_pass_rate",
+                 "total_checks": len(checks),
+                 "min_pass_rate": 0.5,
+             })
+
+         scenarios.append(Scenario(
+             id=sid,
+             prompt=prompt,
+             expected_tools=expected_tools,
+             max_steps=data.get("max_steps", 50),
+             outcome_checks=outcome_checks,
+         ))
+
+     return scenarios
+
+
+ SPREADSHEET_SCENARIOS = _load_scenarios_from_json()
+
+ __all__ = ["SPREADSHEET_SCENARIOS"]
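The loader above prepends a load instruction only when a scenario's text does not already begin with one. A standalone sketch of that normalization step (the function name is hypothetical; the real logic lives inline in `_load_scenarios_from_json`):

```python
def normalize_prompt(sid: str, prompt: str) -> str:
    # Mirror of the prefixing rule: case-insensitive check, then prepend.
    if not prompt.lower().startswith("load scenario"):
        prompt = f"Load scenario '{sid}'. {prompt}"
    return prompt

print(normalize_prompt("formula_repair_01", "Fix the broken SUM formula."))
# -> Load scenario 'formula_repair_01'. Fix the broken SUM formula.
```

Because the check is case-insensitive, prompts that already open with "Load scenario ..." in any casing pass through unchanged, so the prefix is never duplicated.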
server/Dockerfile ADDED
@@ -0,0 +1,41 @@
+ ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+ FROM ${BASE_IMAGE} AS builder
+
+ RUN apt-get update && \
+     apt-get install -y --no-install-recommends git curl && \
+     rm -rf /var/lib/apt/lists/*
+
+ WORKDIR /app
+ COPY . /app/env
+ WORKDIR /app/env
+
+ RUN if ! command -v uv >/dev/null 2>&1; then \
+         curl -LsSf https://astral.sh/uv/install.sh | sh && \
+         mv /root/.local/bin/uv /usr/local/bin/uv; \
+     fi
+
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then uv sync --frozen --no-install-project --no-editable; \
+     else uv sync --no-install-project --no-editable; fi
+
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then uv sync --frozen --no-editable; \
+     else uv sync --no-editable; fi
+
+ FROM ${BASE_IMAGE}
+ WORKDIR /app
+ COPY --from=builder /app/env/.venv /app/.venv
+ COPY --from=builder /app/env /app/env
+
+ ENV PATH="/app/.venv/bin:$PATH"
+ ENV PYTHONPATH="/app/env:$PYTHONPATH"
+ ENV ENABLE_WEB_INTERFACE=true
+ ENV WORKBOOKS_DIR=/app/env/workbooks
+ ENV SCENARIOS_DIR=/app/env/scenarios
+
+ EXPOSE 8000
+
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 \
+     CMD curl -sf http://localhost:8000/health || exit 1
+
+ CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
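Assuming the repository root is the build context (the image tag and port mapping here are illustrative), a typical build-and-run sequence for this Dockerfile might be:

```shell
# Build from the repo root, pointing at this Dockerfile (tag is illustrative).
docker build -f server/Dockerfile -t spreadsheet-env .

# Run with the server port published; the HEALTHCHECK polls /health inside the container.
docker run --rm -p 8000:8000 spreadsheet-env

# From another shell, hit the same endpoint the health check uses.
curl -sf http://localhost:8000/health
```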
server/app.py CHANGED
@@ -6,6 +6,9 @@ import os
  import sys
  from pathlib import Path

+ from dotenv import load_dotenv
+ from fastapi.middleware.cors import CORSMiddleware
+
  try:
      from openenv.core.env_server.http_server import create_app
  except ImportError as e:
@@ -13,6 +16,76 @@ except ImportError as e:
          "openenv is required. Install with: uv sync"
      ) from e

+ load_dotenv(os.path.join(os.path.dirname(__file__), "..", ".env"))
+
+ import openenv.core.env_server.web_interface as _wi  # noqa: E402
+
+ _wi.DEFAULT_QUICK_START_MARKDOWN = """
+ ### How to use this environment
+
+ **Spreadsheet** — exact workbook manipulation and reasoning over realistic spreadsheet tasks. Read sheets, understand structure, write values/formulas, and submit for automated evaluation.
+
+ Use the **Playground** on the right. Type a **Tool Name** and **Arguments Json**, then click **Step**.
+
+ ---
+
+ #### 1. Start a session
+
+ 1. Click **Reset**
+ 2. `list_tools` → `{}` — discover all 13 tools & their params
+ 3. `list_scenarios` → `{}` — see all 12 scenarios
+ 4. `load_scenario` → `{"scenario_id": "formula_repair_01"}`
+
+ #### 2. Explore the workbook
+
+ - `list_sheets` → `{}` — sheet names, dimensions, visibility
+ - `read_range` → `{"sheet": "Summary", "range": "A1:F10"}` — read cells
+ - `inspect_formula` → `{"sheet": "Summary", "cell": "C5"}` — raw formula string
+ - `list_named_targets` → `{}` — allowed output zones
+ - `get_session_info` → `{}` — session metadata, step count
+ - `get_edit_history` → `{}` — all edits so far
+
+ > **Note:** `read_range` uses **A1 notation** (e.g. `"B2:D10"`). Formulas are returned as strings.
+
+ #### 3. Edit cells
+
+ - `write_cell` → `{"sheet": "Summary", "cell": "C5", "value": "=SUM(B2:B10)"}` — write one cell
+ - `write_range` → `{"sheet": "Summary", "start_cell": "A1", "data": "[[1, 2], [3, 4]]"}` — write a block
+
+ > **Note:** `write_range` uses **start_cell** (not `cell`). The `data` arg is a JSON string of a 2D array.
+
+ > Values starting with `=` are treated as formulas. Numeric strings are auto-converted.
+
+ #### 4. Validate & submit
+
+ - `validate_partial` → `{}` — check progress (pass/fail count, no answers revealed)
+ - `submit_workbook` → `{}` — final evaluation (pass rate + per-check results)
+ - `reset_scenario` → `{}` — restore workbook to original (scenario stays loaded)
+
+ > Use `validate_partial` before `submit_workbook` to gauge progress without ending the task.
+
+ ---
+
+ #### Scenarios (12)
+
+ `formula_repair_01` · `formula_repair_02` · `cross_sheet_lookup_01` · `cross_sheet_lookup_02` · `conditional_aggregation_01` · `conditional_aggregation_02` · `ledger_reconciliation_01` · `ledger_reconciliation_02` · `messy_table_extraction_01` · `range_transformation_01` · `schedule_grid_fill_01` · `buggy_template_fix_01`
+
+ #### Connect from Python
+
+ ```python
+ from spreadsheet import SpreadsheetAction, SpreadsheetEnv
+
+ env = SpreadsheetEnv(base_url="http://localhost:8000")
+ obs = env.reset()
+ obs = await env.step(SpreadsheetAction(
+     tool_name="load_scenario",
+     arguments_json='{"scenario_id": "formula_repair_01"}'
+ ))
+ ```
+
+ For more, see the [OpenEnv documentation](https://meta-pytorch.org/OpenEnv/).
+ """
+
  try:
      from spreadsheet.models import SpreadsheetAction, SpreadsheetObservation
      from spreadsheet.server.spreadsheet_environment import SpreadsheetEnvironment
@@ -31,6 +104,13 @@ app = create_app(
      max_concurrent_envs=MAX_CONCURRENT_ENVS,
  )

+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+

  def main(host: str = "0.0.0.0", port: int = 8000):
      import uvicorn
workbooks/fixtures/13597ec4-95ae-4293-a2d1-aec276ac80e9_sales_commission.xlsx ADDED
Binary file (7.79 kB).
 
workbooks/fixtures/6dd7822d-39b9-4134-80ad-b7e653ad9944_product_revenue_by_region.xlsx ADDED
Binary file (11.8 kB).
 
workbooks/fixtures/8df5e07f-7a7d-4911-86bd-2e102df0cc7b_multi_department_budget.xlsx ADDED
Binary file (9.1 kB).