kdemon1011 committed on
Commit
fded8f2
·
verified ·
1 Parent(s): 6b4e5a8

Upload folder using huggingface_hub

.dockerignore ADDED
@@ -0,0 +1,37 @@
+ # Secrets — NEVER include in Docker image
+ .env
+ .env.local
+ .env.production
+
+ # Python artifacts
+ __pycache__/
+ *.pyc
+ *.pyo
+ *.egg-info/
+ .venv/
+
+ # Git
+ .git/
+ .gitignore
+
+ # Test / lint caches
+ .pytest_cache/
+ .mypy_cache/
+ .ruff_cache/
+
+ # Evaluation outputs (internal, not part of the env)
+ outputs/
+ trajectories/
+ results/
+ comparison.md
+ *.md.bak
+ generate_scenarios.py
+
+ # IDE
+ .cursor/
+ .vscode/
+ .idea/
+
+ # OS files
+ .DS_Store
+ Thumbs.db
.env.example ADDED
@@ -0,0 +1,28 @@
+ # ── Environment Server Configuration ──
+ OPENENV_PORT=8000
+ MAX_CONCURRENT_ENVS=8
+ ENABLE_WEB_INTERFACE=true
+ WORKBOOKS_DIR=workbooks
+ SCENARIOS_DIR=scenarios
+
+ # ── LLM Configuration (used by run_eval.py) ──
+ LLM_MODEL=gpt-4o
+ LLM_TEMPERATURE=0.0
+ LLM_MAX_TOKENS=1024
+
+ # ── API Keys ──
+ # Only the key for your chosen --model provider is required.
+
+ # OpenAI (for gpt-4o, gpt-5.4, o3-pro, etc.)
+ OPENAI_API_KEY=
+ OPENAI_API_BASE=https://api.openai.com/v1
+
+ # Anthropic (for claude-sonnet-4-6, claude-opus-4-6, etc.)
+ ANTHROPIC_API_KEY=
+
+ # Google (for gemini-2.5-pro, etc.)
+ GOOGLE_API_KEY=
+
+ # For local models via Ollama — no key needed, just run:
+ #   ollama serve && ollama pull llama3
+ # Then use: --model ollama/llama3
Dockerfile CHANGED
@@ -32,6 +32,14 @@ ENV PYTHONPATH="/app/env:$PYTHONPATH"
  ENV ENABLE_WEB_INTERFACE=true
  ENV WORKBOOKS_DIR=/app/env/workbooks
  ENV SCENARIOS_DIR=/app/env/scenarios
+ ENV SPACE_ID=huzzle-labs/spreadsheet
+
+ RUN python -c "\
+ import re, pathlib;\
+ src = pathlib.Path('/app/env/README.md').read_text();\
+ clean = re.sub(r'^---\n.*?\n---\n', '', src, count=1, flags=re.DOTALL);\
+ pathlib.Path('/app/env/.README_web.md').write_text(clean)"
+ ENV ENV_README_PATH=/app/env/.README_web.md

  EXPOSE 8000

README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: "Spreadsheet Environment Server"
  emoji: 📊
  colorFrom: green
  colorTo: blue
@@ -10,74 +10,468 @@ app_port: 8000
  base_path: /web
  tags:
  - openenv
  - rl-environment
  ---

- # Spreadsheet Environment

- Exact workbook manipulation and reasoning over realistic spreadsheet tasks. This gym targets weaknesses in structured state tracking, cross-sheet reasoning, non-standard table layouts, and exact edit correctness.

- ## Quick Start

  ```bash
- cd spreadsheet && docker build -t openenv-spreadsheet -f server/Dockerfile .
  docker run -d --name spreadsheet -p 8000:8000 openenv-spreadsheet
  curl http://localhost:8000/health
  ```

  ```python
- from spreadsheet import SpreadsheetEnv

- with SpreadsheetEnv(base_url="http://localhost:8000") as env:
-     result = env.reset()
-     # Use MCP tools: list_sheets, read_range, write_cell, submit_workbook, etc.
  ```

- ## Project Structure

  ```
- spreadsheet/
- ├── __init__.py
- ├── client.py
- ├── models.py
- ├── openenv.yaml
- ├── pyproject.toml
- ├── README.md
- ├── .env
- ├── .dockerignore
- ├── uv.lock
- ├── server/
- │   ├── __init__.py
- │   ├── app.py
- │   ├── spreadsheet_environment.py
- │   ├── workbook_engine.py
- │   ├── formula_utils.py
- │   └── scenario_loader.py
- ├── workbooks/
- │   ├── templates/
- │   ├── fixtures/
- │   └── hidden_tests/
- ├── scenarios/
- └── server/Dockerfile
  ```

  ## Reward System

- Both reward modes use a unified scoring formula:

  ```
  total = 0.25 × quality + 0.15 × efficiency + 0.60 × ground_truth + penalty
  ```

- - **Quality (0.25)** — Custom mode: F1 of expected vs used tools + success rate. OpenEnv mode: fraction of non-neutral steps that were productive (sign-based).
- - **Efficiency (0.15)** — `1.0 - (actual_steps / max_steps)`. Fewer steps = higher score.
- - **Ground Truth (0.60)** — Outcome checks verified against submit_workbook hidden test results (pass rate of cell/formula checks).
- - **Penalty** — Graduated: -0.5 (all calls succeed, 0% ground truth) or -0.2 (<30% ground truth).

- See [Reward System](../docs/reward-system.md) for full details.

- ## Deployment

  ```bash
- openenv push . --private --repo-id huzzle-labs/spreadsheet
  ```
  ---
+ title: "Spreadsheet Environment"
  emoji: 📊
  colorFrom: green
  colorTo: blue

  base_path: /web
  tags:
  - openenv
+ - openenv-0.2.3
  - rl-environment
  ---

+ # Spreadsheet Gym
+
+ **Exact workbook manipulation and reasoning over realistic spreadsheet tasks.**
+
+ An OpenEnv RL environment where agents must read, understand, and edit real `.xlsx` workbooks to solve structured tasks — formula repair, cross-sheet lookups, ledger reconciliation, messy data extraction, and more. Designed to stress structured state tracking, cross-sheet reasoning, non-standard table layouts, and exact edit correctness — areas where frontier LLMs consistently struggle.
+
+ ## Playground Quick Start
+
+ Use the **Playground** panel (right side) to interact with the environment. Type a **Tool Name** and **Arguments Json**, then click **Step**.
+
+ ### Typical workflow
+
+ 1. Click **Reset** to start a fresh session
+ 2. Enter `list_tools` (args: `{}`) → discover all available tools and their parameters
+ 3. Enter `list_scenarios` (args: `{}`) → see all 12 scenarios
+ 4. Enter `load_scenario` (args: `{"scenario_id": "formula_repair_01"}`) → start a task
+ 5. Enter `list_sheets` (args: `{}`) → see all sheets in the workbook
+ 6. Enter `read_range` (args: `{"sheet": "Summary", "range": "A1:F10"}`) → read cell values
+ 7. Enter `inspect_formula` (args: `{"sheet": "Summary", "cell": "C5"}`) → see the raw formula
+ 8. Enter `write_cell` (args: `{"sheet": "Summary", "cell": "C5", "value": "=SUM(B2:B10)"}`) → fix a formula
+ 9. Enter `validate_partial` (args: `{}`) → check how many hidden tests pass so far
+ 10. Enter `submit_workbook` (args: `{}`) → submit for final evaluation (ends the task)
+
+ ### All tool commands (copy-paste ready)
+
+ #### Discovery & session tools
+
+ | Tool Name | Arguments Json | Description |
+ |-----------|---------------|-------------|
+ | `list_tools` | `{}` | List every available tool with its parameters and types |
+ | `get_session_info` | `{}` | Current session ID, loaded scenario, step count, edit count, solve status |
+ | `list_scenarios` | `{}` | List all 12 scenarios with description, workbook name, and max steps |
+ | `load_scenario` | `{"scenario_id": "formula_repair_01"}` | Load a scenario and its workbook to begin working |
+ | `reset_scenario` | `{}` | Restore workbook to original state, keeping the scenario loaded |
+
+ #### Reading tools
+
+ | Tool Name | Arguments Json | Description |
+ |-----------|---------------|-------------|
+ | `list_sheets` | `{}` | List all sheets with names, dimensions, and visibility |
+ | `read_range` | `{"sheet": "Summary", "range": "B2:D10"}` | Read a rectangular range of cells (formulas shown as strings) |
+ | `inspect_formula` | `{"sheet": "Summary", "cell": "C15"}` | Return the raw formula string from a cell |
+ | `list_named_targets` | `{}` | Show target areas and allowed output zones for the scenario |
+
+ #### Writing tools
+
+ | Tool Name | Arguments Json | Description |
+ |-----------|---------------|-------------|
+ | `write_cell` | `{"sheet": "Summary", "cell": "C15", "value": "=SUM(B2:B10)"}` | Write a value or formula to a single cell |
+ | `write_range` | `{"sheet": "Summary", "start_cell": "A1", "data": "[[1, 2], [3, 4]]"}` | Write a 2D block of values starting from a cell |
+
+ > **Note:** `write_range` takes `start_cell` (not `cell`). The `data` argument is a JSON string of a 2D array.
+
+ #### Validation & submission tools
+
+ | Tool Name | Arguments Json | Description |
+ |-----------|---------------|-------------|
+ | `validate_partial` | `{}` | Check partial progress — how many hidden tests pass/fail (no answers revealed) |
+ | `submit_workbook` | `{}` | Submit for final evaluation — returns pass rate and per-check results |
+
+ #### History tools
+
+ | Tool Name | Arguments Json | Description |
+ |-----------|---------------|-------------|
+ | `get_edit_history` | `{}` | Full list of cell edits: sheet, cell, value, step number |
+
+ ### Important notes
+
+ - All string parameters are required — no optional arguments on any tool
+ - `write_cell` values starting with `=` are treated as formulas (e.g. `"=VLOOKUP(A2,Sheet2!A:B,2,FALSE)"`)
+ - `write_range` data must be a JSON string: `"[[1, 2], [3, 4]]"` not `[[1, 2], [3, 4]]`
+ - Writing outside target regions incurs a reward penalty
+ - Use `validate_partial` before `submit_workbook` to check progress without ending the task
+
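As a concrete illustration of the double serialization the notes above describe: the `data` field of `write_range` must itself be a JSON string inside the outer `Arguments Json`. A minimal sketch with plain `json.dumps` (tool and field names taken from the tables above):

```python
import json

# Build arguments for a write_range call. The "data" field must itself be
# a JSON string of a 2D array, so it is serialized twice overall.
data = [[1, 2], [3, 4]]
arguments = {
    "sheet": "Summary",
    "start_cell": "A1",
    "data": json.dumps(data),  # inner serialization: "[[1, 2], [3, 4]]"
}
arguments_json = json.dumps(arguments)  # outer serialization for the tool call
print(arguments_json)
```

Pasting the printed string into the **Arguments Json** field would therefore carry the inner array as a quoted string, which is what the tool expects.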
+ ### Run locally

  ```bash
+ cd spreadsheet
+ pip install -e .
+
+ # Start the environment server
+ docker build -t openenv-spreadsheet -f Dockerfile .
  docker run -d --name spreadsheet -p 8000:8000 openenv-spreadsheet
+
+ # Verify it's running
  curl http://localhost:8000/health
+
+ # Open the playground in your browser
+ open http://localhost:8000/web/
  ```

+ ## Hugging Face Space Deployment
+
+ This Space is built from the OpenEnv environment `spreadsheet`.
+
+ - **Space URL**: `https://huggingface.co/spaces/huzzle-labs/spreadsheet`
+ - **OpenEnv pinned ref**: `0.2.3`
+ - **Hub tag**: `openenv`
+
+ ### Connecting from Code
+
+ Connect using the `SpreadsheetEnv` client (`env.step` is awaited, so the calls run inside an async function):
+
  ```python
+ import asyncio
+
+ from spreadsheet import SpreadsheetAction, SpreadsheetEnv
+
+ async def main():
+     with SpreadsheetEnv.from_env("huzzle-labs/spreadsheet") as env:
+         obs = env.reset()
+         obs = await env.step(SpreadsheetAction(
+             tool_name="list_scenarios",
+             arguments_json="{}"
+         ))
+         obs = await env.step(SpreadsheetAction(
+             tool_name="load_scenario",
+             arguments_json='{"scenario_id": "formula_repair_01"}'
+         ))
+         obs = await env.step(SpreadsheetAction(
+             tool_name="read_range",
+             arguments_json='{"sheet": "Summary", "range": "A1:F10"}'
+         ))
+
+ asyncio.run(main())
  ```

+ Or connect directly to a running server:

+ ```python
+ env = SpreadsheetEnv(base_url="https://huzzle-labs-spreadsheet.hf.space")
  ```
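The tools in this environment address cells in A1 notation (`"C5"`, `"A1:F10"`). A small helper for converting such references to numeric coordinates can be handy when generating arguments programmatically; this function is illustrative and not part of the `spreadsheet` package:

```python
import re

def parse_cell(ref: str) -> tuple[int, int]:
    """Convert an A1-style cell reference to (row, column), both 1-based."""
    m = re.fullmatch(r"([A-Z]+)(\d+)", ref.upper())
    if not m:
        raise ValueError(f"not an A1 reference: {ref!r}")
    letters, digits = m.groups()
    col = 0
    for ch in letters:  # column letters are base-26 with A=1 ... Z=26
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return int(digits), col

print(parse_cell("C5"))    # (5, 3)
print(parse_cell("AA10"))  # (10, 27)
```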
+ ## What Is This Gym?
+
+ The Spreadsheet gym gives an LLM agent a real `.xlsx` workbook and a task description. The agent must use MCP tools to read sheets, understand the structure, write values or formulas, and submit the workbook for automated evaluation against hidden test checks. Every edit is tracked, and the agent must stay within target regions and step budgets.
+
+ Unlike typical code-generation or QA benchmarks, this gym requires:
+
+ - **Structured state tracking** — understanding multi-sheet workbook layouts with varying column structures
+ - **Cross-sheet reasoning** — performing lookups, aggregations, and reconciliations across sheets
+ - **Exact edit correctness** — writing precise formulas and values that pass deterministic hidden tests
+ - **Strategic tool use** — using `validate_partial` to check progress before committing with `submit_workbook`
+
+ ## Task Families (12 Scenarios)
+
+ ### Formula Repair (2 scenarios)
+ Fix broken formulas in multi-department workbooks. Diagnose incorrect references, cascading errors, and wrong aggregation functions.
+
+ | Scenario | Description | Max Steps |
+ |---|---|---|
+ | `formula_repair_01` | Fix broken formulas in a multi-department budget workbook | 50 |
+ | `formula_repair_02` | Fix cascading formula errors in a 5-year financial projection | 50 |
+
+ ### Cross-Sheet Lookup (2 scenarios)
+ Aggregate data across multiple sheets using lookups and cross-references.
+
+ | Scenario | Description | Max Steps |
+ |---|---|---|
+ | `cross_sheet_lookup_01` | Aggregate product revenue by region/category across quarterly sheets | 50 |
+ | `cross_sheet_lookup_02` | Calculate employee bonuses by cross-referencing Employees and Bonus_Tiers | 50 |
+
+ ### Conditional Aggregation (2 scenarios)
+ Apply tiered calculations with conditional logic and priority-based allocation.
+
+ | Scenario | Description | Max Steps |
+ |---|---|---|
+ | `conditional_aggregation_01` | Calculate tiered sales commissions for 15 salespeople | 50 |
+ | `conditional_aggregation_02` | Allocate a fixed budget across 20 requests with priority-based rates | 50 |
+
+ ### Ledger Reconciliation (2 scenarios)
+ Match and reconcile transactions across bank statements and internal ledgers.
+
+ | Scenario | Description | Max Steps |
+ |---|---|---|
+ | `ledger_reconciliation_01` | Reconcile bank statement against internal ledger — find mismatches | 50 |
+ | `ledger_reconciliation_02` | Reconcile USD and EUR transaction sheets into a unified summary | 50 |
+
+ ### Messy Table Extraction (1 scenario)
+ Extract and clean data from poorly formatted raw exports.
+
+ | Scenario | Description | Max Steps |
+ |---|---|---|
+ | `messy_table_extraction_01` | Extract/clean invoice data from messy export with mixed formats | 50 |
+
+ ### Range Transformation (1 scenario)
+ Reshape and pivot data between long-format and wide-format layouts.
+
+ | Scenario | Description | Max Steps |
+ |---|---|---|
+ | `range_transformation_01` | Pivot long-format employee metrics into wide-format table | 50 |
+
+ ### Schedule Grid Fill (1 scenario)
+ Fill structured grids respecting constraints and rules.
+
+ | Scenario | Description | Max Steps |
+ |---|---|---|
+ | `schedule_grid_fill_01` | Fill employee schedule grid for 12 employees × 7 days | 50 |
+
+ ### Buggy Template Fix (1 scenario)
+ Debug template workbooks with multiple interacting formula errors.
+
+ | Scenario | Description | Max Steps |
+ |---|---|---|
+ | `buggy_template_fix_01` | Debug quarterly financial report template with broken Annual_Summary | 50 |
+
+ ## Architecture
+
+ ```
+ ┌──────────────────────────────────────────┐
+ │          OpenEnv Server (:8000)          │
+ │ ┌────────────┐  ┌───────────────────┐    │
+ │ │  FastMCP   │──│  SpreadsheetEnv   │    │
+ │ │ (13 tools) │  │ (MCPEnvironment)  │    │
+ │ └────────────┘  └────────┬──────────┘    │
+ │                          │               │
+ │        ┌─────────────────┼──────────┐    │
+ │        │ Workbook        │ Scenario │    │
+ │        │ Engine          │ Loader   │    │
+ │        │ (openpyxl)      │          │    │
+ │        └─────────────────┴──────────┘    │
+ └──────────────────────────────────────────┘
  ```

+ All state is in-memory per session. No database, no external APIs. The workbook engine manages `.xlsx` files via openpyxl, tracks edits, and evaluates hidden tests. Formula evaluation uses the `formulas` library.
+
+ ## MCP Tools (13 total)
+
+ ### Session Management (4 tools)
+
+ | Tool | Description |
+ |------|-------------|
+ | `get_session_info` | Get current session metadata (scenario, step count, edit count, solved) |
+ | `list_scenarios` | List all available scenarios with description and max steps |
+ | `load_scenario` | Load a scenario and its workbook by ID |
+ | `reset_scenario` | Restore workbook to original state (scenario stays loaded) |
+
+ ### Reading (4 tools)
+
+ | Tool | Description |
+ |------|-------------|
+ | `list_sheets` | List all sheets with names, dimensions, visibility |
+ | `read_range` | Read cells from a sheet in A1 notation (formulas as strings) |
+ | `inspect_formula` | Get raw formula string from a specific cell |
+ | `list_named_targets` | Show allowed output zones for the scenario |
+
+ ### Writing (2 tools)
+
+ | Tool | Description |
+ |------|-------------|
+ | `write_cell` | Write a value or formula to a single cell |
+ | `write_range` | Write a 2D block of values starting from a cell |
+
+ ### Validation & Submission (2 tools)
+
+ | Tool | Description |
+ |------|-------------|
+ | `validate_partial` | Check progress against hidden tests without revealing answers |
+ | `submit_workbook` | Submit for final evaluation (pass rate + per-check results) |
+
+ ### History (1 tool)
+
+ | Tool | Description |
+ |------|-------------|
+ | `get_edit_history` | Full edit log with sheet, cell, value, step number |
+
  ## Reward System

+ This gym ships with **two** reward modes, selectable via `--reward-mode`:
+
+ ### Custom Rewards — Episode-Level (`rewards/checks.py`)
+
+ The `SpreadsheetChecker` verifies ground truth from the episode trajectory and computes a weighted score:
+
+ | Component | Weight | Description |
+ |---|---|---|
+ | `quality` | 0.25 | F1 of expected vs used tools + success rate |
+ | `efficiency` | 0.15 | `1.0 - (actual_steps / max_steps)` — fewer steps = higher |
+ | `ground_truth` | 0.60 | Hidden test pass rate from `submit_workbook` |
+ | `penalty` | variable | -0.5 (all calls succeed but 0% GT) or -0.2 (<30% GT) |

  ```
  total = 0.25 × quality + 0.15 × efficiency + 0.60 × ground_truth + penalty
  ```

+ ```python
+ from rewards.checks import SpreadsheetChecker
+
+ checker = SpreadsheetChecker()
+ checker.set_episode(episode)
+ reward = checker.compute_episode_reward()
+ # {'quality': 0.72, 'efficiency': 0.65, 'ground_truth': 0.80, ..., 'total': 0.68}
+ ```
+
+ The base `RewardCalculator` (`rewards/base.py`) wraps this into the standard 3-component formula used across all gyms.

+ ### OpenEnv Transforms — Per-Step (`rewards/transforms.py`)

+ The `SpreadsheetStepTransform` provides fine-grained per-step rewards for RL training (GRPO). Each tool call receives a reward based on its outcome:
+
+ | Tool | Success | Failure |
+ |---|---|---|
+ | `read_range` / `list_sheets` | 0.0 (neutral) | 0.0 |
+ | `inspect_formula` | +0.05 | 0.0 |
+ | `validate_partial` (improved) | +0.10 | +0.05 |
+ | `write_cell` / `write_range` (in target, after read) | +0.10 | -0.10 |
+ | `write_cell` / `write_range` (out of target) | -0.10 | -0.10 |
+ | `submit_workbook` (100% pass) | +0.50 | — |
+ | `submit_workbook` (>50% pass) | +0.20 | — |
+ | `submit_workbook` (<30% pass) | -0.10 | — |
+
+ ```python
+ from rewards.transforms import SpreadsheetStepTransform
+
+ transform = SpreadsheetStepTransform()
+ scored_obs = transform(observation)
+ print(scored_obs.reward)  # e.g., +0.10 for a write in target after reading
+ ```
+
+ The `OpenEnvRewardCalculator` (`rewards/base.py`) combines per-step rewards with ground truth into the same weighted formula, using sign-based quality scoring.
+
+
333
+ ## Evaluation
334
+
335
+ The included `run_eval.py` runs an LLM agent against scenarios and scores results.
336
+
337
+ ### Quick Start
338
 
339
  ```bash
340
+ cd spreadsheet
341
+ pip install -e .
342
+
343
+ # Build and run the environment
344
+ docker build -t openenv-spreadsheet -f Dockerfile .
345
+ docker run -d --name spreadsheet -p 8000:8000 openenv-spreadsheet
346
+
347
+ # Verify
348
+ curl http://localhost:8000/health
349
+
350
+ # Evaluate (single model, custom rewards)
351
+ python run_eval.py --model gpt-5.4 --save --trajectory
352
+
353
+ # Evaluate (multiple models, per-step rewards)
354
+ python run_eval.py --model gpt-5.4,claude-sonnet-4-6,claude-opus-4-6 \
355
+ --parallel 3 --reward-mode openenv --save --trajectory
356
+
357
+ # Evaluate a specific scenario
358
+ python run_eval.py --model gpt-5.4 --scenario formula_repair_01
359
+
360
+ # Cleanup
361
+ docker stop spreadsheet && docker rm spreadsheet
362
  ```
363
+
364
+ ### Output Paths
365
+
366
+ | Output | Path |
367
+ |---|---|
368
+ | Results markdown | `outputs/results/<run_id>.md` |
369
+ | Trajectory JSON | `outputs/trajectories/<run_id>/<model>.json` |
370
+
371
+ Results files append per-model sections so you can accumulate multiple model runs in one file.
372
+
373
+ ### CLI Arguments
374
+
375
+ | Argument | Default | Description |
376
+ |---|---|---|
377
+ | `--model` | `gpt-4o` | LiteLLM model string (comma-separated for parallel) |
378
+ | `--scenario` | all | Run a specific scenario by ID |
379
+ | `--reward-mode` | `custom` | `custom` (episode-level) or `openenv` (per-step) |
380
+ | `--parallel` | `1` | Number of models to run in parallel |
381
+ | `--save` | off | Save results markdown |
382
+ | `--trajectory` | off | Save trajectory JSON |
383
+ | `--temperature` | `0.0` | LLM sampling temperature |
384
+ | `--max-tokens` | `1024` | Max tokens per LLM response |
385
+ | `--run-id` | auto | Run identifier for grouping outputs |
386
+ | `--verbose` | off | Enable debug logging |
387
+
388
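The CLI surface above maps onto a standard `argparse` setup; a sketch of how these flags (including the comma-separated `--model` convention) could be declared and parsed, not the actual `run_eval.py` parser:

```python
import argparse

# Declare the flags from the CLI table; hyphenated names become attributes
# with underscores (e.g. --reward-mode -> args.reward_mode).
parser = argparse.ArgumentParser(description="Run LLM agents against spreadsheet scenarios")
parser.add_argument("--model", default="gpt-4o", help="LiteLLM model string (comma-separated for parallel)")
parser.add_argument("--scenario", default=None, help="Run a specific scenario by ID")
parser.add_argument("--reward-mode", choices=["custom", "openenv"], default="custom")
parser.add_argument("--parallel", type=int, default=1)
parser.add_argument("--save", action="store_true")
parser.add_argument("--trajectory", action="store_true")
parser.add_argument("--temperature", type=float, default=0.0)
parser.add_argument("--max-tokens", type=int, default=1024)

args = parser.parse_args(["--model", "gpt-5.4,claude-sonnet-4-6", "--parallel", "2"])
models = args.model.split(",")  # one entry per model to run
print(models)  # ['gpt-5.4', 'claude-sonnet-4-6']
```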
+ ## Project Structure
+
+ ```
+ spreadsheet/
+ ├── __init__.py                   # Package exports (env + rewards)
+ ├── client.py                     # OpenEnv client integration
+ ├── models.py                     # Action/Observation data models
+ ├── openenv.yaml                  # OpenEnv AutoEnv manifest
+ ├── pyproject.toml                # Dependencies (openenv-core v0.2.3)
+ ├── Dockerfile                    # Root Dockerfile for HF Spaces
+ ├── .dockerignore
+ ├── run_eval.py                   # LLM evaluation runner
+ │
+ ├── rewards/                      # Reward system (both modes)
+ │   ├── __init__.py
+ │   ├── base.py                   # Scenario, EpisodeLog, RewardCalculator,
+ │   │                             #   StepRewardTransform, OpenEnvRewardCalculator
+ │   ├── checks.py                 # SpreadsheetChecker (episode-level)
+ │   └── transforms.py             # SpreadsheetStepTransform (per-step)
+ │
+ ├── scenarios/                    # Scenario definitions + JSON configs
+ │   ├── __init__.py
+ │   ├── definitions.py            # 12 Scenario objects (Python)
+ │   └── *.json                    # Scenario board configs
+ │
+ ├── agent/                        # LLM agent runner
+ │   ├── __init__.py
+ │   ├── llm.py                    # LiteLLM wrapper
+ │   └── runner.py                 # AgentRunner (gym-agnostic)
+ │
+ ├── server/                       # OpenEnv environment server
+ │   ├── __init__.py
+ │   ├── app.py                    # FastAPI + FastMCP server
+ │   ├── spreadsheet_environment.py # MCPEnvironment implementation
+ │   ├── workbook_engine.py        # Workbook engine (openpyxl)
+ │   ├── formula_utils.py          # Formula evaluation
+ │   ├── scenario_loader.py        # Scenario JSON loader
+ │   └── Dockerfile                # Server-only Dockerfile
+ │
+ ├── workbooks/                    # Workbook files
+ │   ├── templates/                # Base workbook templates
+ │   ├── fixtures/                 # Test fixture workbooks
+ │   └── hidden_tests/             # Hidden test check definitions
+ │
+ └── outputs/                      # Evaluation outputs (gitignored)
+     ├── results/                  # Markdown result files
+     └── trajectories/             # JSON trajectory files
+ ```
+
+ ## Configuration (.env)
+
+ Copy `.env.example` to `.env` and fill in your API keys:
+
+ ```bash
+ cp .env.example .env
+ # Edit .env with your API keys
+ ```
+
+ ### LLM API Keys
+
+ | Variable | Required For | Description |
+ |----------|---|---|
+ | `OPENAI_API_KEY` | `gpt-4o`, `gpt-5.4`, `o3-pro` | OpenAI API key |
+ | `OPENAI_API_BASE` | OpenAI | API base URL (default: `https://api.openai.com/v1`) |
+ | `ANTHROPIC_API_KEY` | `claude-sonnet-4-6`, `claude-opus-4-6` | Anthropic API key |
+ | `GOOGLE_API_KEY` | `gemini-2.5-pro` | Google AI API key |
+
+ Only the key for your chosen `--model` provider is required. For local models via Ollama, no key is needed.
+
+ ### LLM Defaults
+
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `LLM_MODEL` | `gpt-4o` | Default model when `--model` is not specified |
+ | `LLM_TEMPERATURE` | `0.0` | Default sampling temperature |
+ | `LLM_MAX_TOKENS` | `1024` | Default max tokens per response |
+
+ ### Environment Server
+
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `OPENENV_PORT` | `8000` | OpenEnv server port (exposed) |
+ | `MAX_CONCURRENT_ENVS` | `8` | Max parallel evaluation sessions |
+ | `ENABLE_WEB_INTERFACE` | `true` | Enable HF Spaces web UI |
+ | `WORKBOOKS_DIR` | `workbooks` | Workbook files directory |
+ | `SCENARIOS_DIR` | `scenarios` | Scenario JSON directory |
+
+ ## Concurrent Sessions
+
+ Each evaluation session gets its own isolated workbook engine instance. Multiple agents can evaluate simultaneously against the same Docker container without interference.
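Server-side, settings like the table above are typically read from the process environment with typed defaults. A sketch using stdlib `os.environ`; the actual loader in `server/app.py` is not shown in this commit:

```python
import os

def int_env(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    return int(os.environ.get(name, str(default)))

# Defaults mirror the Environment Server table above.
port = int_env("OPENENV_PORT", 8000)
max_envs = int_env("MAX_CONCURRENT_ENVS", 8)
web_ui = os.environ.get("ENABLE_WEB_INTERFACE", "true").lower() == "true"
```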
__init__.py CHANGED
@@ -1,11 +1,26 @@
  """Spreadsheet Environment."""

  from .client import SpreadsheetEnv
- from .models import SpreadsheetAction, SpreadsheetObservation, SpreadsheetState
+ from .models import (
+     SpreadsheetAction,
+     SpreadsheetObservation,
+     SpreadsheetState,
+     CallToolAction,
+     CallToolObservation,
+     ListToolsAction,
+     ListToolsObservation,
+ )
+ from .rewards import SpreadsheetChecker, SpreadsheetStepTransform

  __all__ = [
+     "SpreadsheetEnv",
      "SpreadsheetAction",
      "SpreadsheetObservation",
      "SpreadsheetState",
-     "SpreadsheetEnv",
+     "CallToolAction",
+     "CallToolObservation",
+     "ListToolsAction",
+     "ListToolsObservation",
+     "SpreadsheetChecker",
+     "SpreadsheetStepTransform",
  ]
agent/__init__.py ADDED
@@ -0,0 +1,4 @@
+ from .runner import AgentRunner
+ from .llm import LLMClient
+
+ __all__ = ["AgentRunner", "LLMClient"]
agent/llm.py ADDED
@@ -0,0 +1,114 @@
+ """
+ LLM abstraction layer using LiteLLM.
+
+ Supports any model LiteLLM supports — switch with a single string:
+ - OpenAI: "gpt-4o", "gpt-5.4", "o3-pro"
+ - Anthropic: "claude-opus-4-6", "claude-sonnet-4-6"
+ - Local: "ollama/llama3", "ollama/mistral"
+ - And 100+ more providers
+
+ API keys are read from environment variables (loaded from root .env):
+ OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.
+
+ Usage:
+     from agent.llm import LLMClient
+
+     llm = LLMClient(model="gpt-4o")
+     response = llm.chat(
+         messages=[{"role": "user", "content": "Hello"}],
+         tools=[...],
+     )
+ """
+
+ import json
+ import logging
+ from typing import Any, Dict, List, Optional
+
+ import litellm
+
+ logger = logging.getLogger(__name__)
+
+
+ class LLMClient:
+     """
+     Thin wrapper around LiteLLM for consistent tool-calling across providers.
+
+     The same code works whether you're hitting GPT-4o, Claude, or a local
+     Ollama model — LiteLLM handles the translation.
+     """
+
+     _REASONING_MODELS = {"o3-pro", "o3-mini", "o3", "o1", "o1-mini", "o1-pro", "gpt-5"}
+
+     def __init__(
+         self,
+         model: str,
+         temperature: float = 0.0,
+         max_tokens: int = 1024,
+     ):
+         self.model = model
+
+         if model in self._REASONING_MODELS:
+             self.temperature = 1.0
+             self.max_tokens = max(max_tokens, 4096)
+             if temperature != 1.0:
+                 logger.info(f"Model {model} requires temperature=1.0, overriding from {temperature}")
+         else:
+             self.temperature = temperature
+             self.max_tokens = max_tokens
+
+     def chat(
+         self,
+         messages: List[Dict[str, Any]],
+         tools: Optional[List[Dict[str, Any]]] = None,
+     ) -> Any:
+         """
+         Send messages to the LLM and get a response.
+
+         Args:
+             messages: Conversation history in OpenAI format
+             tools: Optional list of tools in OpenAI function-calling format
+
+         Returns:
+             LiteLLM ModelResponse (same shape as OpenAI ChatCompletion).
+         """
+         kwargs: Dict[str, Any] = {
+             "model": self.model,
+             "messages": messages,
+             "temperature": self.temperature,
+             "max_tokens": self.max_tokens,
+         }
+
+         if tools:
+             kwargs["tools"] = tools
+             kwargs["tool_choice"] = "auto"
+
+         logger.debug(f"LLM request: model={self.model}, messages={len(messages)}, tools={len(tools or [])}")
+         response = litellm.completion(**kwargs)
+         logger.debug(f"LLM response: finish_reason={response.choices[0].finish_reason}")
+
+         return response
+
+     @staticmethod
+     def extract_tool_calls(response) -> List[Dict[str, Any]]:
+         """Extract tool calls from an LLM response."""
+         choice = response.choices[0]
+         if not choice.message.tool_calls:
+             return []
+
+         calls = []
+         for tc in choice.message.tool_calls:
+             args = tc.function.arguments
+             if isinstance(args, str):
+                 args = json.loads(args)
+             calls.append({
+                 "id": tc.id,
+                 "name": tc.function.name,
+                 "arguments": args,
+             })
+         return calls
+
+     @staticmethod
+     def get_text_response(response) -> Optional[str]:
+         """Extract plain text content from an LLM response (if any)."""
+         choice = response.choices[0]
+         return choice.message.content
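A quick way to see what `extract_tool_calls` produces, using a stand-in for the provider response. The attribute shapes mirror OpenAI/LiteLLM response objects; the mock and the inlined extraction loop are illustrative:

```python
import json
from types import SimpleNamespace

# Stand-in for a LiteLLM ModelResponse carrying one tool call whose
# arguments arrive as a JSON-encoded string, as providers return them.
tool_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(
        name="write_cell",
        arguments=json.dumps({"sheet": "Summary", "cell": "C5", "value": "=SUM(B2:B10)"}),
    ),
)
message = SimpleNamespace(tool_calls=[tool_call], content=None)
response = SimpleNamespace(choices=[SimpleNamespace(message=message)])

# Same logic as LLMClient.extract_tool_calls: decode string arguments to a dict.
calls = []
for tc in response.choices[0].message.tool_calls:
    args = tc.function.arguments
    if isinstance(args, str):
        args = json.loads(args)
    calls.append({"id": tc.id, "name": tc.function.name, "arguments": args})

print(calls[0]["name"])               # write_cell
print(calls[0]["arguments"]["cell"])  # C5
```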
agent/runner.py ADDED
@@ -0,0 +1,282 @@
+ """
+ Gym-agnostic Agent Runner — connects an LLM to any OpenEnv environment.
+
+ This module is the CORE of the evaluation platform. It:
+ 1. Receives a pre-connected OpenEnv client (from AutoEnv discovery)
+ 2. Discovers tools via list_tools()
+ 3. Gives the LLM a scenario prompt + available tools
+ 4. Loops: LLM reasons → agent calls env.step() → observation → LLM reasons again
+ 5. Collects an EpisodeLog with timestamps for reward calculation + trajectory logging
+
+ Usage:
+     from openenv import AutoEnv
+     env = AutoEnv.from_env("spreadsheet", base_url="http://localhost:8000")
+     runner = AgentRunner(model="gpt-4o", env_client=env)
+     episode, breakdown = runner.run_scenario(scenario, checker)
+ """
+
+ import json
+ import logging
+ import time
+ from datetime import datetime, timezone, timedelta
+ from typing import Any, Dict, List, Tuple
+
+ from openenv.core.mcp_client import MCPToolClient
+ from openenv.core.env_server.mcp_types import CallToolAction, CallToolObservation, Tool
+
+ from ..rewards.base import (
+     EpisodeLog,
+     RewardBreakdown,
+     RewardCalculator,
+     Scenario,
+     OpenEnvRewardCalculator,
+ )
+ from .llm import LLMClient
+
+ logger = logging.getLogger(__name__)
+
+ IST = timezone(timedelta(hours=5, minutes=30))
+
+
+ SYSTEM_PROMPT = """\
+ You are an AI agent interacting with an environment through tools.
+
+ Your job:
+ 1. Read the task description carefully.
+ 2. Use the available tools to complete the task.
+ 3. Call tools one at a time. Wait for each result before deciding the next step.
+ 4. When the task is complete, respond with a plain text summary of what you did.
+    Do NOT call any more tools after you're done.
+
+ Rules:
+ - Only use tools that are listed as available.
+ - Provide all required arguments for each tool call.
+ - If a tool call fails, read the error and decide how to recover.
+ - Be efficient — complete the task in as few steps as possible.
+ - When you're done, clearly state what you accomplished.
+ """
+
+
+ def mcp_tools_to_openai(tools: List[Tool]) -> List[Dict[str, Any]]:
+     """Convert OpenEnv MCP tool definitions to OpenAI function-calling format."""
+     openai_tools = []
+     for tool in tools:
+         schema = tool.input_schema or {"type": "object", "properties": {}}
+         if "type" not in schema:
+             schema["type"] = "object"
+         if "properties" not in schema:
+             schema["properties"] = {}
+
+         openai_tools.append({
+             "type": "function",
+             "function": {
+                 "name": tool.name,
+                 "description": tool.description or "",
+                 "parameters": schema,
+             },
+         })
+     return openai_tools
+
+
+ def _observation_to_str(step_result) -> str:
+     """Convert an OpenEnv step result to a string the LLM can read."""
+     obs = step_result.observation
+     if isinstance(obs, CallToolObservation):
+         if obs.error:
+             return json.dumps({"error": obs.error.message}, indent=2)
+         result = obs.result
+         if hasattr(result, "data"):
+             result = result.data
+         elif isinstance(result, dict) and "data" in result:
+             result = result["data"]
+         try:
+             return json.dumps(result, indent=2, default=str)
+         except (TypeError, ValueError):
+             return str(result)
+     if hasattr(obs, "metadata") and obs.metadata:
+         return json.dumps(obs.metadata, indent=2, default=str)
+     return str(obs)
+
+
+ class AgentRunner:
+     """
+     Gym-agnostic agent that connects an LLM to any OpenEnv environment.
+
+     Reward modes:
+     - "custom" (default): Episode-level reward via RewardCalculator
+     - "openenv": Per-step reward via Transform + ground truth
+     """
+
+     def __init__(
+         self,
+         model: str,
+         env_client: MCPToolClient,
+         temperature: float = 0.0,
+         max_tokens: int = 1024,
+         reward_mode: str = "custom",
+         transform=None,
+     ):
+         self.llm = LLMClient(
+             model=model,
+             temperature=temperature,
+             max_tokens=max_tokens,
+         )
+         self.env_client = env_client
+         self.reward_mode = reward_mode
+         self.transform = transform
+
+         self.calculator = RewardCalculator()
+
+         if reward_mode == "openenv":
+             self.openenv_calculator = OpenEnvRewardCalculator()
+
+     def run_scenario(
+         self,
+         scenario: Scenario,
+         checker: Any,
+     ) -> Tuple[EpisodeLog, RewardBreakdown]:
+         """Run a single scenario through the LLM agent."""
+         return self._execute(scenario, checker, self.env_client)
+
+     def _execute(
+         self,
+         scenario: Scenario,
+         checker: Any,
+         env: MCPToolClient,
+     ) -> Tuple[EpisodeLog, RewardBreakdown]:
+
+         env.reset()
+
+         session_id = None
+         try:
+             session_result = env.step(
+                 CallToolAction(tool_name="get_session_info", arguments={})
+             )
+             obs = session_result.observation
+             if isinstance(obs, CallToolObservation) and obs.result:
+                 result_data = obs.result
+                 if hasattr(result_data, "data"):
+                     result_data = result_data.data
+                 elif isinstance(result_data, dict) and "data" in result_data:
+                     result_data = result_data["data"]
+                 if isinstance(result_data, dict):
+                     session_id = result_data.get("session_id")
+                 elif isinstance(result_data, str):
+                     try:
+                         parsed = json.loads(result_data)
+                         session_id = parsed.get("session_id")
+                     except (ValueError, TypeError):
+                         pass
+         except Exception as e:
+             logger.warning(f"Could not get session_id: {e}")
+
+         if session_id and hasattr(checker, "set_session"):
+             checker.set_session(session_id)
+             logger.info(f"Session-scoped checker -> {session_id}")
+
+         if self.transform and hasattr(self.transform, "set_scenario"):
+             self.transform.set_scenario(scenario)
+
+         all_tools = env.list_tools(use_cache=False)
+         tools = [t for t in all_tools if t.name != "get_session_info"]
+         openai_tools = mcp_tools_to_openai(tools)
+         tool_names = [t.name for t in tools]
+         logger.info(f"Discovered {len(tools)} agent tools: {tool_names}")
+
+         messages = [
+             {"role": "system", "content": SYSTEM_PROMPT},
+             {"role": "user", "content": scenario.prompt},
+         ]
+
+         episode = EpisodeLog()
+         step_rewards = []
+         final_answer = None
+
+         for step_num in range(1, scenario.max_steps + 1):
+             logger.info(f"Step {step_num}/{scenario.max_steps}")
+
+             response = self.llm.chat(messages, tools=openai_tools)
+             tool_calls = LLMClient.extract_tool_calls(response)
+
+             if not tool_calls:
+                 final_answer = LLMClient.get_text_response(response)
+                 logger.info(f"Agent done. Final answer: {(final_answer or '')[:100]}...")
+                 break
+
+             messages.append(response.choices[0].message.model_dump())
+
+             for tc in tool_calls:
+                 tool_name = tc["name"]
+                 arguments = tc["arguments"]
+                 call_id = tc["id"]
+
+                 logger.info(f"  Tool: {tool_name}({json.dumps(arguments, default=str)[:100]})")
+
+                 step_ts = datetime.now(IST).isoformat()
+                 step_start = time.time()
+                 error_msg = None
+                 try:
+                     step_result = env.step(
+                         CallToolAction(tool_name=tool_name, arguments=arguments)
+                     )
+                     obs = step_result.observation
+                     is_error = (
+                         isinstance(obs, CallToolObservation)
+                         and obs.error is not None
+                     )
+                     result_str = _observation_to_str(step_result)
+                     if is_error and isinstance(obs, CallToolObservation):
+                         error_msg = obs.error.message
+                 except Exception as exc:
+                     is_error = True
+                     error_msg = str(exc)
+                     result_str = json.dumps({"error": error_msg})
+                     obs = None
+
+                 step_elapsed = time.time() - step_start
+
+                 if self.reward_mode == "openenv" and self.transform and obs is not None:
+                     transformed = self.transform(obs)
+                     step_rewards.append(
+                         transformed.reward if transformed.reward is not None else 0.0
+                     )
+
+                 episode.add_step(
+                     tool_name=tool_name,
+                     arguments=arguments,
+                     success=not is_error,
+                     result=result_str,
+                     error=error_msg,
+                     timestamp=step_ts,
+                     elapsed=step_elapsed,
+                 )
+
+                 logger.info(f"  -> success={not is_error} ({step_elapsed:.2f}s)")
+
+                 messages.append({
+                     "role": "tool",
+                     "tool_call_id": call_id,
+                     "content": result_str,
+                 })
+
+         if hasattr(checker, "set_episode"):
+             checker.set_episode(episode)
+
+         outcome_results = checker.check_all(scenario.outcome_checks)
+
+         if self.reward_mode == "openenv":
+             breakdown = self.openenv_calculator.calculate(
+                 step_rewards=step_rewards,
+                 outcome_results=outcome_results,
+                 max_steps=scenario.max_steps,
+                 actual_steps=len(episode.steps),
+             )
+         else:
+             breakdown = self.calculator.calculate(
+                 episode=episode,
+                 scenario=scenario,
+                 outcome_results=outcome_results,
+             )
+
+         return episode, breakdown
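The message bookkeeping the loop performs for one tool round can be sketched with plain dicts in OpenAI chat format (an illustration, not part of the commit; `call_1` and the tool arguments are made up):

```python
# One round of the agent loop, as plain chat messages.
messages = [
    {"role": "system", "content": "You are an AI agent..."},
    {"role": "user", "content": "Set A1 to 42."},
]

# 1. The assistant turn carries the tool call (the shape model_dump() yields).
assistant_msg = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "write_cell", "arguments": '{"cell": "A1", "value": 42}'},
    }],
}
messages.append(assistant_msg)

# 2. The env result is appended as a "tool" message with the matching id,
#    so the next chat() call sees the full round.
messages.append({"role": "tool", "tool_call_id": "call_1", "content": '{"ok": true}'})

assert messages[-1]["tool_call_id"] == messages[-2]["tool_calls"][0]["id"]
```

If the `tool_call_id` does not match an id in the preceding assistant message, most chat APIs reject the request, which is why the runner appends the assistant message before executing any of its tool calls.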
pyproject.toml CHANGED
@@ -8,7 +8,7 @@ version = "0.1.0"
  description = "Spreadsheet gym — exact workbook manipulation and reasoning over realistic spreadsheet tasks"
  requires-python = ">=3.11"
  dependencies = [
-     "openenv-core @ git+https://github.com/meta-pytorch/OpenEnv.git@v0.2.1",
+     "openenv-core @ git+https://github.com/meta-pytorch/OpenEnv.git@v0.2.3",
      "fastapi>=0.115.0",
      "pydantic>=2.0.0",
      "uvicorn[standard]>=0.24.0",
@@ -17,6 +17,9 @@ dependencies = [
      "openpyxl>=3.1.0",
      "pandas>=2.0.0",
      "formulas>=1.2.0",
+     "litellm>=1.0.0",
+     "pyyaml>=6.0.0",
+     "python-dotenv>=1.0.0",
  ]

  [project.optional-dependencies]
@@ -30,8 +33,21 @@ server = "spreadsheet.server.app:main"

  [tool.setuptools]
  include-package-data = true
- packages = ["spreadsheet", "spreadsheet.server"]
- package-dir = { "spreadsheet" = ".", "spreadsheet.server" = "server" }
+ packages = [
+     "spreadsheet",
+     "spreadsheet.server",
+     "spreadsheet.rewards",
+     "spreadsheet.scenarios",
+     "spreadsheet.agent",
+ ]
+
+ [tool.setuptools.package-dir]
+ spreadsheet = "."
+ "spreadsheet.server" = "server"
+ "spreadsheet.rewards" = "rewards"
+ "spreadsheet.scenarios" = "scenarios"
+ "spreadsheet.agent" = "agent"

  [tool.setuptools.package-data]
  spreadsheet = ["openenv.yaml"]
+ "spreadsheet.scenarios" = ["*.json"]
rewards/__init__.py ADDED
@@ -0,0 +1,23 @@
+ from .checks import SpreadsheetChecker
+ from .transforms import SpreadsheetStepTransform
+ from .base import (
+     Scenario,
+     EpisodeLog,
+     StepLog,
+     RewardBreakdown,
+     RewardCalculator,
+     StepRewardTransform,
+     OpenEnvRewardCalculator,
+ )
+
+ __all__ = [
+     "SpreadsheetChecker",
+     "SpreadsheetStepTransform",
+     "Scenario",
+     "EpisodeLog",
+     "StepLog",
+     "RewardBreakdown",
+     "RewardCalculator",
+     "StepRewardTransform",
+     "OpenEnvRewardCalculator",
+ ]
rewards/base.py ADDED
@@ -0,0 +1,313 @@
+ """
+ Base reward infrastructure — data classes, calculators, and transforms.
+
+ Merged from the shared repo-level modules into a self-contained file:
+ - Episode-level: RewardCalculator (custom mode)
+ - Per-step: StepRewardTransform + OpenEnvRewardCalculator (openenv mode)
+
+ Scoring formula (both modes):
+     total = 0.25 * quality/structural + 0.15 * efficiency + 0.60 * ground_truth + penalty
+
+ Usage:
+     from rewards.base import RewardCalculator, Scenario, EpisodeLog
+     calculator = RewardCalculator()
+     breakdown = calculator.calculate(episode, scenario, outcome_results)
+ """
+
+ from dataclasses import dataclass, field
+ from typing import Any, Dict, List, Optional, Set
+
+ from openenv.core.env_server.interfaces import Transform
+ from openenv.core.env_server.mcp_types import CallToolObservation
+ from openenv.core.env_server.types import Observation
+
+
+ # ── Data Classes ──
+
+
+ @dataclass
+ class StepLog:
+     """Record of a single tool call made by the agent."""
+
+     tool_name: str
+     arguments: Dict[str, Any]
+     success: bool
+     result: Any = None
+     error: Optional[str] = None
+     timestamp: Optional[str] = None
+     elapsed: float = 0.0
+
+
+ @dataclass
+ class EpisodeLog:
+     """Record of all tool calls in one episode."""
+
+     steps: List[StepLog] = field(default_factory=list)
+
+     def add_step(
+         self,
+         tool_name: str,
+         arguments: Dict[str, Any],
+         success: bool,
+         result: Any = None,
+         error: Optional[str] = None,
+         timestamp: Optional[str] = None,
+         elapsed: float = 0.0,
+     ) -> None:
+         self.steps.append(
+             StepLog(
+                 tool_name=tool_name,
+                 arguments=arguments,
+                 success=success,
+                 result=result,
+                 error=error,
+                 timestamp=timestamp,
+                 elapsed=elapsed,
+             )
+         )
+
+     @property
+     def tools_used(self) -> List[str]:
+         return [s.tool_name for s in self.steps]
+
+     @property
+     def tools_used_set(self) -> Set[str]:
+         return set(self.tools_used)
+
+
+ @dataclass
+ class Scenario:
+     """Definition of a task for the agent."""
+
+     id: str
+     prompt: str
+     expected_tools: List[str]
+     max_steps: int
+     outcome_checks: List[Dict[str, Any]]
+
+
+ @dataclass
+ class RewardBreakdown:
+     """Detailed reward breakdown — useful for debugging and logging."""
+
+     structural: float = 0.0
+     ground_truth: float = 0.0
+     efficiency: float = 0.0
+     penalty: float = 0.0
+     total: float = 0.0
+     details: Dict[str, Any] = field(default_factory=dict)
+
+     def summary(self) -> str:
+         mode = self.details.get("reward_mode", "custom")
+         qual_label = "Quality" if mode == "openenv" else "Structural"
+         lines = [
+             f"  {qual_label + ':':14s}{self.structural:.2f} (weight 0.25)",
+             f"  Efficiency:   {self.efficiency:.2f} (weight 0.15)",
+             f"  Ground Truth: {self.ground_truth:.2f} (weight 0.60)",
+         ]
+         if self.penalty < 0:
+             lines.append(f"  Penalty:      {self.penalty:.2f} (hallucination)")
+         lines.append("  ────────────────────────")
+         lines.append(f"  TOTAL:        {self.total:.2f}")
+         return "\n".join(lines)
+
+
+ # ── Episode-Level Reward Calculator (custom mode) ──
+
+
+ class RewardCalculator:
+     """
+     Computes episode-level reward from logs + scenario + verification results.
+
+     Weights: structural (0.25), ground_truth (0.60), efficiency (0.15).
+     """
+
+     def __init__(
+         self,
+         w_structural: float = 0.25,
+         w_ground_truth: float = 0.60,
+         w_efficiency: float = 0.15,
+     ):
+         self.w_structural = w_structural
+         self.w_ground_truth = w_ground_truth
+         self.w_efficiency = w_efficiency
+
+     def calculate(
+         self,
+         episode: EpisodeLog,
+         scenario: Scenario,
+         outcome_results: List[float],
+     ) -> RewardBreakdown:
+         breakdown = RewardBreakdown()
+
+         breakdown.structural = self._structural_score(episode, scenario)
+         breakdown.ground_truth = self._ground_truth_score(outcome_results)
+         breakdown.efficiency = self._efficiency_score(episode, scenario)
+         breakdown.penalty = self._hallucination_penalty(episode, outcome_results)
+
+         breakdown.total = (
+             self.w_structural * breakdown.structural
+             + self.w_ground_truth * breakdown.ground_truth
+             + self.w_efficiency * breakdown.efficiency
+             + breakdown.penalty
+         )
+         breakdown.total = max(-1.0, min(1.0, breakdown.total))
+
+         breakdown.details = {
+             "tools_expected": scenario.expected_tools,
+             "tools_used": episode.tools_used,
+             "outcome_checks_score_sum": sum(outcome_results),
+             "outcome_checks_total": len(outcome_results),
+             "outcome_checks_avg": sum(outcome_results) / len(outcome_results) if outcome_results else 0.0,
+             "steps_taken": len(episode.steps),
+             "max_steps": scenario.max_steps,
+         }
+
+         return breakdown
+
+     def _structural_score(self, episode: EpisodeLog, scenario: Scenario) -> float:
+         if not episode.steps:
+             return 0.0
+
+         expected = set(scenario.expected_tools)
+         used = episode.tools_used_set
+
+         intersection = expected & used
+         precision = len(intersection) / len(used) if used else 0.0
+         recall = len(intersection) / len(expected) if expected else 0.0
+         f1 = (
+             2 * precision * recall / (precision + recall)
+             if (precision + recall) > 0
+             else 0.0
+         )
+
+         success_rate = sum(1 for s in episode.steps if s.success) / len(episode.steps)
+
+         unexpected_calls = sum(
+             1 for s in episode.steps if s.tool_name not in expected
+         )
+         unexpected_ratio = unexpected_calls / len(episode.steps)
+
+         return max(0.0, 0.6 * f1 + 0.4 * success_rate - unexpected_ratio * 0.3)
+
+     def _ground_truth_score(self, outcome_results: List[float]) -> float:
+         if not outcome_results:
+             return 0.0
+         return sum(outcome_results) / len(outcome_results)
+
+     def _efficiency_score(self, episode: EpisodeLog, scenario: Scenario) -> float:
+         if not episode.steps:
+             return 0.0
+         return max(0.0, 1.0 - len(episode.steps) / scenario.max_steps)
+
+     def _hallucination_penalty(
+         self, episode: EpisodeLog, outcome_results: List[float]
+     ) -> float:
+         if not episode.steps or not outcome_results:
+             return 0.0
+
+         all_calls_succeeded = all(s.success for s in episode.steps)
+         pass_rate = sum(outcome_results) / len(outcome_results)
+
+         if all_calls_succeeded and pass_rate == 0.0:
+             return -0.5
+         if all_calls_succeeded and pass_rate < 0.3:
+             return -0.2
+
+         return 0.0
+
+
+ # ── Per-Step Reward Transform (openenv mode) ──
+
+
+ class StepRewardTransform(Transform):
+     """
+     Gym-agnostic per-step reward transform.
+
+     Sets observation.reward based on tool call success/failure.
+     Subclass for gym-specific logic (see transforms.py).
+     """
+
+     def __call__(self, observation: Observation) -> Observation:
+         reward = self._compute_reward(observation)
+         observation.reward = reward
+         return observation
+
+     def _compute_reward(self, observation: Observation) -> float:
+         if isinstance(observation, CallToolObservation):
+             if observation.error is not None:
+                 return -0.5
+             return 1.0
+         return 0.0
+
+
+ class OpenEnvRewardCalculator:
+     """
+     Combines per-step transform rewards with ground truth verification.
+
+     Used as the alternative to RewardCalculator when --reward-mode openenv.
+
+     Quality is sign-based: only the sign of per-step rewards matters
+     (positive = productive, negative = harmful, zero = neutral).
+     """
+
+     def __init__(
+         self,
+         w_quality: float = 0.25,
+         w_efficiency: float = 0.15,
+         w_ground_truth: float = 0.60,
+     ):
+         self.w_quality = w_quality
+         self.w_efficiency = w_efficiency
+         self.w_ground_truth = w_ground_truth
+
+     def calculate(
+         self,
+         step_rewards: List[float],
+         outcome_results: List[bool],
+         max_steps: int = 0,
+         actual_steps: int = 0,
+     ) -> RewardBreakdown:
+         productive = sum(1 for r in step_rewards if r > 0)
+         harmful = sum(1 for r in step_rewards if r < 0)
+         active = productive + harmful
+         quality = productive / active if active > 0 else 0.0
+
+         if max_steps > 0 and actual_steps > 0:
+             efficiency = max(0.0, 1.0 - actual_steps / max_steps)
+         else:
+             efficiency = 0.0
+
+         gt_score = sum(outcome_results) / len(outcome_results) if outcome_results else 0.0
+
+         penalty = 0.0
+         if step_rewards and outcome_results:
+             all_positive = all(r > 0 for r in step_rewards)
+             if all_positive and gt_score == 0.0:
+                 penalty = -0.5
+             elif all_positive and gt_score < 0.3:
+                 penalty = -0.2
+
+         total = (
+             self.w_quality * quality
+             + self.w_efficiency * efficiency
+             + self.w_ground_truth * gt_score
+             + penalty
+         )
+         total = max(-1.0, min(1.0, total))
+
+         return RewardBreakdown(
+             structural=quality,
+             ground_truth=gt_score,
+             efficiency=efficiency,
+             penalty=penalty,
+             total=total,
+             details={
+                 "reward_mode": "openenv",
+                 "productive_steps": productive,
+                 "harmful_steps": harmful,
+                 "neutral_steps": len(step_rewards) - active,
+                 "actual_steps": actual_steps,
+                 "max_steps": max_steps,
+             },
+         )
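As a sanity check on the openenv-mode formula (`0.25 * quality + 0.15 * efficiency + 0.60 * ground_truth + penalty`), here is a standalone arithmetic walk-through with hypothetical numbers, reimplementing the weighted sum rather than importing the module:

```python
# Hypothetical episode: 4 steps, mixed per-step rewards, 3 of 4 checks pass.
step_rewards = [0.05, 0.05, -0.10, 0.50]
outcome_results = [True, True, False, True]
max_steps, actual_steps = 10, 4

productive = sum(1 for r in step_rewards if r > 0)      # 3 positive-sign steps
harmful = sum(1 for r in step_rewards if r < 0)         # 1 negative-sign step
quality = productive / (productive + harmful)           # 3/4 = 0.75 (sign-based)
efficiency = max(0.0, 1.0 - actual_steps / max_steps)   # 1 - 4/10 = 0.6
gt_score = sum(outcome_results) / len(outcome_results)  # 3/4 = 0.75
penalty = 0.0  # not all steps positive, so no hallucination penalty applies

total = 0.25 * quality + 0.15 * efficiency + 0.60 * gt_score + penalty
print(round(total, 4))  # 0.1875 + 0.09 + 0.45 = 0.7275
```

Note how the magnitudes of the per-step rewards never enter the quality term; only their signs do, as the class docstring states.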
rewards/checks.py ADDED
@@ -0,0 +1,230 @@
+ """
+ Spreadsheet outcome checks — ground truth from the episode trajectory.
+
+ Ground truth is reconstructed from the episode log:
+ - submit_workbook tool call: pass_rate, per-check results
+ - validate_partial calls: intermediate progress
+ - write_cell/write_range calls: edit targets and counts
+
+ Check types:
+ - hidden_test_pass_rate : fraction of hidden checks that passed on final submit
+
+ Layer 2 (custom mode) uses RewardCalculator from rewards/base.py with
+ structural + ground_truth + efficiency. This checker provides outcome_results.
+
+ Additionally, compute_custom_reward() provides the 6-component detailed
+ breakdown specified in the spreadsheet gym prompt:
+     hidden_test_pass_rate (0.40)
+     formula_correctness   (0.20)
+     edit_efficiency       (0.10)
+     invalid_edit_penalty  (0.10)
+     structural_integrity  (0.10)
+     debugging_quality     (0.10)
+ """
+
+ from __future__ import annotations
+
+ import json
+ from typing import Any, Dict, List, Optional
+
+
+ WRITE_TOOLS = frozenset({"write_cell", "write_range"})
+ READ_TOOLS = frozenset({"read_range", "read_cell"})
+
+
+ def _parse_result(result: Any) -> dict:
+     if isinstance(result, dict):
+         return result
+     if isinstance(result, str):
+         try:
+             return json.loads(result)
+         except (json.JSONDecodeError, TypeError):
+             return {}
+     if hasattr(result, "data") and isinstance(result.data, dict):
+         return result.data
+     return {}
+
+
+ class SpreadsheetChecker:
+     """Verifies outcomes from the episode log + scenario outcome_checks."""
+
+     def __init__(self, api_url: str | None = None, session_id: str | None = None):
+         self._api_url = api_url
+         self._session_id = session_id
+         self._steps: List[dict] = []
+
+     def set_episode(self, episode) -> None:
+         """Populate from the episode trajectory (called by run_eval before check_all)."""
+         self._steps = []
+         for step in episode.steps:
+             self._steps.append({
+                 "tool_name": step.tool_name,
+                 "arguments": step.arguments or {},
+                 "success": step.success,
+                 "result": _parse_result(step.result),
+             })
+
+     def set_session(self, session_id: str) -> None:
+         self._session_id = session_id
+
+     def check_all(self, checks: List[Dict[str, Any]]) -> List[bool]:
+         return [self._run_check(c) for c in checks]
+
+     def _run_check(self, check: Dict[str, Any]) -> bool:
+         check_type = check.get("type", "")
+         if check_type == "hidden_test_pass_rate":
+             return self._check_hidden_test_pass_rate(check)
+         return False
+
+     def _check_hidden_test_pass_rate(self, check: dict) -> bool:
+         """Check whether submit_workbook achieved the minimum pass rate."""
+         min_rate = check.get("min_pass_rate", 0.5)
+         submit_result = self._get_last_submit_result()
+         if not submit_result:
+             return False
+         return submit_result.get("pass_rate", 0) >= min_rate
+
+     def _get_last_submit_result(self) -> dict:
+         """Extract the result from the last submit_workbook call."""
+         for s in reversed(self._steps):
+             if s["tool_name"] == "submit_workbook" and s["success"]:
+                 return s.get("result", {})
+         return {}
+
+     # ── 6-component custom reward (Layer 2 detailed breakdown) ──
+
+     def compute_custom_reward(self) -> Dict[str, float]:
+         """
+         6-component detailed custom reward.
+
+         Returns a dict with component scores (each 0.0–1.0) and the weighted total.
+         """
+         submit = self._get_last_submit_result()
+
+         hidden = self._score_hidden_test_pass_rate(submit)
+         formula = self._score_formula_correctness(submit)
+         efficiency = self._score_edit_efficiency()
+         invalid = self._score_invalid_edit_penalty()
+         integrity = self._score_structural_integrity()
+         debugging = self._score_debugging_quality()
+
+         total = (
+             0.40 * hidden
+             + 0.20 * formula
+             + 0.10 * efficiency
+             + 0.10 * (1.0 - invalid)
+             + 0.10 * integrity
+             + 0.10 * debugging
+         )
+
+         return {
+             "hidden_test_pass_rate": round(hidden, 4),
+             "formula_correctness": round(formula, 4),
+             "edit_efficiency": round(efficiency, 4),
+             "invalid_edit_penalty": round(invalid, 4),
+             "structural_integrity": round(integrity, 4),
+             "debugging_quality": round(debugging, 4),
+             "total": round(max(0.0, min(1.0, total)), 4),
+         }
+
+     def _score_hidden_test_pass_rate(self, submit: dict) -> float:
+         """Fraction of hidden checks that passed."""
+         return submit.get("pass_rate", 0.0) if submit else 0.0
+
+     def _score_formula_correctness(self, submit: dict) -> float:
+         """Among formula-type checks, what fraction passed?"""
+         if not submit:
+             return 0.0
+         details = submit.get("details", [])
+         if not details:
+             return self._score_hidden_test_pass_rate(submit)
+
+         formula_checks = [d for d in details if d.get("check_type") == "expected_formula"]
+         if not formula_checks:
+             return 1.0
+         passed = sum(1 for d in formula_checks if d.get("passed"))
+         return passed / len(formula_checks)
+
+     def _score_edit_efficiency(self) -> float:
+         """Ratio of minimum plausible edits to actual write steps. Fewer steps = higher score."""
+         write_steps = [s for s in self._steps if s["tool_name"] in WRITE_TOOLS and s["success"]]
+         if not write_steps:
+             return 0.0
+         unique_targets = set()
+         for s in write_steps:
+             args = s["arguments"]
+             sheet = args.get("sheet", "")
+             cell = args.get("cell", args.get("start_cell", ""))
+             unique_targets.add(f"{sheet}:{cell}")
+         min_edits = len(unique_targets)
+         actual_edits = len(write_steps)
+         return min(min_edits / actual_edits, 1.0)
+
+     def _score_invalid_edit_penalty(self) -> float:
+         """Fraction of writes that targeted non-output areas (0.0 = no invalid edits)."""
+         write_steps = [s for s in self._steps if s["tool_name"] in WRITE_TOOLS and s["success"]]
+         if not write_steps:
+             return 0.0
+         invalid = sum(
+             1 for s in write_steps
+             if isinstance(s.get("result"), dict) and s["result"].get("outside_target")
+         )
+         return invalid / len(write_steps)
+
+     def _score_structural_integrity(self) -> float:
+         """
+         Did the agent preserve existing correct data?
+
+         We check whether the final submit had any checks with 'overwrite_detected'
+         failures. If no destructive overwrites were detected, score = 1.0.
+         """
+         submit = self._get_last_submit_result()
+         if not submit:
+             return 0.5
+         details = submit.get("details", [])
+         if not details:
+             return 1.0
+         total = len(details)
+         overwrites = sum(1 for d in details if d.get("overwrite_detected"))
+         return 1.0 - (overwrites / total)
+
+     def _score_debugging_quality(self) -> float:
+         """Evidence of reading before writing and inspecting formulas before modifying."""
+         if not self._steps:
+             return 0.0
+
+         score = 0.0
+         components = 0
+
+         read_before_write = self._has_read_before_write_pattern()
+         components += 1
+         score += 1.0 if read_before_write else 0.0
+
+         inspect_count = sum(1 for s in self._steps if s["tool_name"] == "inspect_formula")
+         if inspect_count > 0:
+             components += 1
+             score += 1.0
+
+         validate_count = sum(1 for s in self._steps if s["tool_name"] == "validate_partial")
+         if validate_count > 0:
+             components += 1
+             score += 1.0
+
+         list_sheets = any(s["tool_name"] == "list_sheets" for s in self._steps)
+         if list_sheets:
+             components += 1
+             score += 1.0
+
+         return score / max(components, 1)
+
+     def _has_read_before_write_pattern(self) -> bool:
+         """Check if at least one write was preceded by a read within 4 steps."""
+         for i, s in enumerate(self._steps):
+             if s["tool_name"] not in WRITE_TOOLS:
+                 continue
+             lookback = self._steps[max(0, i - 4):i]
+             if any(p["tool_name"] in READ_TOOLS for p in lookback):
+                 return True
+         return False
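The lookback used by `_has_read_before_write_pattern` is simple enough to demonstrate on tool names alone. A self-contained sketch of the same sliding-window idea (toy trajectories, not taken from the commit):

```python
WRITE_TOOLS = {"write_cell", "write_range"}
READ_TOOLS = {"read_range", "read_cell"}

def has_read_before_write(steps, window=4):
    # A write counts as "informed" if any of the preceding `window`
    # steps was a read; one such write is enough for the whole episode.
    for i, tool in enumerate(steps):
        if tool not in WRITE_TOOLS:
            continue
        if any(p in READ_TOOLS for p in steps[max(0, i - window):i]):
            return True
    return False

print(has_read_before_write(["list_sheets", "read_range", "write_cell"]))  # True
print(has_read_before_write(["list_sheets", "write_cell"]))                # False
```

Because one informed write satisfies the whole check, an agent that reads once and then writes blindly ten times still earns this debugging-quality component.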
rewards/transforms.py ADDED
@@ -0,0 +1,191 @@
+ """
+ Spreadsheet per-step reward transform (Layer 3).
+
+ Used when: --reward-mode openenv --gym spreadsheet
+
+ Scoring by tool:
+
+ read_range / read_cell:
+     Successful read → +0.02 (neutral, slightly positive)
+
+ write_cell / write_range:
+     Write to target region → +0.05
+     Write preceded by recent read (≤4 steps) → +0.05 (on top of base)
+     Write outside target region → -0.10
+     Repeated write to same cell (≥3) → -0.05
+
+ inspect_formula:
+     Successful → +0.05
+
+ validate_partial:
+     Called successfully → +0.05
+     Shows improvement over last validate → +0.10
+
+ submit_workbook:
+     All checks pass (pass_rate == 1.0) → +0.50
+     Partial pass (pass_rate > 0.5) → +0.20
+     Mostly failing (pass_rate < 0.3) → -0.10
+     No prior validate_partial call → -0.05 (unsupported submission)
+
+ list_sheets / list_scenarios / get_session_info / load_scenario /
+ get_edit_history / reset_scenario / list_named_targets:
+     Successful → 0.0 (neutral)
+
+ Error on any non-neutral tool → -0.05
+ """
+
+ from __future__ import annotations
+
+ import json
+ from typing import Any
+
+ from openenv.core.env_server.mcp_types import CallToolObservation
+ from openenv.core.env_server.types import Observation
+
+ from .base import StepRewardTransform
+
+ WRITE_TOOLS = frozenset({"write_cell", "write_range"})
+ READ_TOOLS = frozenset({"read_range", "read_cell"})
+ NEUTRAL_TOOLS = frozenset({
+     "list_sheets", "list_scenarios", "get_session_info",
+     "load_scenario", "get_edit_history", "reset_scenario",
+     "list_named_targets", "list_tools",
+ })
+
+
+ def _extract_result(observation) -> Any:
+     result = getattr(observation, "result", None)
+     if hasattr(result, "data"):
+         return result.data
+     if isinstance(result, dict) and "data" in result:
+         return result["data"]
+     if isinstance(result, str):
+         try:
+             return json.loads(result)
+         except (json.JSONDecodeError, TypeError):
+             return result
+     return result
+
+
+ class SpreadsheetStepTransform(StepRewardTransform):
+     """Per-step reward for Spreadsheet gym (Layer 3, trajectory-aware)."""
+
+     def __init__(self, scenario: dict | None = None):
+         super().__init__()
+         self._scenario = scenario or {}
+         self._recent_tools: list[str] = []
+         self._write_counts: dict[str, int] = {}
+         self._last_validate_passed: int = 0
+         self._has_validated: bool = False
+
+     def set_scenario(self, scenario: Any) -> None:
+         """Set scenario context (called by runner at start of each scenario)."""
+         if hasattr(scenario, "id"):
+             self._scenario = {"id": scenario.id}
+         elif isinstance(scenario, dict):
+             self._scenario = scenario
+         self._recent_tools = []
+         self._write_counts = {}
+         self._last_validate_passed = 0
+         self._has_validated = False
+
+     def _compute_reward(self, observation: Observation) -> float:
+         if not isinstance(observation, CallToolObservation):
+             return 0.0
+
+         tool_name = getattr(observation, "tool_name", "") or ""
+         result = _extract_result(observation)
+         if not isinstance(result, dict):
+             result = {}
+
+         has_error = (
+             observation.error is not None
+             or (isinstance(result, dict) and result.get("error"))
+         )
+
+         if tool_name in NEUTRAL_TOOLS:
+             self._recent_tools.append(tool_name)
+             return 0.0
+
+         if has_error:
+             self._recent_tools.append(tool_name)
+             return -0.05
+
+         reward = self._score_tool(tool_name, result)
+         self._recent_tools.append(tool_name)
+         return reward
+
+     def _score_tool(self, tool_name: str, result: dict) -> float:
+         if tool_name in READ_TOOLS:
+             return 0.02
+
+         if tool_name == "inspect_formula":
+             return 0.05
+
+         if tool_name in WRITE_TOOLS:
+             return self._score_write(tool_name, result)
+
+         if tool_name == "validate_partial":
+             return self._score_validate(result)
+
+         if tool_name == "submit_workbook":
+             return self._score_submit(result)
+
+         return 0.0
+
+     def _score_write(self, tool_name: str, result: dict) -> float:
+         outside_target = result.get("outside_target", False)
+         if outside_target:
+             return -0.10
+
+         reward = 0.05
+
+         lookback = self._recent_tools[-4:]
+         if any(t in READ_TOOLS for t in lookback):
+             reward += 0.05
+
+         cell_key = f"{result.get('sheet', '')}:{result.get('cell', result.get('start_cell', ''))}"
+         self._write_counts[cell_key] = self._write_counts.get(cell_key, 0) + 1
+         if self._write_counts[cell_key] >= 3:
+             reward -= 0.05
+
+         return reward
+
+     def _score_validate(self, result: dict) -> float:
+         self._has_validated = True
+         new_passed = result.get("passed", 0)
+         if new_passed > self._last_validate_passed:
+             self._last_validate_passed = new_passed
+             return 0.10
+         self._last_validate_passed = new_passed
+         return 0.05
+
+     def _score_submit(self, result: dict) -> float:
+         pass_rate = result.get("pass_rate", 0)
+         reward = 0.0
+
+         if pass_rate == 1.0:
+             reward = 0.50
+         elif pass_rate > 0.5:
+             reward = 0.20
+         elif pass_rate < 0.3:
+             reward = -0.10
+
+         if not self._has_validated:
+             reward -= 0.05
+
+         return reward
+
+
+ def transform(trajectory: list, scenario: dict) -> list:
+     """
+     Apply per-step rewards to trajectory (used by run_eval transform_factory).
+
+     Returns trajectory with each step's reward populated.
+     """
+     t = SpreadsheetStepTransform(scenario=scenario)
+     for step in trajectory:
+         if hasattr(step, "observation"):
+             obs = step.observation
+             step.reward = t._compute_reward(obs)
+     return trajectory
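The `submit_workbook` thresholds in the docstring can be checked with a standalone sketch of `_score_submit` (thresholds copied from the module above; `has_validated` mirrors the unsupported-submission penalty):

```python
# Sketch of the submit_workbook scoring: pass_rate buckets, plus a -0.05
# penalty when the agent never called validate_partial before submitting.
def score_submit(pass_rate: float, has_validated: bool) -> float:
    reward = 0.0
    if pass_rate == 1.0:
        reward = 0.50
    elif pass_rate > 0.5:
        reward = 0.20
    elif pass_rate < 0.3:
        reward = -0.10
    if not has_validated:
        reward -= 0.05
    return reward

print(score_submit(1.0, True))               # 0.5
print(score_submit(0.6, True))               # 0.2
print(round(score_submit(0.1, False), 2))    # -0.15
```

Note the gap: a pass_rate in [0.3, 0.5] earns neither bonus nor penalty.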
run_eval.py ADDED
@@ -0,0 +1,820 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Evaluation Runner β€” run an LLM agent against Spreadsheet gym scenarios.
4
+
5
+ Single-gym version of the repo-level run_eval.py, tailored for the
6
+ spreadsheet environment. No --gym flag needed.
7
+
8
+ Usage:
9
+ # Single model
10
+ python run_eval.py --model gpt-5.4 --save --trajectory
11
+
12
+ # Multiple models in parallel
13
+ python run_eval.py --model gpt-5.4,claude-sonnet-4-6,claude-opus-4-6 --parallel 3 --save --trajectory
14
+
15
+ # Specific scenario
16
+ python run_eval.py --model gpt-5.4 --scenario formula_repair_01
17
+
18
+ # OpenEnV per-step reward mode
19
+ python run_eval.py --model gpt-5.4 --reward-mode openenv --save --trajectory
20
+
21
+ Prerequisites:
22
+ 1. pip install -e .
23
+ 2. docker build -t openenv-spreadsheet -f server/Dockerfile .
24
+ 3. docker run -d --name spreadsheet -p 8000:8000 openenv-spreadsheet
25
+ """
26
+
27
+ import argparse
28
+ import json
29
+ import logging
30
+ import os
31
+ import sys
32
+ import time
33
+ from concurrent.futures import ThreadPoolExecutor, as_completed
34
+ from datetime import datetime, timezone, timedelta
35
+ from typing import Any, Dict, List
36
+
37
+ IST = timezone(timedelta(hours=5, minutes=30))
38
+
39
+ from dotenv import load_dotenv
40
+
41
+ load_dotenv(os.path.join(os.path.dirname(__file__), ".env"))
42
+
43
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
44
+
45
+ from openenv import AutoEnv
46
+
47
+ from agent.runner import AgentRunner
48
+ from rewards.base import RewardBreakdown
49
+ from rewards.checks import SpreadsheetChecker
50
+ from rewards.transforms import SpreadsheetStepTransform
51
+ from scenarios.definitions import SPREADSHEET_SCENARIOS
52
+
53
+ logger = logging.getLogger(__name__)
54
+
55
+ GYM_NAME = "spreadsheet"
56
+ OUTPUT_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "outputs")
57
+
58
+
59
+ def _resolve_base_url() -> str:
60
+ import importlib.resources
61
+ import yaml
62
+
63
+ try:
64
+ ref = importlib.resources.files(GYM_NAME).joinpath("openenv.yaml")
65
+ with importlib.resources.as_file(ref) as f:
66
+ manifest = yaml.safe_load(f.read_text())
67
+ port = manifest.get("port", 8000)
68
+ return f"http://localhost:{port}"
69
+ except Exception:
70
+ logger.warning("Could not read openenv.yaml, defaulting to port 8000")
71
+ return "http://localhost:8000"
72
+
73
+
74
+ def _fetch_gym_metadata(base_url: str) -> dict | None:
75
+ import httpx
76
+
77
+ try:
78
+ resp = httpx.get(f"{base_url}/metadata", timeout=5.0)
79
+ resp.raise_for_status()
80
+ data = resp.json()
81
+ data.pop("readme_content", None)
82
+ return data
83
+ except Exception as e:
84
+ logger.debug(f"Failed to fetch /metadata from {base_url}: {e}")
85
+ return None
86
+
87
+
88
+ def divider(text: str = ""):
89
+ print(f"\n{'=' * 70}")
90
+ if text:
91
+ print(f" {text}")
92
+ print(f"{'=' * 70}")
93
+
94
+
95
+ def print_breakdown(breakdown: RewardBreakdown):
96
+ print(breakdown.summary())
97
+ print()
98
+ print(f" Details: {breakdown.details}")
99
+
100
+
101
+ def save_results_to_markdown(
102
+ results: List[Dict[str, Any]],
103
+ model: str,
104
+ output_path: str,
105
+ total_elapsed: float,
106
+ temperature: float,
107
+ run_id: str = "",
108
+ reward_mode: str = "custom",
109
+ gym_version: str = "unknown",
110
+ ):
111
+ os.makedirs(os.path.dirname(output_path), exist_ok=True)
112
+
113
+ timestamp = datetime.now(IST).strftime("%Y-%m-%d %H:%M:%S")
114
+ is_new_file = not os.path.exists(output_path)
115
+
116
+ with open(output_path, "a") as f:
117
+ if is_new_file:
118
+ f.write(f"# Spreadsheet Gym β€” Evaluation Results\n\n")
119
+ f.write(f"**Run ID**: `{run_id}` \n")
120
+ f.write(f"**Gym Version**: `{gym_version}`\n\n")
121
+ f.write(f"Evaluation results for the **spreadsheet** gym across different LLM models.\n\n")
122
+ if reward_mode == "openenv":
123
+ f.write(f"**Reward Mode**: `openenv` β€” per-step rewards from `rewards/transforms.py` + ground truth\n\n")
124
+ f.write(f"Each model is evaluated on the same set of scenarios. ")
125
+ f.write(f"Rewards are computed using OpenEnv transforms:\n")
126
+ f.write(f"- **Quality** (0.25) β€” fraction of productive steps\n")
127
+ f.write(f"- **Ground Truth** (0.60) β€” episode outcome checks\n")
128
+ f.write(f"- **Efficiency** (0.15) β€” step budget usage\n")
129
+ f.write(f"- **Hallucination Penalty** β€” tools say success but ground truth disagrees\n\n")
130
+ else:
131
+ f.write(f"**Reward Mode**: `custom` β€” episode-level rewards from `rewards/base.py`\n\n")
132
+ f.write(f"Each model is evaluated on the same set of scenarios. ")
133
+ f.write(f"Rewards are computed by `rewards/base.py` using:\n")
134
+ f.write(f"- **Structural** (0.25) β€” right tools called, no errors\n")
135
+ f.write(f"- **Ground Truth** (0.60) β€” episode outcome checks\n")
136
+ f.write(f"- **Efficiency** (0.15) β€” solved in reasonable steps\n")
137
+ f.write(f"- **Hallucination Penalty** β€” tools say success but ground truth disagrees\n\n")
138
+ f.write(f"Trajectories: `outputs/trajectories/{run_id}/`\n\n")
139
+ f.write(f"---\n\n")
140
+
141
+ safe_model = model.replace("/", "_").replace(":", "_")
142
+ f.write(f"## Model: `{model}`\n\n")
143
+ f.write(f"- **Date**: {timestamp}\n")
144
+ f.write(f"- **Temperature**: {temperature}\n")
145
+ f.write(f"- **Reward Mode**: {reward_mode}\n")
146
+ f.write(f"- **Total Time**: {total_elapsed:.1f}s\n")
147
+ f.write(f"- **Trajectory**: `outputs/trajectories/{run_id}/{safe_model}.json`\n\n")
148
+
149
+ if reward_mode == "openenv":
150
+ f.write(f"| Scenario | Quality | Ground Truth | Penalty | **Total** | Steps | Time |\n")
151
+ f.write(f"|---|:---:|:---:|:---:|:---:|:---:|:---:|\n")
152
+ else:
153
+ f.write(f"| Scenario | Structural | Ground Truth | Efficiency | Penalty | **Total** | Steps | Time |\n")
154
+ f.write(f"|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n")
155
+
156
+ total_reward = 0.0
157
+ for r in results:
158
+ bd = r.get("breakdown")
159
+ if bd:
160
+ if reward_mode == "openenv":
161
+ f.write(
162
+ f"| {r['scenario']} "
163
+ f"| {bd.structural:.2f} "
164
+ f"| {bd.ground_truth:.2f} "
165
+ f"| {bd.penalty:.2f} "
166
+ f"| **{bd.total:.2f}** "
167
+ f"| {r['steps']} "
168
+ f"| {r['elapsed']:.1f}s |\n"
169
+ )
170
+ else:
171
+ f.write(
172
+ f"| {r['scenario']} "
173
+ f"| {bd.structural:.2f} "
174
+ f"| {bd.ground_truth:.2f} "
175
+ f"| {bd.efficiency:.2f} "
176
+ f"| {bd.penalty:.2f} "
177
+ f"| **{bd.total:.2f}** "
178
+ f"| {r['steps']} "
179
+ f"| {r['elapsed']:.1f}s |\n"
180
+ )
181
+ total_reward += bd.total
182
+ else:
183
+ cols = "| β€” | β€” | β€” " if reward_mode == "openenv" else "| β€” | β€” | β€” | β€” "
184
+ f.write(
185
+ f"| {r['scenario']} "
186
+ f"{cols}"
187
+ f"| **ERROR** "
188
+ f"| {r['steps']} "
189
+ f"| {r['elapsed']:.1f}s |\n"
190
+ )
191
+
192
+ avg = total_reward / len(results) if results else 0.0
193
+ f.write(f"\n**Average Reward: {avg:.2f}**\n\n")
194
+ f.write(f"---\n\n")
195
+
196
+ logger.info(f"Results saved to {output_path}")
197
+
198
+
199
+ def save_trajectory(
200
+ results: List[Dict[str, Any]],
201
+ scenarios: list,
202
+ model: str,
203
+ temperature: float,
204
+ total_elapsed: float,
205
+ run_id: str = "",
206
+ reward_mode: str = "custom",
207
+ gym_version: str = "unknown",
208
+ ):
209
+ run_ts = datetime.now(IST).isoformat()
210
+
211
+ safe_model = model.replace("/", "_").replace(":", "_")
212
+ filename = f"{safe_model}.json"
213
+
214
+ traj_dir = os.path.join(OUTPUT_DIR, "trajectories", run_id)
215
+ os.makedirs(traj_dir, exist_ok=True)
216
+ filepath = os.path.join(traj_dir, filename)
217
+
218
+ trajectory = {
219
+ "run_id": run_id or "untagged",
220
+ "model": model,
221
+ "gym": GYM_NAME,
222
+ "gym_version": gym_version,
223
+ "timestamp": run_ts,
224
+ "temperature": temperature,
225
+ "reward_mode": reward_mode,
226
+ "total_elapsed_s": round(total_elapsed, 2),
227
+ "total_scenarios": len(results),
228
+ "scenarios": [],
229
+ }
230
+
231
+ for r, scenario in zip(results, scenarios):
232
+ scenario_entry = {
233
+ "scenario_id": scenario.id,
234
+ "prompt": scenario.prompt,
235
+ "expected_tools": scenario.expected_tools,
236
+ "max_steps": scenario.max_steps,
237
+ "elapsed_s": round(r["elapsed"], 2),
238
+ }
239
+
240
+ episode = r.get("episode")
241
+ if episode:
242
+ steps = []
243
+ for i, step in enumerate(episode.steps, 1):
244
+ result_data = step.result
245
+ if isinstance(result_data, str):
246
+ try:
247
+ result_data = json.loads(result_data)
248
+ except (json.JSONDecodeError, TypeError):
249
+ pass
250
+
251
+ steps.append({
252
+ "step": i,
253
+ "timestamp": step.timestamp,
254
+ "tool_name": step.tool_name,
255
+ "arguments": step.arguments,
256
+ "success": step.success,
257
+ "result": result_data,
258
+ "error": step.error,
259
+ "elapsed_s": round(step.elapsed, 3),
260
+ })
261
+ scenario_entry["steps"] = steps
262
+ scenario_entry["total_steps"] = len(steps)
263
+ else:
264
+ scenario_entry["steps"] = []
265
+ scenario_entry["total_steps"] = 0
266
+ scenario_entry["error"] = r.get("error", "Unknown error")
267
+
268
+ outcome_results = r.get("outcome_results", [])
269
+ checks = []
270
+ for check_def, passed in zip(scenario.outcome_checks, outcome_results):
271
+ checks.append({
272
+ "check": check_def,
273
+ "passed": passed,
274
+ })
275
+ scenario_entry["outcome_checks"] = checks
276
+
277
+ bd = r.get("breakdown")
278
+ if bd:
279
+ scenario_entry["reward"] = {
280
+ "structural": round(bd.structural, 4),
281
+ "ground_truth": round(bd.ground_truth, 4),
282
+ "efficiency": round(bd.efficiency, 4),
283
+ "penalty": round(bd.penalty, 4),
284
+ "total": round(bd.total, 4),
285
+ }
286
+ else:
287
+ scenario_entry["reward"] = None
288
+
289
+ trajectory["scenarios"].append(scenario_entry)
290
+
291
+ totals = [s["reward"]["total"] for s in trajectory["scenarios"] if s.get("reward")]
292
+ trajectory["avg_reward"] = round(sum(totals) / len(totals), 4) if totals else 0.0
293
+
294
+ with open(filepath, "w") as f:
295
+ json.dump(trajectory, f, indent=2, default=str)
296
+
297
+ print(f"\n Trajectory saved: {filepath}")
298
+ logger.info(f"Trajectory saved to {filepath}")
299
+ return filepath
300
+
301
+
302
+ # ── Model Workers ──
303
+
304
+ def _run_single_model(
305
+ model: str,
306
+ base_url: str,
307
+ scenarios: list,
308
+ temperature: float,
309
+ max_tokens: int,
310
+ reward_mode: str,
311
+ run_id: str,
312
+ save: bool,
313
+ trajectory: bool,
314
+ verbose: bool,
315
+ gym_version: str = "unknown",
316
+ ) -> Dict[str, Any]:
317
+ model_start = time.time()
318
+ model_results = []
319
+
320
+ def _connect():
321
+ client = AutoEnv.from_env(GYM_NAME, base_url=base_url)
322
+ client.__enter__()
323
+ xform = SpreadsheetStepTransform() if reward_mode == "openenv" else None
324
+ rnr = AgentRunner(
325
+ model=model,
326
+ env_client=client,
327
+ temperature=temperature,
328
+ max_tokens=max_tokens,
329
+ reward_mode=reward_mode,
330
+ transform=xform,
331
+ )
332
+ return client, rnr
333
+
334
+ env_client, runner = _connect()
335
+ checker = SpreadsheetChecker()
336
+
337
+ WS_RETRY_ERRORS = ("ConnectionClosed", "ConnectionClosedOK", "ConnectionClosedError", "sent 1000")
338
+ MAX_WS_RETRIES = 3
339
+
340
+ try:
341
+ for i, scenario in enumerate(scenarios, 1):
342
+ print(f"\n [{model}] Scenario {i}/{len(scenarios)}: {scenario.id}")
343
+
344
+ start = time.time()
345
+ last_error = None
346
+ for attempt in range(MAX_WS_RETRIES + 1):
347
+ try:
348
+ if attempt > 0:
349
+ logger.info(f"[{model}] Reconnecting (attempt {attempt + 1}) for {scenario.id}")
350
+ print(f" [{model}] Reconnecting WebSocket (attempt {attempt + 1})...")
351
+ try:
352
+ env_client.__exit__(None, None, None)
353
+ except Exception:
354
+ pass
355
+ time.sleep(2 * attempt)
356
+ env_client, runner = _connect()
357
+
358
+ episode, breakdown = runner.run_scenario(scenario, checker)
359
+ elapsed = time.time() - start
360
+
361
+ if hasattr(checker, "set_episode"):
362
+ checker.set_episode(episode)
363
+
364
+ outcome_results = checker.check_all(scenario.outcome_checks)
365
+
366
+ model_results.append({
367
+ "scenario": scenario.id,
368
+ "total_reward": breakdown.total,
369
+ "breakdown": breakdown,
370
+ "steps": len(episode.steps),
371
+ "elapsed": elapsed,
372
+ "episode": episode,
373
+ "outcome_results": outcome_results,
374
+ })
375
+
376
+ print(f" [{model}] {scenario.id}: {breakdown.total:.2f} ({len(episode.steps)} steps, {elapsed:.1f}s)")
377
+ last_error = None
378
+ break
379
+
380
+ except Exception as e:
381
+ last_error = e
382
+ is_ws_error = any(tok in type(e).__name__ or tok in str(e) for tok in WS_RETRY_ERRORS)
383
+ if is_ws_error and attempt < MAX_WS_RETRIES:
384
+ logger.warning(f"[{model}] WebSocket error on {scenario.id}: {e}")
385
+ continue
386
+ raise
387
+
388
+ if last_error is not None:
389
+ elapsed = time.time() - start
390
+ logger.exception(f"[{model}] Scenario {scenario.id} failed")
391
+ model_results.append({
392
+ "scenario": scenario.id,
393
+ "total_reward": 0.0,
394
+ "breakdown": None,
395
+ "steps": 0,
396
+ "elapsed": elapsed,
397
+ "error": str(last_error),
398
+ })
399
+ print(f" [{model}] {scenario.id}: ERROR - {last_error}")
400
+
401
+ finally:
402
+ try:
403
+ env_client.__exit__(None, None, None)
404
+ except Exception:
405
+ pass
406
+
407
+ model_elapsed = time.time() - model_start
408
+
409
+ if save:
410
+ output_path = os.path.join(OUTPUT_DIR, "results", f"{run_id}.md")
411
+ save_results_to_markdown(
412
+ results=model_results,
413
+ model=model,
414
+ output_path=output_path,
415
+ total_elapsed=model_elapsed,
416
+ temperature=temperature,
417
+ run_id=run_id,
418
+ reward_mode=reward_mode,
419
+ gym_version=gym_version,
420
+ )
421
+
422
+ if trajectory:
423
+ save_trajectory(
424
+ results=model_results,
425
+ scenarios=scenarios,
426
+ model=model,
427
+ temperature=temperature,
428
+ total_elapsed=model_elapsed,
429
+ run_id=run_id,
430
+ reward_mode=reward_mode,
431
+ gym_version=gym_version,
432
+ )
433
+
434
+ return {
435
+ "model": model,
436
+ "results": model_results,
437
+ "elapsed": model_elapsed,
438
+ }
439
+
440
+
441
+ def _run_single_model_detailed(
442
+ model: str,
443
+ base_url: str,
444
+ scenarios: list,
445
+ temperature: float,
446
+ max_tokens: int,
447
+ reward_mode: str,
448
+ run_id: str,
449
+ save: bool,
450
+ trajectory: bool,
451
+ gym_version: str = "unknown",
452
+ ) -> Dict[str, Any]:
453
+ model_start = time.time()
454
+ results = []
455
+
456
+ env_client = AutoEnv.from_env(GYM_NAME, base_url=base_url)
457
+ env_client.__enter__()
458
+
459
+ checker = SpreadsheetChecker()
460
+
461
+ transform = SpreadsheetStepTransform() if reward_mode == "openenv" else None
462
+
463
+ runner = AgentRunner(
464
+ model=model,
465
+ env_client=env_client,
466
+ temperature=temperature,
467
+ max_tokens=max_tokens,
468
+ reward_mode=reward_mode,
469
+ transform=transform,
470
+ )
471
+
472
+ try:
473
+ for i, scenario in enumerate(scenarios, 1):
474
+ divider(f"Scenario {i}/{len(scenarios)}: {scenario.id}")
475
+ print(f" Prompt: {scenario.prompt[:120]}...")
476
+ print(f" Expected tools: {scenario.expected_tools}")
477
+ print(f" Max steps: {scenario.max_steps}")
478
+ print()
479
+
480
+ start = time.time()
481
+ try:
482
+ episode, breakdown = runner.run_scenario(scenario, checker)
483
+ elapsed = time.time() - start
484
+
485
+ print()
486
+ print(" -- Agent Actions --")
487
+ for step in episode.steps:
488
+ status = "OK" if step.success else "FAIL"
489
+ args_str = _short_json(step.arguments)
490
+ print(f" [{status}] {step.tool_name}({args_str})")
491
+ print(f" Steps taken: {len(episode.steps)}")
492
+
493
+ if hasattr(checker, "set_episode"):
494
+ checker.set_episode(episode)
495
+
496
+ print()
497
+ print(" -- Ground Truth Verification --")
498
+ outcome_results = checker.check_all(scenario.outcome_checks)
499
+ for check, score in zip(scenario.outcome_checks, outcome_results):
500
+ status = "PASS" if score else "FAIL"
501
+ label = _check_label(check)
502
+ print(f" [{status}] {check['type']}: {label}")
503
+
504
+ print()
505
+ print(" -- Reward Breakdown --")
506
+ print_breakdown(breakdown)
507
+ print(f"\n Completed in {elapsed:.1f}s")
508
+
509
+ results.append({
510
+ "scenario": scenario.id,
511
+ "total_reward": breakdown.total,
512
+ "breakdown": breakdown,
513
+ "steps": len(episode.steps),
514
+ "elapsed": elapsed,
515
+ "episode": episode,
516
+ "outcome_results": outcome_results,
517
+ })
518
+
519
+ except Exception as e:
520
+ elapsed = time.time() - start
521
+ print(f"\n ERROR: {e}")
522
+ logger.exception(f"Scenario {scenario.id} failed")
523
+ results.append({
524
+ "scenario": scenario.id,
525
+ "total_reward": 0.0,
526
+ "breakdown": None,
527
+ "steps": 0,
528
+ "elapsed": elapsed,
529
+ "error": str(e),
530
+ })
531
+
532
+ finally:
533
+ env_client.__exit__(None, None, None)
534
+ logger.info("AutoEnv client disconnected.")
535
+
536
+ model_elapsed = time.time() - model_start
537
+
538
+ if save:
539
+ output_path = os.path.join(OUTPUT_DIR, "results", f"{run_id}.md")
540
+ save_results_to_markdown(
541
+ results=results,
542
+ model=model,
543
+ output_path=output_path,
544
+ total_elapsed=model_elapsed,
545
+ temperature=temperature,
546
+ run_id=run_id,
547
+ reward_mode=reward_mode,
548
+ gym_version=gym_version,
549
+ )
550
+ print(f"\n Results saved: {output_path}")
551
+
552
+ if trajectory:
553
+ save_trajectory(
554
+ results=results,
555
+ scenarios=scenarios,
556
+ model=model,
557
+ temperature=temperature,
558
+ total_elapsed=model_elapsed,
559
+ run_id=run_id,
560
+ reward_mode=reward_mode,
561
+ gym_version=gym_version,
562
+ )
563
+
564
+ return {
565
+ "model": model,
566
+ "results": results,
567
+ "elapsed": model_elapsed,
568
+ }
569
+
570
+
571
+ def _check_label(check: dict) -> str:
572
+ for key in ("min_score", "min_pct", "max_hits"):
573
+ if key in check and key != "type":
574
+ return str(check[key])
575
+ return check.get("type", "?")
576
+
577
+
578
+ def _short_json(obj, max_len=80):
579
+ s = json.dumps(obj, default=str)
580
+ return s if len(s) <= max_len else s[:max_len] + "..."
581
+
582
+
583
+ def main():
584
+ parser = argparse.ArgumentParser(
585
+ description="Evaluate an LLM agent against Spreadsheet gym scenarios.",
586
+ formatter_class=argparse.RawDescriptionHelpFormatter,
587
+ epilog="""
588
+ Examples:
589
+ python run_eval.py --model gpt-5.4 --save --trajectory
590
+ python run_eval.py --model gpt-5.4,claude-sonnet-4-6 --parallel 2 --reward-mode openenv
591
+ python run_eval.py --model gpt-5.4 --scenario formula_repair_01
592
+ """,
593
+ )
594
+ parser.add_argument(
595
+ "--model",
596
+ default=os.getenv("LLM_MODEL", "gpt-4o"),
597
+ help="LiteLLM model string, or comma-separated for parallel mode "
598
+ "(e.g., 'gpt-5.4' or 'gpt-5.4,claude-sonnet-4-6')",
599
+ )
600
+ parser.add_argument(
601
+ "--scenario",
602
+ default=None,
603
+ help="Run a specific scenario by ID (default: run all 12)",
604
+ )
605
+ parser.add_argument(
606
+ "--temperature",
607
+ type=float,
608
+ default=float(os.getenv("LLM_TEMPERATURE", "0.0")),
609
+ help="LLM sampling temperature (default: 0.0)",
610
+ )
611
+ parser.add_argument(
612
+ "--max-tokens",
613
+ type=int,
614
+ default=int(os.getenv("LLM_MAX_TOKENS", "1024")),
615
+ help="Max tokens per LLM response (default: 1024)",
616
+ )
617
+ parser.add_argument(
618
+ "--save",
619
+ action="store_true",
620
+ help="Save results to outputs/results/<run_id>.md",
621
+ )
622
+ parser.add_argument(
623
+ "--trajectory",
624
+ action="store_true",
625
+ help="Save detailed trajectory JSON to outputs/trajectories/<run_id>/",
626
+ )
627
+ parser.add_argument(
628
+ "--run-id",
629
+ default=None,
630
+ help="Run identifier (default: auto-generated as run_YYYYMMDD_HHMM)",
631
+ )
632
+ parser.add_argument(
633
+ "--reward-mode",
634
+ default="custom",
635
+ choices=["custom", "openenv"],
636
+ help="Reward mode: 'custom' (episode-level) or 'openenv' (per-step). Default: custom",
637
+ )
638
+ parser.add_argument(
639
+ "--parallel",
640
+ type=int,
641
+ default=1,
642
+ help="Number of models to evaluate in parallel (default: 1 = sequential)",
643
+ )
644
+ parser.add_argument(
645
+ "--verbose", "-v",
646
+ action="store_true",
647
+ help="Enable debug logging",
648
+ )
649
+
650
+ args = parser.parse_args()
651
+
652
+ models = [m.strip() for m in args.model.split(",") if m.strip()]
653
+
654
+ if args.run_id:
655
+ run_id = args.run_id
656
+ else:
657
+ run_id = f"run_{datetime.now(IST).strftime('%Y%m%d_%H%M')}"
658
+
659
+ log_level = logging.DEBUG if args.verbose else logging.INFO
660
+ logging.basicConfig(
661
+ level=log_level,
662
+ format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
663
+ datefmt="%H:%M:%S",
664
+ )
665
+
666
+ base_url = _resolve_base_url()
667
+
668
+ scenarios = SPREADSHEET_SCENARIOS
669
+ if args.scenario:
670
+ scenarios = [s for s in scenarios if s.id == args.scenario]
671
+ if not scenarios:
672
+ available = [s.id for s in SPREADSHEET_SCENARIOS]
673
+ print(f"Error: Scenario '{args.scenario}' not found. Available: {available}")
674
+ sys.exit(1)
675
+
676
+ divider("AutoEnv Discovery")
677
+ print(f" Discovering gym '{GYM_NAME}' via AutoEnv...")
678
+ env_info = AutoEnv.get_env_info(GYM_NAME)
679
+ print(f" Found: {env_info['name']} (package: {env_info['package']}, v{env_info['version']})")
680
+ print(f" Base URL: {base_url} (auto-derived from openenv.yaml)")
681
+
682
+ gym_metadata = _fetch_gym_metadata(base_url)
683
+ if gym_metadata:
684
+ print(f"\n -- Environment Metadata (GET {base_url}/metadata) --")
685
+ print(f" Name: {gym_metadata.get('name', 'N/A')}")
686
+ print(f" Version: {gym_metadata.get('version', 'N/A')}")
687
+ print(f" Description: {gym_metadata.get('description', 'N/A')}")
688
+ else:
689
+ print(f"\n Warning: Could not fetch /metadata from {base_url} (server may not be running)")
690
+
691
+ is_parallel = args.parallel > 1 and len(models) > 1
692
+ mode_str = f"Parallel ({args.parallel} workers)" if is_parallel else "Sequential"
693
+ gym_version = gym_metadata.get("version", "unknown") if gym_metadata else "unknown"
694
+
695
+ divider("LLM Evaluation Run")
696
+ print(f" Gym: {GYM_NAME} (v{gym_version})")
697
+ print(f" Models: {', '.join(models)}")
698
+ print(f" Run ID: {run_id}")
699
+ print(f" Mode: {mode_str}")
700
+ print(f" Base URL: {base_url}")
701
+ print(f" Scenarios: {len(scenarios)} of {len(SPREADSHEET_SCENARIOS)}")
702
+ print(f" Temperature: {args.temperature}")
703
+ print(f" Reward Mode: {args.reward_mode}")
704
+     print(f" Output Dir: {OUTPUT_DIR}")
+
+     total_start = time.time()
+     all_model_results = []
+
+     if is_parallel:
+         divider(f"Parallel Evaluation ({len(models)} models, {args.parallel} workers)")
+
+         max_workers = min(args.parallel, len(models))
+         with ThreadPoolExecutor(max_workers=max_workers) as executor:
+             futures = {}
+             for idx, model in enumerate(models):
+                 if idx > 0:
+                     time.sleep(3)
+                 future = executor.submit(
+                     _run_single_model,
+                     model=model,
+                     base_url=base_url,
+                     scenarios=scenarios,
+                     temperature=args.temperature,
+                     max_tokens=args.max_tokens,
+                     reward_mode=args.reward_mode,
+                     run_id=run_id,
+                     save=args.save,
+                     trajectory=args.trajectory,
+                     verbose=args.verbose,
+                     gym_version=gym_version,
+                 )
+                 futures[future] = model
+
+             for future in as_completed(futures):
+                 model = futures[future]
+                 try:
+                     result = future.result()
+                     all_model_results.append(result)
+                     print(f"\n {model} completed in {result['elapsed']:.1f}s")
+                 except Exception as e:
+                     print(f"\n {model} FAILED: {e}")
+                     logger.exception(f"Model {model} failed")
+                     all_model_results.append({
+                         "model": model,
+                         "results": [],
+                         "elapsed": 0.0,
+                         "error": str(e),
+                     })
+     else:
+         for model in models:
+             if len(models) > 1:
+                 divider(f"Model: {model}")
+
+             if len(models) == 1:
+                 result = _run_single_model_detailed(
+                     model=model,
+                     base_url=base_url,
+                     scenarios=scenarios,
+                     temperature=args.temperature,
+                     max_tokens=args.max_tokens,
+                     reward_mode=args.reward_mode,
+                     run_id=run_id,
+                     save=args.save,
+                     trajectory=args.trajectory,
+                     gym_version=gym_version,
+                 )
+             else:
+                 result = _run_single_model(
+                     model=model,
+                     base_url=base_url,
+                     scenarios=scenarios,
+                     temperature=args.temperature,
+                     max_tokens=args.max_tokens,
+                     reward_mode=args.reward_mode,
+                     run_id=run_id,
+                     save=args.save,
+                     trajectory=args.trajectory,
+                     verbose=args.verbose,
+                     gym_version=gym_version,
+                 )
+             all_model_results.append(result)
+
+     total_elapsed = time.time() - total_start
+     divider("Evaluation Summary")
+
+     for mr in all_model_results:
+         model = mr["model"]
+         results = mr.get("results", [])
+         model_elapsed = mr.get("elapsed", 0.0)
+
+         if not results:
+             print(f"\n Model: {model} -- FAILED ({mr.get('error', 'unknown')})")
+             continue
+
+         total_reward = sum(r["total_reward"] for r in results)
+         avg_reward = total_reward / len(results) if results else 0.0
+
+         print(f"\n Model: {model}")
+         print(f" Time: {model_elapsed:.1f}s")
+         print(f" {'Scenario':<35} {'Reward':>8} {'Steps':>6} {'Time':>6}")
+         print(f" {'-' * 35} {'-' * 8} {'-' * 6} {'-' * 6}")
+
+         for r in results:
+             reward_str = f"{r['total_reward']:.2f}" if r.get("breakdown") else "ERROR"
+             print(f" {r['scenario']:<35} {reward_str:>8} {r['steps']:>6} {r['elapsed']:>5.1f}s")
+
+         print(f" {'-' * 35} {'-' * 8} {'-' * 6} {'-' * 6}")
+         print(f" {'AVERAGE':<35} {avg_reward:>8.2f}")
+
+     if len(models) > 1:
+         print(f"\n Total time (all models): {total_elapsed:.1f}s")
+         if is_parallel:
+             seq_time = sum(mr.get("elapsed", 0.0) for mr in all_model_results)
+             speedup = seq_time / total_elapsed if total_elapsed > 0 else 1.0
+             print(f" Sequential equivalent: {seq_time:.1f}s")
+             print(f" Speedup: {speedup:.1f}x")
+
+
+ if __name__ == "__main__":
+     main()
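The parallel branch above fans each model out to a worker thread and collects results as they finish. A minimal self-contained sketch of that submit/`as_completed` pattern (the stub worker and model names are illustrative, standing in for `_run_single_model`):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_model(model: str) -> dict:
    # Stub worker; the real _run_single_model drives the env for every scenario.
    start = time.time()
    return {"model": model, "results": [{"total_reward": 1.0}], "elapsed": time.time() - start}

models = ["model-a", "model-b", "model-c"]
all_model_results = []
with ThreadPoolExecutor(max_workers=min(2, len(models))) as executor:
    # Map each future back to its model so failures can be attributed.
    futures = {executor.submit(run_model, m): m for m in models}
    for future in as_completed(futures):
        model = futures[future]
        try:
            all_model_results.append(future.result())
        except Exception as e:
            # A crashed worker is recorded, not fatal to the whole batch.
            all_model_results.append({"model": model, "results": [], "error": str(e)})

print(sorted(r["model"] for r in all_model_results))
```

Keeping the `futures[future] = model` mapping is what lets the summary report a per-model failure instead of aborting the whole run.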
scenarios/__init__.py ADDED
@@ -0,0 +1,3 @@
+ from .definitions import SPREADSHEET_SCENARIOS
+
+ __all__ = ["SPREADSHEET_SCENARIOS"]
scenarios/definitions.py ADDED
@@ -0,0 +1,69 @@
+ """Scenario loader for the Spreadsheet gym (used by run_eval.py)."""
+
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+
+ from ..rewards.base import Scenario
+
+ _PKG_ROOT = Path(__file__).resolve().parent.parent
+ _SCENARIOS_DIR = _PKG_ROOT / "scenarios"
+ _HIDDEN_TESTS_DIR = _PKG_ROOT / "workbooks" / "hidden_tests"
+
+ _BASE_TOOLS = ["load_scenario", "list_sheets", "read_range", "submit_workbook"]
+
+ _CATEGORY_TOOLS: dict[str, list[str]] = {
+     "formula_repair": _BASE_TOOLS + ["inspect_formula", "write_cell", "validate_partial"],
+     "cross_sheet_lookup": _BASE_TOOLS + ["write_cell", "write_range"],
+     "messy_table_extraction": _BASE_TOOLS + ["write_range"],
+     "schedule_grid_fill": _BASE_TOOLS + ["write_cell", "write_range", "validate_partial"],
+     "ledger_reconciliation": _BASE_TOOLS + ["write_range"],
+     "range_transformation": _BASE_TOOLS + ["write_range"],
+     "conditional_aggregation": _BASE_TOOLS + ["write_range"],
+     "buggy_template_fix": _BASE_TOOLS + ["inspect_formula", "write_cell", "write_range", "validate_partial"],
+ }
+
+ _DEFAULT_TOOLS = _BASE_TOOLS + ["write_cell", "write_range", "validate_partial"]
+
+
+ def _load_scenarios_from_json() -> list[Scenario]:
+     if not _SCENARIOS_DIR.is_dir():
+         return []
+
+     scenarios = []
+     for f in sorted(_SCENARIOS_DIR.glob("*.json")):
+         data = json.loads(f.read_text(encoding="utf-8"))
+         sid = data.get("id", f.stem)
+         prompt = data.get("instructions", data.get("description", ""))
+         if not prompt.lower().startswith("load scenario"):
+             prompt = f"Load scenario '{sid}'. {prompt}"
+
+         category = data.get("category", "")
+         expected_tools = _CATEGORY_TOOLS.get(category, _DEFAULT_TOOLS)
+
+         outcome_checks = []
+         hidden_test_path = _HIDDEN_TESTS_DIR / f"{sid}.json"
+         if hidden_test_path.is_file():
+             ht = json.loads(hidden_test_path.read_text(encoding="utf-8"))
+             checks = ht.get("checks", [])
+             outcome_checks.append({
+                 "type": "hidden_test_pass_rate",
+                 "total_checks": len(checks),
+                 "min_pass_rate": 0.5,
+             })
+
+         scenarios.append(Scenario(
+             id=sid,
+             prompt=prompt,
+             expected_tools=expected_tools,
+             max_steps=data.get("max_steps", 50),
+             outcome_checks=outcome_checks,
+         ))
+
+     return scenarios
+
+
+ SPREADSHEET_SCENARIOS = _load_scenarios_from_json()
+
+ __all__ = ["SPREADSHEET_SCENARIOS"]
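The loader above prepends a load instruction only when a scenario's text does not already begin with one. A standalone sketch of that normalization step (the function name is hypothetical; the real logic lives inline in `_load_scenarios_from_json`):

```python
def normalize_prompt(sid: str, prompt: str) -> str:
    # Mirror of the prefixing rule: case-insensitive check, then prepend.
    if not prompt.lower().startswith("load scenario"):
        prompt = f"Load scenario '{sid}'. {prompt}"
    return prompt

print(normalize_prompt("formula_repair_01", "Fix the broken SUM formula."))
# -> Load scenario 'formula_repair_01'. Fix the broken SUM formula.
```

Because the check is case-insensitive, prompts that already open with "Load scenario ..." in any casing pass through unchanged, so the prefix is never duplicated.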
server/Dockerfile ADDED
@@ -0,0 +1,41 @@
+ ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+ FROM ${BASE_IMAGE} AS builder
+
+ RUN apt-get update && \
+     apt-get install -y --no-install-recommends git curl && \
+     rm -rf /var/lib/apt/lists/*
+
+ WORKDIR /app
+ COPY . /app/env
+ WORKDIR /app/env
+
+ RUN if ! command -v uv >/dev/null 2>&1; then \
+         curl -LsSf https://astral.sh/uv/install.sh | sh && \
+         mv /root/.local/bin/uv /usr/local/bin/uv; \
+     fi
+
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then uv sync --frozen --no-install-project --no-editable; \
+     else uv sync --no-install-project --no-editable; fi
+
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then uv sync --frozen --no-editable; \
+     else uv sync --no-editable; fi
+
+ FROM ${BASE_IMAGE}
+ WORKDIR /app
+ COPY --from=builder /app/env/.venv /app/.venv
+ COPY --from=builder /app/env /app/env
+
+ ENV PATH="/app/.venv/bin:$PATH"
+ ENV PYTHONPATH="/app/env:$PYTHONPATH"
+ ENV ENABLE_WEB_INTERFACE=true
+ ENV WORKBOOKS_DIR=/app/env/workbooks
+ ENV SCENARIOS_DIR=/app/env/scenarios
+
+ EXPOSE 8000
+
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 \
+     CMD curl -sf http://localhost:8000/health || exit 1
+
+ CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
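Assuming the repository root is the build context (the image tag and port mapping here are illustrative), a typical build-and-run sequence for this Dockerfile might be:

```shell
# Build from the repo root, pointing at this Dockerfile (tag is illustrative).
docker build -f server/Dockerfile -t spreadsheet-env .

# Run with the server port published; the HEALTHCHECK polls /health inside the container.
docker run --rm -p 8000:8000 spreadsheet-env

# From another shell, hit the same endpoint the health check uses.
curl -sf http://localhost:8000/health
```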
server/app.py CHANGED
@@ -6,6 +6,9 @@ import os
  import sys
  from pathlib import Path

+ from dotenv import load_dotenv
+ from fastapi.middleware.cors import CORSMiddleware
+
  try:
      from openenv.core.env_server.http_server import create_app
  except ImportError as e:
@@ -13,6 +16,76 @@ except ImportError as e:
          "openenv is required. Install with: uv sync"
      ) from e

+ load_dotenv(os.path.join(os.path.dirname(__file__), "..", ".env"))
+
+ import openenv.core.env_server.web_interface as _wi  # noqa: E402
+
+ _wi.DEFAULT_QUICK_START_MARKDOWN = """
+ ### How to use this environment
+
+ **Spreadsheet** — exact workbook manipulation and reasoning over realistic spreadsheet tasks. Read sheets, understand structure, write values/formulas, and submit for automated evaluation.
+
+ Use the **Playground** on the right. Type a **Tool Name** and **Arguments Json**, then click **Step**.
+
+ ---
+
+ #### 1. Start a session
+
+ 1. Click **Reset**
+ 2. `list_tools` → `{}` — discover all 13 tools & their params
+ 3. `list_scenarios` → `{}` — see all 12 scenarios
+ 4. `load_scenario` → `{"scenario_id": "formula_repair_01"}`
+
+ #### 2. Explore the workbook
+
+ - `list_sheets` → `{}` — sheet names, dimensions, visibility
+ - `read_range` → `{"sheet": "Summary", "range": "A1:F10"}` — read cells
+ - `inspect_formula` → `{"sheet": "Summary", "cell": "C5"}` — raw formula string
+ - `list_named_targets` → `{}` — allowed output zones
+ - `get_session_info` → `{}` — session metadata, step count
+ - `get_edit_history` → `{}` — all edits so far
+
+ > **Note:** `read_range` uses **A1 notation** (e.g. `"B2:D10"`). Formulas are returned as strings.
+
+ #### 3. Edit cells
+
+ - `write_cell` → `{"sheet": "Summary", "cell": "C5", "value": "=SUM(B2:B10)"}` — write one cell
+ - `write_range` → `{"sheet": "Summary", "start_cell": "A1", "data": "[[1, 2], [3, 4]]"}` — write a block
+
+ > **Note:** `write_range` uses **start_cell** (not `cell`). The `data` arg is a JSON string of a 2D array.
+
+ > Values starting with `=` are treated as formulas. Numeric strings are auto-converted.
+
+ #### 4. Validate & submit
+
+ - `validate_partial` → `{}` — check progress (pass/fail count, no answers revealed)
+ - `submit_workbook` → `{}` — final evaluation (pass rate + per-check results)
+ - `reset_scenario` → `{}` — restore workbook to original (scenario stays loaded)
+
+ > Use `validate_partial` before `submit_workbook` to gauge progress without ending the task.
+
+ ---
+
+ #### Scenarios (12)
+
+ `formula_repair_01` · `formula_repair_02` · `cross_sheet_lookup_01` · `cross_sheet_lookup_02` · `conditional_aggregation_01` · `conditional_aggregation_02` · `ledger_reconciliation_01` · `ledger_reconciliation_02` · `messy_table_extraction_01` · `range_transformation_01` · `schedule_grid_fill_01` · `buggy_template_fix_01`
+
+ #### Connect from Python
+
+ ```python
+ from spreadsheet import SpreadsheetAction, SpreadsheetEnv
+
+ env = SpreadsheetEnv(base_url="http://localhost:8000")
+ obs = env.reset()
+ obs = await env.step(SpreadsheetAction(
+     tool_name="load_scenario",
+     arguments_json='{"scenario_id": "formula_repair_01"}'
+ ))
+ ```
+
+ For more, see the [OpenEnv documentation](https://meta-pytorch.org/OpenEnv/).
+ """
+
  try:
      from spreadsheet.models import SpreadsheetAction, SpreadsheetObservation
      from spreadsheet.server.spreadsheet_environment import SpreadsheetEnvironment
@@ -31,6 +104,13 @@ app = create_app(
      max_concurrent_envs=MAX_CONCURRENT_ENVS,
  )

+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+

  def main(host: str = "0.0.0.0", port: int = 8000):
      import uvicorn
workbooks/fixtures/13597ec4-95ae-4293-a2d1-aec276ac80e9_sales_commission.xlsx ADDED
Binary file (7.79 kB).
 
workbooks/fixtures/6dd7822d-39b9-4134-80ad-b7e653ad9944_product_revenue_by_region.xlsx ADDED
Binary file (11.8 kB).
 
workbooks/fixtures/8df5e07f-7a7d-4911-86bd-2e102df0cc7b_multi_department_budget.xlsx ADDED
Binary file (9.1 kB).