---
title: Financial Task Environment
emoji: 📊
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Financial Task Environment

An [OpenEnv](https://github.com/meta-pytorch/OpenEnv) **code-execution environment** for training and evaluating AI agents on **real-world finance & accounting spreadsheet tasks**. Agents write Python code (using `openpyxl`) to read, analyze, and modify authentic Excel workbooks from enterprise workflows.

## Motivation

Finance professionals spend hundreds of hours on spreadsheet-centric tasks: extracting values, computing ratios, auditing formulas, entering data, building scenarios, and consolidating reports. This environment provides 10 diverse tasks backed by real `.xlsx` files so agents can be trained and evaluated on the same kind of work.

## How It Works

1. **Reset** with a `task_id` → receive task instructions + xlsx file path + a summary of the spreadsheet contents.
2. **Execute code** (`action_type="code"`) → run Python code that reads or modifies the xlsx. The environment returns stdout/stderr.
3. **Submit** a text answer (`action_type="submit"` for QA tasks) or a modified file (`action_type="submit_file"` for MODIFY tasks).
4. The environment **grades** the submission: QA answers are scored by numeric matching + keyword overlap; MODIFY tasks are scored by cell-level comparison against a reference workbook.
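The action sequence for a short QA episode can be sketched as follows. This is a minimal illustration, not the environment's client API: `SketchAction`, `qa_episode`, the path `/tmp/task_1.xlsx`, and the answer `"42"` are hypothetical placeholders, and only the `action_type` values and the `content` field come from this README. The `openpyxl` snippet is built as a string (the code an agent would send), so the sketch itself does not need `openpyxl` installed.

```python
from dataclasses import dataclass

# Action types named in this README.
VALID_ACTION_TYPES = {"code", "submit", "submit_file"}


@dataclass(frozen=True)
class SketchAction:
    """Hypothetical stand-in for the environment's action model."""

    action_type: str
    content: str

    def __post_init__(self) -> None:
        # Reject anything other than the three documented action types.
        if self.action_type not in VALID_ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")


def qa_episode(xlsx_path: str, answer: str) -> list[SketchAction]:
    """Build a two-step QA episode: explore the workbook, then submit."""
    # Code the agent would send as `content`; it is not executed here.
    explore = (
        "import openpyxl\n"
        f"wb = openpyxl.load_workbook({xlsx_path!r}, data_only=True)\n"
        "for ws in wb.worksheets:\n"
        "    print(ws.title, ws.max_row, ws.max_column)\n"
    )
    return [
        SketchAction("code", explore),   # step 2: execute code
        SketchAction("submit", answer),  # step 3: submit a text answer
    ]


# Placeholder path and answer, for illustration only.
actions = qa_episode("/tmp/task_1.xlsx", "42")
for action in actions:
    print(action.action_type)
```

A MODIFY episode would follow the same shape, ending with a `submit_file` action whose `content` is the path of the saved workbook.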

## Tasks (10 total)

| # | Task ID | Title | Difficulty | Type | Category |
|---|---------|-------|------------|------|----------|
| 1 | `task_1` | Count Plants in Spreadsheet | Easy | QA | Calculation |
| 2 | `task_2` | Retrieve TW EOL Charge | Easy | QA | Cross-sheet Retrieval |
| 3 | `task_3` | Portfolio Mark-to-Market Change | Easy | QA | Calculation |
| 4 | `task_4` | Summarize Pipeline Imbalances | Medium | MODIFY | Calculation |
| 5 | `task_5` | Audit and Correct Formula Errors | Medium | MODIFY | Validation / Review |
| 6 | `task_6` | Create Table and Apply Filter | Medium | MODIFY | Structuring / Formatting |
| 7 | `task_7` | Add Weekday Row and Data Entry | Medium | MODIFY | Data Entry / Import |
| 8 | `task_8` | Balance Sheet Validation & Indicators | Hard | MODIFY | Validation, Calculation |
| 9 | `task_9` | Create Scenario3 Worksheet | Hard | MODIFY | Financial Modeling |
| 10 | `task_10` | Consolidate by Type and Area | Hard | MODIFY | Multi-type |

### Difficulty Progression

- **Easy (3 tasks):** QA. Read the spreadsheet and answer a question.
- **Medium (4 tasks):** MODIFY. Edit or augment the workbook (summaries, audits, formatting, data entry).
- **Hard (3 tasks):** MODIFY. Complex multi-sheet operations (validation, new scenario sheets, consolidation).

## Action & Observation Spaces

### Action: `FinancialAction`

| Field | Type | Description |
|-------|------|-------------|
| `action_type` | `str` | `"code"` to execute Python, `"submit"` for text answer, `"submit_file"` for xlsx |
| `content` | `str` | Python code, text answer, or file path |

### Observation: `FinancialObservation`

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | `str` | Current task identifier |
| `task_description` | `str` | Full task instructions + xlsx summary |
| `source_file` | `str` | Path to the working xlsx copy |
| `difficulty` | `str` | `easy`, `medium`, or `hard` |
| `task_type` | `str` | `QA` or `MODIFY` |
| `feedback` | `str` | Code output or grading result |
| `current_step` | `int` | Current step (max 15) |
| `done` | `bool` | Whether the episode is finished |
| `reward` | `float` | Reward for this step (0.0–1.0) |

## Reward Design

| Action | Reward | Signal |
|--------|--------|--------|
| `code` (failed) | 0.005 | Penalized: syntax/runtime error |
| `code` (simple) | ~0.02 | Minimal: just imports and a print |
| `code` (exploration) | ~0.05 | Good: reading data, producing output |
| `code` (modification + save) | ~0.06–0.10 | Best: actively editing the workbook |
| `submit` / `submit_file` | 0.001–0.999 | Full grading against reference |
| Max steps (15) | Episode ends | |

Code step rewards are computed from:

- **Execution success:** failed code gets only 0.005
- **Substantive lines:** lines beyond imports/comments earn +0.002 each (up to +0.03)
- **Output produced:** printing data earns +0.001 per line (up to +0.02)
- **Save operations:** calling `.save()` earns +0.03 (agent is modifying the workbook)

**QA grading:** Numeric extraction with 5% tolerance + keyword overlap.

**MODIFY grading:** 30% sheet-name match + 70% cell-level comparison (2% numeric tolerance).

All scores are clamped to the closed interval [0.001, 0.999], so a reported score is never exactly 0 or 1.
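The arithmetic behind these rules can be sketched as below. This is a hedged illustration rather than the code in `graders.py`: all function names are hypothetical, the tolerance is assumed to be relative to the expected value, and the base reward for a successful code step is not stated in this README, so only the capped bonuses are modeled.

```python
def clamp_score(score: float, lo: float = 0.001, hi: float = 0.999) -> float:
    """Clamp a raw score to the 0.001 and 0.999 bounds quoted in the README."""
    return max(lo, min(hi, score))


def numbers_match(predicted: float, expected: float, rel_tol: float) -> bool:
    """Relative-tolerance check (assumption: tolerance is relative to `expected`)."""
    if expected == 0:
        return abs(predicted) <= rel_tol
    return abs(predicted - expected) <= rel_tol * abs(expected)


def modify_score(sheet_match: float, cell_match: float) -> float:
    """Weight sheet-name match at 30% and cell-level match at 70%, then clamp."""
    return clamp_score(0.3 * sheet_match + 0.7 * cell_match)


def code_step_bonus(substantive_lines: int, output_lines: int, saved: bool) -> float:
    """Sum the three capped bonuses for a successful code step.

    The unspecified base reward for successful execution is deliberately
    left out; add it separately if you know its value.
    """
    line_bonus = min(0.002 * substantive_lines, 0.03)
    output_bonus = min(0.001 * output_lines, 0.02)
    save_bonus = 0.03 if saved else 0.0
    return line_bonus + output_bonus + save_bonus


# QA uses a 5% tolerance, MODIFY cell comparison uses 2%.
print(numbers_match(102.0, 100.0, rel_tol=0.05))
print(numbers_match(102.5, 100.0, rel_tol=0.02))
print(modify_score(sheet_match=1.0, cell_match=0.5))
```

Under these assumptions the three bonuses can add at most 0.08 to a code step, which is consistent with the "modification + save" range being the highest in the table.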

## Setup & Usage

### Prerequisites

- Python 3.10+
- Docker (for containerized deployment)
- `pip install openenv-core openpyxl`

### Local Development

```bash
pip install -e ".[dev]"
PYTHONPATH=. uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
```

### Docker

```bash
docker build -t financial-task-env:latest .
docker run -p 8000:8000 financial-task-env:latest
```

### Baseline Inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export HF_TOKEN="your-api-key"
export ENV_URL="http://localhost:8000"
python inference.py
```

## Baseline Scores

The environment includes 10 tasks, but the baseline inference runs 5 representative tasks (3 easy + 1 medium + 1 hard) to stay within the 20-minute runtime constraint.

**Model:** `MiniMaxAI/MiniMax-M2.5` via HuggingFace Router

| Task | Difficulty | Type | Score | Step Rewards |
|------|------------|------|-------|-------------|
| task_1: Count Plants | Easy | QA | 0.001 | 0.05, 0.06, 0.06, 0.06, 0.00 |
| task_2: Retrieve EOL Charge | Easy | QA | 0.001 | 0.04, 0.01, 0.07, 0.06, 0.02, 0.00 |
| task_3: Portfolio MTM Change | Easy | QA | 0.367 | 0.06, 0.01, 0.07, ..., 0.37 |
| task_5: Audit Formulas | Medium | MODIFY | **0.958** | 0.07, 0.01, 0.07, ..., 0.96 |
| task_8: Balance Sheet Validation | Hard | MODIFY | 0.001 | 0.06, 0.01, 0.06, ..., 0.05 |
| **Average** | | | **0.266** | |

**Runtime:** 12 min 10 sec (limit: 20 min) · **Server memory:** ~40 MB (limit: 8 GB)

Note: Step rewards vary based on code quality: failed code gets 0.005, exploration ~0.05, modification+save ~0.06–0.10.
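The **Average** row is consistent with an unweighted mean of the five task scores. A quick check, with the scores copied from the MiniMax-M2.5 table (the variable names are illustrative only):

```python
from statistics import mean

# Final scores from the MiniMax-M2.5 baseline table.
run_scores = {
    "task_1": 0.001,
    "task_2": 0.001,
    "task_3": 0.367,
    "task_5": 0.958,
    "task_8": 0.001,
}

# Unweighted mean, rounded to the three decimals used in the tables.
average = round(mean(list(run_scores.values())), 3)
print(average)  # 0.266
```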

### Run 2: `google/gemma-4-26B-A4B-it`

| Task | Difficulty | Type | Score |
|------|------------|------|-------|
| task_1: Count Plants | Easy | QA | 0.001 |
| task_2: Retrieve EOL Charge | Easy | QA | **0.999** |
| task_3: Portfolio MTM Change | Easy | QA | 0.001 |
| task_5: Audit Formulas | Medium | MODIFY | 0.001 |
| task_8: Balance Sheet Validation | Hard | MODIFY | 0.001 |
| **Average** | | | **0.201** |

**Runtime:** 19 min 27 sec (limit: 20 min) · **Server memory:** ~40 MB

Gemma 4 26B solved task_2 perfectly in just 2 steps but timed out on more complex tasks due to longer generation times.

### Run 3: `Qwen/Qwen3.5-122B-A10B`

| Task | Difficulty | Type | Score |
|------|------------|------|-------|
| task_1: Count Plants | Easy | QA | 0.001 |
| task_2: Retrieve EOL Charge | Easy | QA | **0.999** |
| task_3: Portfolio MTM Change | Easy | QA | 0.001 |
| task_5: Audit Formulas | Medium | MODIFY | 0.001 |
| task_8: Balance Sheet Validation | Hard | MODIFY | 0.001 |
| **Average** | | | **0.201** |

**Runtime:** 2 min 11 sec · Fast inference but hit per-task timeout on complex tasks.

### Run 4: `deepseek-ai/DeepSeek-R1`

| Task | Difficulty | Type | Score |
|------|------------|------|-------|
| task_1: Count Plants | Easy | QA | 0.001 |
| task_2: Retrieve EOL Charge | Easy | QA | 0.001 |
| task_3: Portfolio MTM Change | Easy | QA | 0.001 |
| task_5: Audit Formulas | Medium | MODIFY | 0.001 |
| task_8: Balance Sheet Validation | Hard | MODIFY | 0.001 |
| **Average** | | | **0.001** |

**Runtime:** 11 min 57 sec · DeepSeek-R1's long chain-of-thought reasoning consumed most of the output tokens, leaving answers that didn't parse correctly.

### Run 5: `MiniMaxAI/MiniMax-M2.1` (Best)

| Task | Difficulty | Type | Score | Steps |
|------|------------|------|-------|-------|
| task_1: Count Plants | Easy | QA | 0.001 | 5 |
| task_2: Retrieve EOL Charge | Easy | QA | **0.999** | 4 |
| task_3: Portfolio MTM Change | Easy | QA | 0.001 | 10 |
| task_5: Audit Formulas | Medium | MODIFY | **0.958** | 4 |
| task_8: Balance Sheet Validation | Hard | MODIFY | **0.733** | 10 |
| **Average** | | | **0.538** | |

**Runtime:** 3 min 18 sec · Best overall performance: scored highly on 3/5 tasks, including the hard MODIFY task (0.733), with the second-shortest runtime of the five runs.

### Model Comparison Summary

| Model | Avg Score | Runtime | Best Task |
|-------|-----------|---------|-----------|
| **MiniMax-M2.1** | **0.538** | **3m 18s** | task_5: 0.958, task_8: 0.733 |
| MiniMax-M2.5 | 0.266 | 12m 10s | task_5: 0.958 |
| Gemma 4 26B | 0.201 | 19m 27s | task_2: 0.999 |
| Qwen 3.5 122B | 0.201 | 2m 11s | task_2: 0.999 |
| DeepSeek-R1 | 0.001 | 11m 57s | - |

## Project Structure

```
financial_task_env/
├── __init__.py                   # Module exports
├── models.py                     # FinancialAction & FinancialObservation
├── tasks.py                      # 10 task definitions + xlsx paths
├── graders.py                    # QA grading + xlsx cell comparison
├── client.py                     # FinancialTaskEnv (EnvClient)
├── inference.py                  # Baseline inference script
├── openenv.yaml                  # OpenEnv manifest
├── pyproject.toml                # Dependencies
├── Dockerfile                    # Container image
├── data/                         # xlsx source & reference files
│   ├── 0/                        # Balance sheet validation
│   ├── 21/                       # Data entry
│   ├── 24/                       # Scenario modeling
│   ├── 34/                       # Portfolio calculation
│   ├── 35/                       # Pipeline imbalances
│   ├── 40/                       # Formula audit
│   ├── 60/                       # Table formatting
│   ├── 67/                       # Consolidation
│   ├── 118/                      # Value retrieval
│   └── 119/                      # Plant counting
└── server/
    ├── __init__.py
    ├── financial_environment.py  # Code-execution environment
    ├── app.py                    # FastAPI application
    └── Dockerfile
```

## Environment Description

This environment models real financial spreadsheet work:

- **Data extraction:** read values from complex multi-sheet workbooks
- **Calculation:** compute portfolio changes, imbalances, indicators
- **Validation:** audit and fix formula errors in workbooks
- **Data entry:** add rows, enter values, format new columns
- **Structuring:** create tables, apply filters, build new worksheets
- **Financial modeling:** replicate scenario sheets with new parameters
- **Consolidation:** aggregate data across sheets into summary views

Each task uses a genuine enterprise Excel workbook. MODIFY tasks are graded by comparing the submitted workbook's sheet names and cell values against a reference workbook.

## Acknowledgments

The spreadsheet tasks and reference workbooks used in this environment are sourced from the **FinWorkBench (Finch)** dataset. If you use this environment in your research, please cite:

```bibtex
@article{dong2025finch,
  title={Finch: Benchmarking Finance \& Accounting across Spreadsheet-Centric Enterprise Workflows},
  author={Dong, Haoyu and Zhang, Pengkun and Gao, Yan and Dong, Xuanyu and Cheng, Yilin and Lu, Mingzhe and Yakefu, Adina and Zheng, Shuxin},
  journal={arXiv preprint arXiv:2512.13168},
  year={2025}
}
```