| title: Financial Task Environment | |
| emoji: π | |
| colorFrom: green | |
| colorTo: blue | |
| sdk: docker | |
| pinned: false | |
| app_port: 8000 | |
| base_path: /web | |
| tags: | |
| - openenv | |
| # Financial Task Environment | |
| An [OpenEnv](https://github.com/meta-pytorch/OpenEnv) **code-execution | |
| environment** for training and evaluating AI agents on **real-world finance | |
| & accounting spreadsheet tasks**. Agents write Python code (using | |
| `openpyxl`) to read, analyze, and modify authentic Excel workbooks from | |
| enterprise workflows. | |
| ## Motivation | |
| Finance professionals spend hundreds of hours on spreadsheet-centric tasks: | |
| extracting values, computing ratios, auditing formulas, entering data, building | |
| scenarios, and consolidating reports. This environment provides 10 diverse | |
| tasks backed by real `.xlsx` files so agents can be trained and evaluated on | |
| the same kind of work. | |
| ## How It Works | |
| 1. **Reset** with a `task_id` → receive the task instructions, the xlsx file path, and a | |
| summary of the spreadsheet contents. | |
| 2. **Execute code** (`action_type="code"`) → run Python code that reads or | |
| modifies the xlsx. The environment returns stdout/stderr. | |
| 3. **Submit** a text answer (`action_type="submit"` for QA tasks) or a modified | |
| file (`action_type="submit_file"` for MODIFY tasks). | |
| 4. The environment **grades** the submission: QA answers are scored by numeric | |
| matching + keyword overlap; MODIFY tasks are scored by cell-level comparison | |
| against a reference workbook. | |
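The four steps above can be sketched as a small driver loop. This is only an illustration: `Obs`, `Env`, `FakeEnv`, `run_episode`, and `explore_then_submit` are hypothetical names invented here, and the stand-in environment returns made-up rewards rather than calling the real server or client.

```python
from dataclasses import dataclass
from typing import Callable, Protocol

MAX_STEPS = 15  # step cap stated in this README


@dataclass
class Obs:
    """Hypothetical, trimmed-down observation used only for this sketch."""
    feedback: str
    reward: float
    done: bool


class Env(Protocol):
    """Assumed interface: reset with a task id, then step with an action."""
    def reset(self, task_id: str) -> Obs: ...
    def step(self, action_type: str, content: str) -> Obs: ...


def run_episode(env: Env, task_id: str,
                policy: Callable[[Obs, int], tuple[str, str]]) -> list[float]:
    """Drive one episode and return the per-step rewards."""
    obs = env.reset(task_id)
    rewards = []
    for step in range(1, MAX_STEPS + 1):
        action_type, content = policy(obs, step)
        obs = env.step(action_type, content)
        rewards.append(obs.reward)
        if obs.done:  # a submission (or the step cap) ends the episode
            break
    return rewards


class FakeEnv:
    """Stand-in environment with invented rewards; ends on any submit action."""
    def reset(self, task_id):
        return Obs(feedback=f"instructions for {task_id}", reward=0.0, done=False)

    def step(self, action_type, content):
        if action_type in ("submit", "submit_file"):
            return Obs(feedback="graded", reward=0.999, done=True)
        return Obs(feedback="stdout...", reward=0.05, done=False)


def explore_then_submit(obs, step):
    # Two exploration steps, then a text answer (QA-style task).
    if step <= 2:
        return "code", "print('inspect workbook')"
    return "submit", "42"


rewards = run_episode(FakeEnv(), "task_1", explore_then_submit)
print(rewards)  # [0.05, 0.05, 0.999]
```

A MODIFY task would follow the same loop but finish with `"submit_file"` and a file path as the content.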
| ## Tasks (10 total) | |
| | # | Task ID | Title | Difficulty | Type | Category | | |
| |---|---------|-------|------------|------|----------| | |
| | 1 | `task_1` | Count Plants in Spreadsheet | Easy | QA | Calculation | | |
| | 2 | `task_2` | Retrieve TW EOL Charge | Easy | QA | Cross-sheet Retrieval | | |
| | 3 | `task_3` | Portfolio Mark-to-Market Change | Easy | QA | Calculation | | |
| | 4 | `task_4` | Summarize Pipeline Imbalances | Medium | MODIFY | Calculation | | |
| | 5 | `task_5` | Audit and Correct Formula Errors | Medium | MODIFY | Validation / Review | | |
| | 6 | `task_6` | Create Table and Apply Filter | Medium | MODIFY | Structuring / Formatting | | |
| | 7 | `task_7` | Add Weekday Row and Data Entry | Medium | MODIFY | Data Entry / Import | | |
| | 8 | `task_8` | Balance Sheet Validation & Indicators | Hard | MODIFY | Validation, Calculation | | |
| | 9 | `task_9` | Create Scenario3 Worksheet | Hard | MODIFY | Financial Modeling | | |
| | 10 | `task_10` | Consolidate by Type and Area | Hard | MODIFY | Multi-type | | |
| ### Difficulty Progression | |
| - **Easy (3 tasks):** QA tasks that read the spreadsheet and answer a question. | |
| - **Medium (4 tasks):** MODIFY tasks that edit or augment the workbook (summaries, audits, formatting, data entry). | |
| - **Hard (3 tasks):** MODIFY tasks involving complex multi-sheet operations (validation, new scenario sheets, consolidation). | |
| ## Action & Observation Spaces | |
| ### Action: `FinancialAction` | |
| | Field | Type | Description | | |
| |-------|------|-------------| | |
| | `action_type` | `str` | `"code"` to execute Python, `"submit"` for text answer, `"submit_file"` for xlsx | | |
| | `content` | `str` | Python code, text answer, or file path | | |
| ### Observation: `FinancialObservation` | |
| | Field | Type | Description | | |
| |-------|------|-------------| | |
| | `task_id` | `str` | Current task identifier | | |
| | `task_description` | `str` | Full task instructions + xlsx summary | | |
| | `source_file` | `str` | Path to the working xlsx copy | | |
| | `difficulty` | `str` | `easy`, `medium`, or `hard` | | |
| | `task_type` | `str` | `QA` or `MODIFY` | | |
| | `feedback` | `str` | Code output or grading result | | |
| | `current_step` | `int` | Current step (max 15) | | |
| | `done` | `bool` | Whether the episode is finished | | |
| `reward` | `float` | Reward for this step (0.0–1.0) | |
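As a rough picture of the two tables above, the fields can be written as plain dataclasses. This is a sketch, not the contents of `models.py`: the class names `ActionSketch` and `ObservationSketch`, the default values, and the validation in `__post_init__` are assumptions made for illustration.

```python
from dataclasses import dataclass

VALID_ACTION_TYPES = {"code", "submit", "submit_file"}


@dataclass
class ActionSketch:
    """Mirrors the Action table: what to do, and the payload for it."""
    action_type: str  # "code", "submit", or "submit_file"
    content: str      # Python code, text answer, or file path

    def __post_init__(self):
        # Assumed behaviour: reject action types the table does not list.
        if self.action_type not in VALID_ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")


@dataclass
class ObservationSketch:
    """Mirrors the Observation table; defaults are illustrative only."""
    task_id: str
    task_description: str
    source_file: str
    difficulty: str        # "easy", "medium", or "hard"
    task_type: str         # "QA" or "MODIFY"
    feedback: str = ""     # code output or grading result
    current_step: int = 0  # capped at 15 per the table
    done: bool = False
    reward: float = 0.0


action = ActionSketch(action_type="code", content="print('hello')")
obs = ObservationSketch(
    task_id="task_1",
    task_description="Count the plants in the spreadsheet.",
    source_file="/tmp/work.xlsx",  # hypothetical path
    difficulty="easy",
    task_type="QA",
)
print(action.action_type, obs.task_type, obs.done)
```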
| ## Reward Design | |
| | Action | Reward | Signal | | |
| |--------|--------|--------| | |
| `code` (failed) | 0.005 | Penalized: syntax/runtime error | |
| `code` (simple) | ~0.02 | Minimal: just imports and a print | |
| `code` (exploration) | ~0.05 | Good: reading data, producing output | |
| `code` (modification + save) | ~0.06–0.10 | Best: actively editing the workbook | |
| `submit` / `submit_file` | 0.001–0.999 | Full grading against reference | |
| | Max steps (15) | Episode ends | | | |
| Code step rewards are computed from: | |
| - **Execution success:** failed code gets only 0.005 | |
| - **Substantive lines:** lines beyond imports/comments earn +0.002 each (up to +0.03) | |
| - **Output produced:** printing data earns +0.001 per line (up to +0.02) | |
| - **Save operations:** calling `.save()` earns +0.03 (agent is modifying the workbook) | |
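The components listed above can be combined in a few lines. This is a sketch under assumptions: the README does not state the base reward for code that runs successfully, so `base=0.02` is a guess chosen to land near the "~0.02" and "~0.06–0.10" figures in the table, and the function names and the rule for skipping blank lines are hypothetical.

```python
def count_substantive_lines(code):
    """Count lines that are not blank, comments, or imports (assumed rule)."""
    count = 0
    for line in code.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if stripped.startswith(("import ", "from ")):
            continue
        count += 1
    return count


def code_step_reward(succeeded, substantive_lines, output_lines, saved, base=0.02):
    """Combine the documented bonuses; `base` is an assumption, not documented."""
    if not succeeded:
        return 0.005                                  # failed code
    reward = base
    reward += min(0.002 * substantive_lines, 0.03)    # substantive lines, capped
    reward += min(0.001 * output_lines, 0.02)         # printed output, capped
    if saved:
        reward += 0.03                                # a .save() call was made
    return round(reward, 3)


snippet = "import openpyxl\n# load\nwb = openpyxl.load_workbook('x.xlsx')\nprint(wb.sheetnames)\n"
print(count_substantive_lines(snippet))        # 2
print(code_step_reward(False, 0, 0, False))    # 0.005
print(code_step_reward(True, 1, 1, False))     # 0.023
print(code_step_reward(True, 10, 10, False))   # 0.05
print(code_step_reward(True, 15, 5, True))     # 0.085
```

With this assumed base, the maximum code-step reward is 0.02 + 0.03 + 0.02 + 0.03 = 0.10, which matches the upper end of the range in the table.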
| **QA grading:** Numeric extraction with 5% tolerance + keyword overlap. | |
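The two QA signals can be illustrated separately. This is a sketch, not the code in `graders.py`: the regular expressions, the function names, and the rule that every expected number must be matched are assumptions, and the README does not say how the two signals are weighted into a final score, so no combination is shown.

```python
import re


def extract_numbers(text):
    """Pull numbers out of free text, ignoring thousands separators."""
    cleaned = text.replace(",", "")
    return [float(tok) for tok in re.findall(r"-?\d+(?:\.\d+)?", cleaned)]


def numbers_match(answer, expected, rel_tol=0.05):
    """True if every expected number has an answer number within 5% of it."""
    expected_nums = extract_numbers(expected)
    answer_nums = extract_numbers(answer)
    if not expected_nums:
        return False
    return all(
        any(abs(a - e) <= rel_tol * abs(e) for a in answer_nums)
        for e in expected_nums
    )


def keyword_overlap(answer, expected):
    """Fraction of the expected answer's words that also appear in the answer."""
    expected_words = set(re.findall(r"[a-z]+", expected.lower()))
    answer_words = set(re.findall(r"[a-z]+", answer.lower()))
    if not expected_words:
        return 0.0
    return len(expected_words & answer_words) / len(expected_words)


expected = "Total plants: 120"
print(numbers_match("There are 118 plants in total", expected))    # True
print(keyword_overlap("There are 118 plants in total", expected))  # 1.0
print(numbers_match("About 100 plants", expected))                 # False
print(keyword_overlap("About 100 plants", expected))               # 0.5
```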
| **MODIFY grading:** 30% sheet-name match + 70% cell-level comparison (2% numeric tolerance). | |
| All scores are clamped to the closed interval [0.001, 0.999]. | |
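The MODIFY weighting and the final clamp can be sketched with workbooks represented as plain dictionaries (`{sheet_name: {cell_address: value}}`) instead of real xlsx files. The function names, the dictionary representation, and the handling of a zero reference value are assumptions made for this illustration; the actual grader may differ.

```python
def clamp(score, lo=0.001, hi=0.999):
    """Keep a score within the documented bounds."""
    return max(lo, min(hi, score))


def cells_equal(got, want, rel_tol=0.02):
    """Numbers match within 2% relative tolerance; other values must be equal."""
    if isinstance(got, (int, float)) and isinstance(want, (int, float)):
        if want == 0:
            return abs(got) <= rel_tol  # assumption: absolute check at zero
        return abs(got - want) <= rel_tol * abs(want)
    return got == want


def modify_score(submitted, reference):
    """30% sheet-name match + 70% cell-level match, then clamp."""
    ref_sheets = set(reference)
    sheet_score = len(ref_sheets & set(submitted)) / len(ref_sheets) if ref_sheets else 0.0
    total = matched = 0
    for sheet, cells in reference.items():
        for address, want in cells.items():
            total += 1
            got = submitted.get(sheet, {}).get(address)
            if cells_equal(got, want):
                matched += 1
    cell_score = matched / total if total else 0.0
    return clamp(0.3 * sheet_score + 0.7 * cell_score)


reference = {"Summary": {"A1": "Total", "B1": 100.0}, "Data": {"A1": 5}}
submitted = {"Summary": {"A1": "Total", "B1": 101.0}}  # "Data" sheet is missing

print(round(modify_score(submitted, reference), 3))  # 0.617
print(modify_score(reference, reference))            # 0.999
print(modify_score({}, reference))                   # 0.001
```

In the first call, one of two sheet names matches (0.5) and two of three reference cells match (101.0 is within 2% of 100.0), giving 0.3 × 0.5 + 0.7 × 2/3 ≈ 0.617.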
| ## Setup & Usage | |
| ### Prerequisites | |
| - Python 3.10+ | |
| - Docker (for containerized deployment) | |
| - `pip install openenv-core openpyxl` | |
| ### Local Development | |
| ```bash | |
| pip install -e ".[dev]" | |
| PYTHONPATH=. uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload | |
| ``` | |
| ### Docker | |
| ```bash | |
| docker build -t financial-task-env:latest . | |
| docker run -p 8000:8000 financial-task-env:latest | |
| ``` | |
| ### Baseline Inference | |
| ```bash | |
| export API_BASE_URL="https://api.openai.com/v1" | |
| export MODEL_NAME="gpt-4o-mini" | |
| export HF_TOKEN="your-api-key" | |
| export ENV_URL="http://localhost:8000" | |
| python inference.py | |
| ``` | |
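A script can pick these variables up with `os.environ`. The sketch below is not taken from `inference.py`: the `load_settings` helper and its fallback defaults (which simply mirror the example exports above) are hypothetical.

```python
import os


def load_settings(env=None):
    """Read the four variables exported above; defaults are illustrative."""
    env = os.environ if env is None else env
    return {
        "api_base_url": env.get("API_BASE_URL", "https://api.openai.com/v1"),
        "model_name": env.get("MODEL_NAME", "gpt-4o-mini"),
        "api_key": env.get("HF_TOKEN", ""),  # token used as the API key
        "env_url": env.get("ENV_URL", "http://localhost:8000"),
    }


# Passing a dict instead of os.environ keeps the example deterministic.
settings = load_settings({"MODEL_NAME": "my-model", "HF_TOKEN": "secret"})
print(settings["model_name"], settings["env_url"])
```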
| ## Baseline Scores | |
| The environment includes 10 tasks, but the baseline inference runs 5 representative | |
| tasks (3 easy + 1 medium + 1 hard) to stay within the 20-minute runtime constraint. | |
| ### Run 1: `MiniMaxAI/MiniMax-M2.5` (via HuggingFace Router) | |
| | Task | Difficulty | Type | Score | Step Rewards | | |
| |------|------------|------|-------|-------------| | |
| task_1: Count Plants | Easy | QA | 0.001 | 0.05, 0.06, 0.06, 0.06, 0.00 | |
| task_2: Retrieve EOL Charge | Easy | QA | 0.001 | 0.04, 0.01, 0.07, 0.06, 0.02, 0.00 | |
| task_3: Portfolio MTM Change | Easy | QA | 0.367 | 0.06, 0.01, 0.07, ..., 0.37 | |
| task_5: Audit Formulas | Medium | MODIFY | **0.958** | 0.07, 0.01, 0.07, ..., 0.96 | |
| task_8: Balance Sheet Validation | Hard | MODIFY | 0.001 | 0.06, 0.01, 0.06, ..., 0.05 | |
| | **Average** | | | **0.266** | | | |
| **Runtime:** 12 min 10 sec (limit: 20 min) · **Server memory:** ~40 MB (limit: 8 GB) | |
| Note: step rewards vary with code quality (failed code gets 0.005, exploration | |
| ~0.05, modification + save ~0.06–0.10). | |
| ### Run 2: `google/gemma-4-26B-A4B-it` | |
| | Task | Difficulty | Type | Score | | |
| |------|------------|------|-------| | |
| | task_1: Count Plants | Easy | QA | 0.001 | | |
| | task_2: Retrieve EOL Charge | Easy | QA | **0.999** | | |
| | task_3: Portfolio MTM Change | Easy | QA | 0.001 | | |
| | task_5: Audit Formulas | Medium | MODIFY | 0.001 | | |
| | task_8: Balance Sheet Validation | Hard | MODIFY | 0.001 | | |
| | **Average** | | | **0.201** | | |
| **Runtime:** 19 min 27 sec (limit: 20 min) · **Server memory:** ~40 MB | |
| Gemma 4 26B solved task_2 perfectly in just 2 steps but timed out on more | |
| complex tasks due to longer generation times. | |
| ### Run 3: `Qwen/Qwen3.5-122B-A10B` | |
| | Task | Difficulty | Type | Score | | |
| |------|------------|------|-------| | |
| | task_1: Count Plants | Easy | QA | 0.001 | | |
| | task_2: Retrieve EOL Charge | Easy | QA | **0.999** | | |
| | task_3: Portfolio MTM Change | Easy | QA | 0.001 | | |
| | task_5: Audit Formulas | Medium | MODIFY | 0.001 | | |
| | task_8: Balance Sheet Validation | Hard | MODIFY | 0.001 | | |
| | **Average** | | | **0.201** | | |
| **Runtime:** 2 min 11 sec · Fast inference, but it hit the per-task timeout on complex tasks. | |
| ### Run 4: `deepseek-ai/DeepSeek-R1` | |
| | Task | Difficulty | Type | Score | | |
| |------|------------|------|-------| | |
| | task_1: Count Plants | Easy | QA | 0.001 | | |
| | task_2: Retrieve EOL Charge | Easy | QA | 0.001 | | |
| | task_3: Portfolio MTM Change | Easy | QA | 0.001 | | |
| | task_5: Audit Formulas | Medium | MODIFY | 0.001 | | |
| | task_8: Balance Sheet Validation | Hard | MODIFY | 0.001 | | |
| | **Average** | | | **0.001** | | |
| **Runtime:** 11 min 57 sec · DeepSeek-R1's long chain-of-thought reasoning consumed | |
| most of the output tokens, leaving answers that didn't parse correctly. | |
| ### Run 5: `MiniMaxAI/MiniMax-M2.1` (Best) | |
| | Task | Difficulty | Type | Score | Steps | | |
| |------|------------|------|-------|-------| | |
| | task_1: Count Plants | Easy | QA | 0.001 | 5 | | |
| | task_2: Retrieve EOL Charge | Easy | QA | **0.999** | 4 | | |
| | task_3: Portfolio MTM Change | Easy | QA | 0.001 | 10 | | |
| | task_5: Audit Formulas | Medium | MODIFY | **0.958** | 4 | | |
| | task_8: Balance Sheet Validation | Hard | MODIFY | **0.733** | 10 | | |
| | **Average** | | | **0.538** | | | |
| **Runtime:** 3 min 18 sec · Best overall performance: solved 3/5 tasks with high | |
| scores, including the hard MODIFY task (0.733). Fast and efficient. | |
| ### Model Comparison Summary | |
| | Model | Avg Score | Runtime | Best Task | | |
| |-------|-----------|---------|-----------| | |
| | **MiniMax-M2.1** | **0.538** | **3m 18s** | task_5: 0.958, task_8: 0.733 | | |
| | MiniMax-M2.5 | 0.266 | 12m 10s | task_5: 0.958 | | |
| | Gemma 4 26B | 0.201 | 19m 27s | task_2: 0.999 | | |
| | Qwen 3.5 122B | 0.201 | 2m 11s | task_2: 0.999 | | |
| DeepSeek-R1 | 0.001 | 11m 57s | n/a | |
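The averages in the summary can be recomputed from the per-task scores reported in the run tables above (order: task_1, task_2, task_3, task_5, task_8). The short model labels are taken from the summary table; the variable names are otherwise arbitrary.

```python
# Per-task scores copied from the five run tables.
runs = {
    "MiniMax-M2.1": [0.001, 0.999, 0.001, 0.958, 0.733],
    "MiniMax-M2.5": [0.001, 0.001, 0.367, 0.958, 0.001],
    "Gemma 4 26B": [0.001, 0.999, 0.001, 0.001, 0.001],
    "Qwen 3.5 122B": [0.001, 0.999, 0.001, 0.001, 0.001],
    "DeepSeek-R1": [0.001, 0.001, 0.001, 0.001, 0.001],
}

# Unweighted mean over the five tasks, rounded to three decimals.
averages = {model: round(sum(scores) / len(scores), 3) for model, scores in runs.items()}

for model, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {avg:.3f}")
```

The results (0.538, 0.266, 0.201, 0.201, 0.001) agree with the Avg Score column.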
| ## Project Structure | |
| ``` | |
| financial_task_env/ | |
| ├── __init__.py          # Module exports | |
| ├── models.py            # FinancialAction & FinancialObservation | |
| ├── tasks.py             # 10 task definitions + xlsx paths | |
| ├── graders.py           # QA grading + xlsx cell comparison | |
| ├── client.py            # FinancialTaskEnv (EnvClient) | |
| ├── inference.py         # Baseline inference script | |
| ├── openenv.yaml         # OpenEnv manifest | |
| ├── pyproject.toml       # Dependencies | |
| ├── Dockerfile           # Container image | |
| ├── data/                # xlsx source & reference files | |
| │   ├── 0/               # Balance sheet validation | |
| │   ├── 21/              # Data entry | |
| │   ├── 24/              # Scenario modeling | |
| │   ├── 34/              # Portfolio calculation | |
| │   ├── 35/              # Pipeline imbalances | |
| │   ├── 40/              # Formula audit | |
| │   ├── 60/              # Table formatting | |
| │   ├── 67/              # Consolidation | |
| │   ├── 118/             # Value retrieval | |
| │   └── 119/             # Plant counting | |
| └── server/ | |
|     ├── __init__.py | |
|     ├── financial_environment.py  # Code-execution environment | |
|     ├── app.py           # FastAPI application | |
|     └── Dockerfile | |
| ``` | |
| ## Environment Description | |
| This environment models real financial spreadsheet work: | |
| - **Data extraction:** read values from complex multi-sheet workbooks | |
| - **Calculation:** compute portfolio changes, imbalances, indicators | |
| - **Validation:** audit and fix formula errors in workbooks | |
| - **Data entry:** add rows, enter values, format new columns | |
| - **Structuring:** create tables, apply filters, build new worksheets | |
| - **Financial modeling:** replicate scenario sheets with new parameters | |
| - **Consolidation:** aggregate data across sheets into summary views | |
| Each task uses a genuine enterprise Excel workbook. MODIFY tasks are graded | |
| by comparing the submitted workbook's sheet names and cell values against a reference workbook. | |
| ## Acknowledgments | |
| The spreadsheet tasks and reference workbooks used in this environment are | |
| sourced from the **FinWorkBench (Finch)** dataset. If you use this environment | |
| in your research, please cite: | |
| ```bibtex | |
| @article{dong2025finch, | |
| title={Finch: Benchmarking Finance \& Accounting across Spreadsheet-Centric Enterprise Workflows}, | |
| author={Dong, Haoyu and Zhang, Pengkun and Gao, Yan and Dong, Xuanyu and Cheng, Yilin and Lu, Mingzhe and Yakefu, Adina and Zheng, Shuxin}, | |
| journal={arXiv preprint arXiv:2512.13168}, | |
| year={2025} | |
| } | |
| ``` | |