title: Financial Task Environment
emoji: 📊
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv

Financial Task Environment

An OpenEnv code-execution environment for training and evaluating AI agents on real-world finance & accounting spreadsheet tasks. Agents write Python code (using openpyxl) to read, analyze, and modify authentic Excel workbooks from enterprise workflows.

Motivation

Finance professionals spend hundreds of hours on spreadsheet-centric tasks: extracting values, computing ratios, auditing formulas, entering data, building scenarios, and consolidating reports. This environment provides 10 diverse tasks backed by real .xlsx files so agents can be trained and evaluated on the same kind of work.

How It Works

  1. Reset with a task_id → receive task instructions + xlsx file path + a summary of the spreadsheet contents.
  2. Execute code (action_type="code") → run Python code that reads or modifies the xlsx. The environment returns stdout/stderr.
  3. Submit a text answer (action_type="submit" for QA tasks) or a modified file (action_type="submit_file" for MODIFY tasks).
  4. The environment grades the submission: QA answers are scored by numeric matching + keyword overlap; MODIFY tasks are scored by cell-level comparison against a reference workbook.
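
As a rough illustration of the steps above, the actions for a QA episode might be built as shown here. This is a sketch, not the environment's client code: the `action_type` and `content` fields come from this README, while the helper name `build_qa_actions`, the workbook path, and the answer value are hypothetical. The openpyxl snippet is kept as a string because the environment, not the caller, executes it.

```python
import textwrap


def build_qa_actions(source_file: str, answer: str) -> list[dict]:
    """Return one exploratory code action followed by a text submission."""
    # Code sent as the `content` of a "code" action. data_only=True makes
    # openpyxl return cached cell values instead of formula strings.
    explore = textwrap.dedent(f"""
        import openpyxl

        wb = openpyxl.load_workbook({source_file!r}, data_only=True)
        for ws in wb.worksheets:
            print(ws.title, ws.max_row, ws.max_column)
            for row in ws.iter_rows(min_row=1, max_row=5, values_only=True):
                print(row)
    """).strip()
    return [
        {"action_type": "code", "content": explore},
        {"action_type": "submit", "content": answer},
    ]


# Hypothetical path and answer, for illustration only.
actions = build_qa_actions("/tmp/workdir/source.xlsx", "42")
# Syntax check only; the snippet is not executed here.
compile(actions[0]["content"], "<agent-code>", "exec")
for action in actions:
    print(action["action_type"])
```

In a real episode the path would come from the `source_file` field of the observation returned by the reset step, and a MODIFY task would end with a "submit_file" action instead of "submit".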

Tasks (10 total)

| # | Task ID | Title | Difficulty | Type | Category |
| --- | --- | --- | --- | --- | --- |
| 1 | task_1 | Count Plants in Spreadsheet | Easy | QA | Calculation |
| 2 | task_2 | Retrieve TW EOL Charge | Easy | QA | Cross-sheet Retrieval |
| 3 | task_3 | Portfolio Mark-to-Market Change | Easy | QA | Calculation |
| 4 | task_4 | Summarize Pipeline Imbalances | Medium | MODIFY | Calculation |
| 5 | task_5 | Audit and Correct Formula Errors | Medium | MODIFY | Validation / Review |
| 6 | task_6 | Create Table and Apply Filter | Medium | MODIFY | Structuring / Formatting |
| 7 | task_7 | Add Weekday Row and Data Entry | Medium | MODIFY | Data Entry / Import |
| 8 | task_8 | Balance Sheet Validation & Indicators | Hard | MODIFY | Validation, Calculation |
| 9 | task_9 | Create Scenario3 Worksheet | Hard | MODIFY | Financial Modeling |
| 10 | task_10 | Consolidate by Type and Area | Hard | MODIFY | Multi-type |

Difficulty Progression

  • Easy (3 tasks): QA tasks that read the spreadsheet and answer a question.
  • Medium (4 tasks): MODIFY tasks that edit or augment the workbook (summaries, audits, formatting, data entry).
  • Hard (3 tasks): MODIFY tasks with complex multi-sheet operations (validation, new scenario sheets, consolidation).

Action & Observation Spaces

Action: FinancialAction

| Field | Type | Description |
| --- | --- | --- |
| action_type | str | "code" to execute Python, "submit" for text answer, "submit_file" for xlsx |
| content | str | Python code, text answer, or file path |

Observation: FinancialObservation

| Field | Type | Description |
| --- | --- | --- |
| task_id | str | Current task identifier |
| task_description | str | Full task instructions + xlsx summary |
| source_file | str | Path to the working xlsx copy |
| difficulty | str | easy, medium, or hard |
| task_type | str | QA or MODIFY |
| feedback | str | Code output or grading result |
| current_step | int | Current step (max 15) |
| done | bool | Whether the episode is finished |
| reward | float | Reward for this step (0.0–1.0) |
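
The two field tables can be mirrored as plain dataclasses. This is only a sketch of the field layout: the real definitions live in models.py and may use a different base class, defaults, or validation. The `__post_init__` check and every default value below are assumptions.

```python
from dataclasses import dataclass

ACTION_TYPES = ("code", "submit", "submit_file")


@dataclass
class FinancialAction:
    action_type: str  # "code", "submit", or "submit_file"
    content: str      # Python code, text answer, or file path

    def __post_init__(self) -> None:
        # Assumed validation; the real model may handle bad types differently.
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")


@dataclass
class FinancialObservation:
    task_id: str
    task_description: str   # task instructions + xlsx summary
    source_file: str        # path to the working xlsx copy
    difficulty: str         # "easy", "medium", or "hard"
    task_type: str          # "QA" or "MODIFY"
    feedback: str = ""      # code output or grading result
    current_step: int = 0   # the episode ends at step 15
    done: bool = False
    reward: float = 0.0


action = FinancialAction(action_type="submit", content="42")
print(action)
```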

Reward Design

| Action | Reward | Signal |
| --- | --- | --- |
| code (failed) | 0.005 | Penalized: syntax/runtime error |
| code (simple) | ~0.02 | Minimal: just imports and a print |
| code (exploration) | ~0.05 | Good: reading data, producing output |
| code (modification + save) | ~0.06–0.10 | Best: actively editing the workbook |
| submit / submit_file | 0.001–0.999 | Full grading against reference |
| Max steps (15) | | Episode ends |

Code step rewards are computed from:

  • Execution success: failed code gets only 0.005
  • Substantive lines: lines beyond imports/comments earn +0.002 each (up to +0.03)
  • Output produced: printing data earns +0.001 per line (up to +0.02)
  • Save operations: calling .save() earns +0.03 (agent is modifying the workbook)
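
The four components above can be combined roughly as follows. This is a sketch under one loud assumption: the README does not state the base reward for successful code, so `base=0.02` is a guess chosen because it makes the capped bonuses add up to the 0.10 upper bound shown in the table. The function name and its parameters are hypothetical.

```python
def code_step_reward(
    succeeded: bool,
    substantive_lines: int,
    output_lines: int,
    saved: bool,
    base: float = 0.02,  # ASSUMPTION: not stated in this README
) -> float:
    """Approximate the per-step reward for a "code" action."""
    if not succeeded:
        return 0.005  # failed code gets only 0.005
    reward = base
    reward += min(0.002 * substantive_lines, 0.03)  # +0.002 per line, capped at +0.03
    reward += min(0.001 * output_lines, 0.02)       # +0.001 per output line, capped at +0.02
    if saved:
        reward += 0.03                              # bonus for calling .save()
    return round(reward, 3)


print(code_step_reward(False, 10, 5, False))  # failed execution
print(code_step_reward(True, 10, 10, False))  # exploration-style step
print(code_step_reward(True, 20, 30, True))   # modification + save, all caps hit
```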

QA grading: Numeric extraction with 5% tolerance + keyword overlap. MODIFY grading: 30% sheet-name match + 70% cell-level comparison (2% numeric tolerance).
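
A minimal sketch of the QA side, assuming numbers are extracted with a regular expression and compared with a 5% relative tolerance. How graders.py actually extracts numbers, tokenizes keywords, and weights the two signals is not documented here, so the regex, the helper names, and the 0.7 / 0.3 split are all illustrative assumptions.

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_numbers(text: str) -> list[float]:
    """Pull numeric literals out of free text, dropping thousands separators."""
    return [float(m) for m in NUMBER.findall(text.replace(",", ""))]


def numeric_match(answer: str, reference: str, tol: float = 0.05) -> float:
    """Fraction of reference numbers matched by an answer number within tol."""
    ref_nums = extract_numbers(reference)
    ans_nums = extract_numbers(answer)
    if not ref_nums:
        return 0.0
    hits = sum(
        any(abs(a - r) <= tol * abs(r) for a in ans_nums)
        for r in ref_nums
    )
    return hits / len(ref_nums)


def keyword_overlap(answer: str, reference: str) -> float:
    """Share of reference words that also appear in the answer."""
    ref_words = set(re.findall(r"[a-z]+", reference.lower()))
    ans_words = set(re.findall(r"[a-z]+", answer.lower()))
    return len(ref_words & ans_words) / len(ref_words) if ref_words else 0.0


def grade_qa(answer: str, reference: str, numeric_weight: float = 0.7) -> float:
    # ASSUMPTION: the weighting is illustrative; the README does not give it.
    return (numeric_weight * numeric_match(answer, reference)
            + (1 - numeric_weight) * keyword_overlap(answer, reference))


print(numeric_match("about 1,020 plants", "1000 plants"))  # 1020 is within 5% of 1000
print(numeric_match("1100 plants", "1000 plants"))         # 1100 is outside 5% of 1000
```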

All scores are clamped to the closed interval [0.001, 0.999], so a reported score is never exactly 0 or 1.
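
The MODIFY weighting and the score clamping can be sketched together. To stay self-contained, workbooks are modeled here as nested dicts rather than openpyxl objects. Which cells are compared, and how formulas or formatting are handled, is not specified in this README, so the comparison rules and helper names below are assumptions.

```python
Workbook = dict[str, dict[str, object]]  # sheet name -> {cell address: value}


def clamp(score: float) -> float:
    """Keep a score within the 0.001 to 0.999 bounds."""
    return max(0.001, min(0.999, score))


def cells_equal(got: object, want: object, tol: float = 0.02) -> bool:
    """Numbers match within 2% of the reference; other values must be equal."""
    def is_number(value: object) -> bool:
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    if is_number(got) and is_number(want):
        return abs(got - want) <= tol * abs(want)
    return got == want


def grade_modify(submitted: Workbook, reference: Workbook) -> float:
    """30% sheet-name match + 70% cell-level match, then clamp."""
    shared = [name for name in reference if name in submitted]
    sheet_score = len(shared) / len(reference) if reference else 0.0
    total_cells = sum(len(cells) for cells in reference.values())
    matched = sum(
        cells_equal(submitted[name].get(addr), want)
        for name in shared
        for addr, want in reference[name].items()
    )
    cell_score = matched / total_cells if total_cells else 0.0
    return clamp(0.3 * sheet_score + 0.7 * cell_score)


reference = {"Summary": {"A1": "Total", "B1": 100.0}}
close = {"Summary": {"A1": "Total", "B1": 101.0}}  # within 2% of 100.0
wrong = {"Sheet1": {"A1": "Total", "B1": 100.0}}   # right cells, wrong sheet name
print(grade_modify(close, reference), grade_modify(wrong, reference))
```

Under these assumptions a fully matching workbook tops out at 0.999 and a workbook with no matching sheet bottoms out at 0.001, which is consistent with the score range reported for submissions.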

Setup & Usage

Prerequisites

  • Python 3.10+
  • Docker (for containerized deployment)
  • pip install openenv-core openpyxl

Local Development

pip install -e ".[dev]"
PYTHONPATH=. uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload

Docker

docker build -t financial-task-env:latest .
docker run -p 8000:8000 financial-task-env:latest

Baseline Inference

export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export HF_TOKEN="your-api-key"
export ENV_URL="http://localhost:8000"
python inference.py

Baseline Scores

The environment includes 10 tasks, but the baseline inference runs 5 representative tasks (3 easy + 1 medium + 1 hard) to stay within the 20-minute runtime constraint.

Run 1: MiniMaxAI/MiniMax-M2.5 (via Hugging Face Router)

| Task | Difficulty | Type | Score | Step Rewards |
| --- | --- | --- | --- | --- |
| task_1: Count Plants | Easy | QA | 0.001 | 0.05, 0.06, 0.06, 0.06, 0.00 |
| task_2: Retrieve EOL Charge | Easy | QA | 0.001 | 0.04, 0.01, 0.07, 0.06, 0.02, 0.00 |
| task_3: Portfolio MTM Change | Easy | QA | 0.367 | 0.06, 0.01, 0.07, ..., 0.37 |
| task_5: Audit Formulas | Medium | MODIFY | 0.958 | 0.07, 0.01, 0.07, ..., 0.96 |
| task_8: Balance Sheet Validation | Hard | MODIFY | 0.001 | 0.06, 0.01, 0.06, ..., 0.05 |
| Average | | | 0.266 | |

Runtime: 12 min 10 sec (limit: 20 min) · Server memory: ~40 MB (limit: 8 GB)

Note: Step rewards vary with code quality: failed code gets 0.005, exploration ~0.05, modification+save ~0.06–0.10.

Run 2: google/gemma-4-26B-A4B-it

| Task | Difficulty | Type | Score |
| --- | --- | --- | --- |
| task_1: Count Plants | Easy | QA | 0.001 |
| task_2: Retrieve EOL Charge | Easy | QA | 0.999 |
| task_3: Portfolio MTM Change | Easy | QA | 0.001 |
| task_5: Audit Formulas | Medium | MODIFY | 0.001 |
| task_8: Balance Sheet Validation | Hard | MODIFY | 0.001 |
| Average | | | 0.201 |

Runtime: 19 min 27 sec (limit: 20 min) · Server memory: ~40 MB

Gemma 4 26B solved task_2 perfectly in just 2 steps but timed out on more complex tasks due to longer generation times.

Run 3: Qwen/Qwen3.5-122B-A10B

| Task | Difficulty | Type | Score |
| --- | --- | --- | --- |
| task_1: Count Plants | Easy | QA | 0.001 |
| task_2: Retrieve EOL Charge | Easy | QA | 0.999 |
| task_3: Portfolio MTM Change | Easy | QA | 0.001 |
| task_5: Audit Formulas | Medium | MODIFY | 0.001 |
| task_8: Balance Sheet Validation | Hard | MODIFY | 0.001 |
| Average | | | 0.201 |

Runtime: 2 min 11 sec · Fast inference, but the model hit the per-task timeout on complex tasks.

Run 4: deepseek-ai/DeepSeek-R1

| Task | Difficulty | Type | Score |
| --- | --- | --- | --- |
| task_1: Count Plants | Easy | QA | 0.001 |
| task_2: Retrieve EOL Charge | Easy | QA | 0.001 |
| task_3: Portfolio MTM Change | Easy | QA | 0.001 |
| task_5: Audit Formulas | Medium | MODIFY | 0.001 |
| task_8: Balance Sheet Validation | Hard | MODIFY | 0.001 |
| Average | | | 0.001 |

Runtime: 11 min 57 sec · DeepSeek-R1's long chain-of-thought reasoning consumed most of the output tokens, leaving answers that didn't parse correctly.

Run 5: MiniMaxAI/MiniMax-M2.1 (Best)

| Task | Difficulty | Type | Score | Steps |
| --- | --- | --- | --- | --- |
| task_1: Count Plants | Easy | QA | 0.001 | 5 |
| task_2: Retrieve EOL Charge | Easy | QA | 0.999 | 4 |
| task_3: Portfolio MTM Change | Easy | QA | 0.001 | 10 |
| task_5: Audit Formulas | Medium | MODIFY | 0.958 | 4 |
| task_8: Balance Sheet Validation | Hard | MODIFY | 0.733 | 10 |
| Average | | | 0.538 | |

Runtime: 3 min 18 sec · Best overall performance: scored highly on 3 of 5 tasks, including the hard MODIFY task (0.733), with the second-shortest runtime of the five runs.

Model Comparison Summary

| Model | Avg Score | Runtime | Best Task |
| --- | --- | --- | --- |
| MiniMax-M2.1 | 0.538 | 3m 18s | task_2: 0.999, task_5: 0.958, task_8: 0.733 |
| MiniMax-M2.5 | 0.266 | 12m 10s | task_5: 0.958 |
| Gemma 4 26B | 0.201 | 19m 27s | task_2: 0.999 |
| Qwen 3.5 122B | 0.201 | 2m 11s | task_2: 0.999 |
| DeepSeek-R1 | 0.001 | 11m 57s | none |
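
The averages in the summary are plain means of the five per-task scores. A short check, using only the scores listed in the run tables (the dictionary layout is just for illustration):

```python
from statistics import mean

# Per-task scores in the order task_1, task_2, task_3, task_5, task_8.
runs = {
    "MiniMax-M2.1": [0.001, 0.999, 0.001, 0.958, 0.733],
    "MiniMax-M2.5": [0.001, 0.001, 0.367, 0.958, 0.001],
    "Gemma 4 26B": [0.001, 0.999, 0.001, 0.001, 0.001],
    "Qwen 3.5 122B": [0.001, 0.999, 0.001, 0.001, 0.001],
    "DeepSeek-R1": [0.001] * 5,
}

# Round to three decimals, matching the precision used in the tables.
averages = {model: round(mean(scores), 3) for model, scores in runs.items()}
for model, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {avg:.3f}")
```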

Project Structure

financial_task_env/
├── __init__.py              # Module exports
├── models.py                # FinancialAction & FinancialObservation
├── tasks.py                 # 10 task definitions + xlsx paths
├── graders.py               # QA grading + xlsx cell comparison
├── client.py                # FinancialTaskEnv (EnvClient)
├── inference.py             # Baseline inference script
├── openenv.yaml             # OpenEnv manifest
├── pyproject.toml           # Dependencies
├── Dockerfile               # Container image
├── data/                    # xlsx source & reference files
│   ├── 0/                   # Balance sheet validation
│   ├── 21/                  # Data entry
│   ├── 24/                  # Scenario modeling
│   ├── 34/                  # Portfolio calculation
│   ├── 35/                  # Pipeline imbalances
│   ├── 40/                  # Formula audit
│   ├── 60/                  # Table formatting
│   ├── 67/                  # Consolidation
│   ├── 118/                 # Value retrieval
│   └── 119/                 # Plant counting
└── server/
    ├── __init__.py
    ├── financial_environment.py  # Code-execution environment
    ├── app.py                    # FastAPI application
    └── Dockerfile

Environment Description

This environment models real financial spreadsheet work:

  • Data extraction: read values from complex multi-sheet workbooks
  • Calculation: compute portfolio changes, imbalances, indicators
  • Validation: audit and fix formula errors in workbooks
  • Data entry: add rows, enter values, format new columns
  • Structuring: create tables, apply filters, build new worksheets
  • Financial modeling: replicate scenario sheets with new parameters
  • Consolidation: aggregate data across sheets into summary views

Each task uses a genuine enterprise Excel workbook. MODIFY tasks are graded by comparing the submitted workbook's sheets and cells against a reference workbook.

Acknowledgments

The spreadsheet tasks and reference workbooks used in this environment are sourced from the FinWorkBench (Finch) dataset. If you use this environment in your research, please cite:

@article{dong2025finch,
  title={Finch: Benchmarking Finance \& Accounting across Spreadsheet-Centric Enterprise Workflows},
  author={Dong, Haoyu and Zhang, Pengkun and Gao, Yan and Dong, Xuanyu and Cheng, Yilin and Lu, Mingzhe and Yakefu, Adina and Zheng, Shuxin},
  journal={arXiv preprint arXiv:2512.13168},
  year={2025}
}