Spaces: bpHigh/financial-task-env

bpHigh committed 15 days ago

Commit 7485602

1 Parent(s): ea742f5

Financial Task Environment — code execution with real xlsx

Files changed (34)

.dockerignore +10 -0
.gitattributes +1 -0
Dockerfile +33 -0
README.md +181 -2
__init__.py +10 -0
client.py +45 -0
data/0/0_ref_0.xlsx +3 -0
data/0/0_src_0.xlsx +3 -0
data/118/118_src_0.xlsx +3 -0
data/119/119_src_0.xlsx +3 -0
data/21/21_ref_0.xlsx +3 -0
data/21/21_src_0.xlsx +3 -0
data/24/24_ref_0.xlsx +3 -0
data/24/24_src_0.xlsx +3 -0
data/34/34_src_0.xlsx +3 -0
data/35/35_ref_0.xlsx +3 -0
data/35/35_src_0.xlsx +3 -0
data/40/40_ref_0.xlsx +3 -0
data/40/40_src_0.xlsx +3 -0
data/60/60_ref_0.xlsx +3 -0
data/60/60_src_0.xlsx +3 -0
data/67/67_ref_0.xlsx +3 -0
data/67/67_src_0.xlsx +3 -0
graders.py +183 -0
inference.py +325 -0
models.py +44 -0
openenv.yaml +47 -0
pyproject.toml +35 -0
server/Dockerfile +39 -0
server/__init__.py +1 -0
server/app.py +24 -0
server/financial_environment.py +297 -0
tasks.py +284 -0
uv.lock +0 -0

.dockerignore ADDED

	@@ -0,0 +1,10 @@

+__pycache__/
+*.pyc
+*.pyo
+.git/
+.venv/
+outputs/
+*.egg-info/
+dist/
+build/
+.pytest_cache/

.gitattributes ADDED

	@@ -0,0 +1 @@


+*.xlsx filter=lfs diff=lfs merge=lfs -text

Dockerfile ADDED

	@@ -0,0 +1,33 @@

+# Also place Dockerfile at repo root (required by validation)
+FROM python:3.11-slim
+WORKDIR /app/env
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git curl \
+    && rm -rf /var/lib/apt/lists/*
+RUN curl -LsSf https://astral.sh/uv/install.sh | sh && \
+    mv /root/.local/bin/uv /usr/local/bin/uv && \
+    mv /root/.local/bin/uvx /usr/local/bin/uvx
+COPY pyproject.toml /app/env/
+RUN uv pip install --system --no-cache-dir \
+    "openenv-core>=0.2.0" \
+    "fastapi>=0.104.0" \
+    "uvicorn>=0.24.0" \
+    "pydantic>=2.0.0" \
+    "websockets>=12.0" \
+    "openpyxl>=3.1.0"
+COPY . /app/env/
+ENV PYTHONPATH="/app/env:$PYTHONPATH"
+HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
+EXPOSE 8000
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]

README.md CHANGED

@@ -1,2 +1,181 @@
-# openenv_financial_task_env
-An openenv environment for day to day financial tasks

+---
+title: Financial Task Environment
+emoji: 📊
+colorFrom: green
+colorTo: blue
+sdk: docker
+pinned: false
+app_port: 8000
+base_path: /web
+tags:
+  - openenv
+---
+# Financial Task Environment
+An [OpenEnv](https://github.com/meta-pytorch/OpenEnv) **code-execution
+environment** for training and evaluating AI agents on **real-world finance
+& accounting spreadsheet tasks**.  Agents write Python code (using
+`openpyxl`) to read, analyze, and modify authentic Excel workbooks from
+enterprise workflows.
+## Motivation
+Finance professionals spend hundreds of hours on spreadsheet-centric tasks —
+extracting values, computing ratios, auditing formulas, entering data, building
+scenarios, and consolidating reports.  This environment provides 10 diverse
+tasks backed by real `.xlsx` files so agents can be trained and evaluated on
+the same kind of work.
+## How It Works
+1. **Reset** with a `task_id` → receive task instructions + xlsx file path + a
+   summary of the spreadsheet contents.
+2. **Execute code** (`action_type="code"`) → run Python code that reads or
+   modifies the xlsx.  The environment returns stdout/stderr.
+3. **Submit** a text answer (`action_type="submit"` for QA tasks) or a modified
+   file (`action_type="submit_file"` for MODIFY tasks).
+4. The environment **grades** the submission: QA answers are scored by numeric
+   matching + keyword overlap; MODIFY tasks are scored by cell-level comparison
+   against a reference workbook.
+## Tasks (10 total)
+| # | Task ID | Title | Difficulty | Type | Category |
+|---|---------|-------|------------|------|----------|
+| 1 | `task_1` | Count Plants in Spreadsheet | Easy | QA | Calculation |
+| 2 | `task_2` | Retrieve TW EOL Charge | Easy | QA | Cross-sheet Retrieval |
+| 3 | `task_3` | Portfolio Mark-to-Market Change | Easy | QA | Calculation |
+| 4 | `task_4` | Summarize Pipeline Imbalances | Medium | MODIFY | Calculation |
+| 5 | `task_5` | Audit and Correct Formula Errors | Medium | MODIFY | Validation / Review |
+| 6 | `task_6` | Create Table and Apply Filter | Medium | MODIFY | Structuring / Formatting |
+| 7 | `task_7` | Add Weekday Row and Data Entry | Medium | MODIFY | Data Entry / Import |
+| 8 | `task_8` | Balance Sheet Validation & Indicators | Hard | MODIFY | Validation, Calculation |
+| 9 | `task_9` | Create Scenario3 Worksheet | Hard | MODIFY | Financial Modeling |
+| 10 | `task_10` | Consolidate by Type and Area | Hard | MODIFY | Multi-type |
+### Difficulty Progression
+- **Easy (3 tasks):** QA — read the spreadsheet and answer a question.
+- **Medium (4 tasks):** MODIFY — edit/augment the workbook (summaries, audits, formatting, data entry).
+- **Hard (3 tasks):** MODIFY — complex multi-sheet operations (validation, new scenario sheets, consolidation).
+## Action & Observation Spaces
+### Action — `FinancialAction`
+| Field | Type | Description |
+|-------|------|-------------|
+| `action_type` | `str` | `"code"` to execute Python, `"submit"` for text answer, `"submit_file"` for xlsx |
+| `content` | `str` | Python code, text answer, or file path |
+### Observation — `FinancialObservation`
+| Field | Type | Description |
+|-------|------|-------------|
+| `task_id` | `str` | Current task identifier |
+| `task_description` | `str` | Full task instructions + xlsx summary |
+| `source_file` | `str` | Path to the working xlsx copy |
+| `difficulty` | `str` | `easy`, `medium`, or `hard` |
+| `task_type` | `str` | `QA` or `MODIFY` |
+| `feedback` | `str` | Code output or grading result |
+| `current_step` | `int` | Current step (max 15) |
+| `done` | `bool` | Whether the episode is finished |
+| `reward` | `float` | Reward for this step (0.0–1.0) |
+## Reward Design
+| Action | Reward | Signal |
+|--------|--------|--------|
+| `code` | 0.02 | Small reward for active exploration |
+| `submit` / `submit_file` | 0.0–1.0 | Graded against reference |
+| Max steps (15) | Episode ends | |
+**QA grading:** Numeric extraction with 5% tolerance + keyword overlap (weighted 80/20; full score when every reference number is matched).
+**MODIFY grading:** 30% sheet-name match + 70% cell-level comparison (2% numeric tolerance).
+## Setup & Usage
+### Prerequisites
+- Python 3.10+
+- Docker (for containerized deployment)
+- `pip install openenv-core openpyxl`
+### Local Development
+```bash
+pip install -e ".[dev]"
+PYTHONPATH=. uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
+```
+### Docker
+```bash
+docker build -t financial-task-env:latest .
+docker run -p 8000:8000 financial-task-env:latest
+```
+### Baseline Inference
+```bash
+export API_BASE_URL="https://api.openai.com/v1"
+export MODEL_NAME="gpt-4o-mini"
+export HF_TOKEN="your-api-key"
+export ENV_URL="http://localhost:8000"
+python inference.py
+```
+## Baseline Scores
+| Difficulty | Type | Expected Range |
+|------------|------|---------------|
+| Easy | QA | 0.60 – 1.00 |
+| Medium | MODIFY | 0.30 – 0.80 |
+| Hard | MODIFY | 0.10 – 0.60 |
+## Project Structure
+```
+financial_task_env/
+├── __init__.py              # Module exports
+├── models.py                # FinancialAction & FinancialObservation
+├── tasks.py                 # 10 task definitions + xlsx paths
+├── graders.py               # QA grading + xlsx cell comparison
+├── client.py                # FinancialTaskEnv (EnvClient)
+├── inference.py             # Baseline inference script
+├── openenv.yaml             # OpenEnv manifest
+├── pyproject.toml           # Dependencies
+├── Dockerfile               # Container image
+├── data/                    # xlsx source & reference files
+│   ├── 0/                   # Balance sheet validation
+│   ├── 21/                  # Data entry
+│   ├── 24/                  # Scenario modeling
+│   ├── 34/                  # Portfolio calculation
+│   ├── 35/                  # Pipeline imbalances
+│   ├── 40/                  # Formula audit
+│   ├── 60/                  # Table formatting
+│   ├── 67/                  # Consolidation
+│   ├── 118/                 # Value retrieval
+│   └── 119/                 # Plant counting
+└── server/
+    ├── __init__.py
+    ├── financial_environment.py  # Code-execution environment
+    ├── app.py                    # FastAPI application
+    └── Dockerfile
+```
+## Environment Description
+This environment models real financial spreadsheet work:
+- **Data extraction** — read values from complex multi-sheet workbooks
+- **Calculation** — compute portfolio changes, imbalances, indicators
+- **Validation** — audit and fix formula errors in workbooks
+- **Data entry** — add rows, enter values, format new columns
+- **Structuring** — create tables, apply filters, build new worksheets
+- **Financial modeling** — replicate scenario sheets with new parameters
+- **Consolidation** — aggregate data across sheets into summary views
+Each task uses a genuine enterprise Excel workbook.  MODIFY tasks are graded
+by cell-level comparison against a reference workbook.
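
The MODIFY scoring rule stated in the README's Reward Design section (30% sheet-name match plus 70% cell-level match with 2% numeric tolerance) can be worked through on toy data. The following is a minimal sketch, not the repository's grader: the function names and the sample cells are hypothetical, and cells are plain numbers in dictionaries rather than values loaded from a workbook.

```python
from typing import Dict, Hashable, Set


def number_close(actual: float, expected: float, rel_tol: float) -> bool:
    """Relative-tolerance check; an expected value of 0 needs a near-zero actual."""
    if expected == 0:
        return abs(actual) < 1e-6
    return abs(actual - expected) / abs(expected) <= rel_tol


def modify_score(ref_cells: Dict[Hashable, float], out_cells: Dict[Hashable, float],
                 ref_sheets: Set[str], out_sheets: Set[str]) -> float:
    """30% sheet-name match + 70% fraction of reference cells matched within 2%."""
    sheet_score = len(ref_sheets & out_sheets) / len(ref_sheets) if ref_sheets else 1.0
    if not ref_cells:
        return round(0.3 * sheet_score + 0.7, 4)
    matched = sum(
        1 for key, ref_val in ref_cells.items()
        if key in out_cells and number_close(out_cells[key], ref_val, rel_tol=0.02)
    )
    return round(0.3 * sheet_score + 0.7 * matched / len(ref_cells), 4)


# Hypothetical data: two reference sheets, three reference cells.
ref_sheets = {"Summary", "Data"}
out_sheets = {"Summary"}                       # one of two sheets present
ref_cells = {("Summary", 1, 1): 100.0, ("Summary", 2, 1): 200.0, ("Data", 1, 1): 50.0}
out_cells = {("Summary", 1, 1): 101.0,         # 1% off: counts as a match
             ("Summary", 2, 1): 210.0}         # 5% off: no match; the Data cell is missing
score = modify_score(ref_cells, out_cells, ref_sheets, out_sheets)
print(score)  # 0.3 * 0.5 + 0.7 * (1 / 3), rounded to 4 places
```

Because only reference cells are counted, extra cells written by the agent neither help nor hurt under this rule; only missing or mismatched reference cells lower the score.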

__init__.py ADDED

	@@ -0,0 +1,10 @@

+"""Financial Task Environment — an OpenEnv environment for finance & accounting tasks.
+Covers real-world enterprise workflows including data extraction,
+ratio analysis, reconciliation, valuation, and consolidation.
+"""
+from models import FinancialAction, FinancialObservation
+from client import FinancialTaskEnv
+__all__ = ["FinancialAction", "FinancialObservation", "FinancialTaskEnv"]

client.py ADDED

	@@ -0,0 +1,45 @@

+"""Financial Task Environment client."""
+from __future__ import annotations
+from typing import Any, Dict
+from openenv.core.env_client import EnvClient
+from openenv.core.client_types import StepResult, StateT
+from models import FinancialAction, FinancialObservation
+class FinancialTaskEnv(EnvClient["FinancialAction", "FinancialObservation", StateT]):
+    """Client for connecting to a Financial Task Environment server.
+    Example (async)::
+        async with FinancialTaskEnv(base_url="http://localhost:8000") as env:
+            result = await env.reset(task_id="task_1")
+            print(result.observation.task_description)
+            result = await env.step(FinancialAction(action_type="submit", content="42"))
+            print(result.reward)
+    Example (sync)::
+        with FinancialTaskEnv(base_url="http://localhost:8000").sync() as env:
+            result = env.reset(task_id="task_1")
+            result = env.step(FinancialAction(action_type="submit", content="42"))
+    """
+    def _step_payload(self, action: FinancialAction) -> Dict[str, Any]:
+        return action.model_dump()
+    def _parse_result(self, payload: Dict[str, Any]) -> StepResult[FinancialObservation]:
+        obs = FinancialObservation(**payload)
+        return StepResult(
+            observation=obs,
+            reward=obs.reward if isinstance(obs.reward, (int, float)) else 0.0,
+            done=obs.done,
+        )
+    def _parse_state(self, payload: Dict[str, Any]) -> Any:
+        from openenv.core.env_server.types import State
+        return State(**payload)

data/0/0_ref_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fc273e786bfae6a6bb70f9fac0a5663fcb788344ce0faaec4d5c02392bd7d646
+size 80606

data/0/0_src_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:40397df4ff9ae47d84f071ca01886edcac90a28c2bc0aeb18556264222de807b
+size 79613

data/118/118_src_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fa0b8b943ab728e4b3be39178f4e8b05fd71095706b1e0520149e022c1e40c3f
+size 131652

data/119/119_src_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a3cf63aa836a798cd4f41836ecf4f46c6bd15e5fb3b2025013ed925fea40d47d
+size 40000

data/21/21_ref_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:01f24e077eda9026d5a144cd3b894881f2c57d508c0e4d3d2c3427e42b55eddc
+size 30038

data/21/21_src_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1471369e7905b17e465ee7c64864084af2d8535995880b869a637bb67681aaf5
+size 29119

data/24/24_ref_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f7108dce2da312f65995628a7bc047fbc360e07fca93e04b9cc9eb25ebae34ea
+size 76723

data/24/24_src_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b35cede39defec37318688269cbbc9b1ee30231ab477fa5731dfe0bfc4cb1df0
+size 52512

data/34/34_src_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:839721238c41b2936c6f11f6e200bf715cd159181fc229e2b04a9c0de0f3fc7c
+size 49644

data/35/35_ref_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c6a343703ee8e5e2c9152552d34f92888060d97943dc3b80a549324d3418a043
+size 275054

data/35/35_src_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a2b53bc2e870425d6170250882fa4b46304d666e818cd4eac555dff1be9e02e4
+size 273725

data/40/40_ref_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7ba5ba16d27462a8a073615db567880b90135af68ebde8493767104eb50fceb5
+size 886419

data/40/40_src_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ddefa7bc75f270d9e413453b8f5f7ae1365f4eb6fa251a947b3e3642bc7f3a0f
+size 174723

data/60/60_ref_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:513c0f30941d2002bd2c0142e318132b7ccf5c2868b1cb8cf4a673ef2514307a
+size 48648

data/60/60_src_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cf415dfb4b6683eb90fdbe266e69b45c3ec4c6667b28dbe7768aaccaf7cb0274
+size 43324

data/67/67_ref_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:567a1d5ea8904ef8e3fbbe4282550b100c883400830d9ed2b9b3fcc6d35b6e7d
+size 549697

data/67/67_src_0.xlsx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:36336d97a8e2450d722dadf259e434a6dbe93a4d53a2b1e3c6c0bdc4c289e338
+size 537438
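
The `.xlsx` entries in this commit are Git LFS pointer files (three `key value` lines each), not workbook bytes, so a checkout without LFS content will not open in `openpyxl`. As a rough illustration, the sketch below parses one pointer; `parse_lfs_pointer` is a hypothetical helper, and the sample text is the pointer shown above for `data/67/67_src_0.xlsx`.

```python
from typing import Dict


def parse_lfs_pointer(text: str) -> Dict[str, object]:
    """Parse the `key value` lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:36336d97a8e2450d722dadf259e434a6dbe93a4d53a2b1e3c6c0bdc4c289e338
size 537438
"""
info = parse_lfs_pointer(pointer_text)
print(info["algo"], info["size"])
```

A file that parses this way is still a pointer; comparing its on-disk size with the `size` field is one quick way to tell whether the LFS content has been fetched.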

graders.py ADDED

	@@ -0,0 +1,183 @@

+"""Grading functions for the Financial Task Environment.
+Two grading modes:
+  1. QA tasks  — compare agent text answer against reference text
+                 (numeric extraction + keyword matching)
+  2. MODIFY tasks — compare agent-produced xlsx against reference xlsx
+                    (cell-level comparison with tolerance)
+"""
+from __future__ import annotations
+import re
+import traceback
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Sequence
+import openpyxl
+from openpyxl.utils import get_column_letter
+# ---------------------------------------------------------------------------
+# Numeric helpers
+# ---------------------------------------------------------------------------
+def _extract_numbers(text: str) -> List[float]:
+    """Extract all numeric values from text, handling commas, $, %."""
+    cleaned = text.replace("$", "").replace("€", "").replace("£", "")
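+    # The comma-grouped branch requires at least one ",ddd" group; with zero groups
+    # allowed, ordered alternation would split "1234.5" into 123 and 4.5.
+    pattern = r"-?\d{1,3}(?:,\d{3})+(?:\.\d+)?|-?\d+(?:\.\d+)?"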
+    raw = re.findall(pattern, cleaned)
+    results: List[float] = []
+    for r in raw:
+        try:
+            results.append(float(r.replace(",", "")))
+        except ValueError:
+            continue
+    return results
+def _number_close(actual: float, expected: float, rel_tol: float = 0.05) -> bool:
+    if expected == 0:
+        return abs(actual) < 1e-6
+    return abs(actual - expected) / abs(expected) <= rel_tol
+def _best_number_match(numbers: List[float], target: float, rel_tol: float = 0.05) -> bool:
+    return any(_number_close(n, target, rel_tol) for n in numbers)
+# ---------------------------------------------------------------------------
+# QA grading (text answer)
+# ---------------------------------------------------------------------------
+def grade_qa(answer: str, reference_answer: str) -> float:
+    """Grade a text answer against a reference.  Returns 0.0–1.0."""
+    if not answer.strip():
+        return 0.0
+    ref_nums = _extract_numbers(reference_answer)
+    ans_nums = _extract_numbers(answer)
+    if ref_nums:
+        # Numeric comparison: what fraction of reference numbers appear?
+        matched = sum(1 for r in ref_nums if _best_number_match(ans_nums, r))
+        num_score = matched / len(ref_nums)
+    else:
+        num_score = 0.0
+    # Keyword overlap
+    ref_words = set(re.findall(r"[a-zA-Z]{3,}", reference_answer.lower()))
+    ans_words = set(re.findall(r"[a-zA-Z]{3,}", answer.lower()))
+    if ref_words:
+        kw_score = len(ref_words & ans_words) / len(ref_words)
+    else:
+        kw_score = 0.0
+    # Weighted combination (numbers matter more for financial tasks)
+    if ref_nums:
+        # If all numbers match perfectly, give full score
+        if num_score >= 1.0:
+            return 1.0
+        return round(min(0.8 * num_score + 0.2 * kw_score, 1.0), 4)
+    else:
+        return round(kw_score, 4)
+# ---------------------------------------------------------------------------
+# MODIFY grading (xlsx comparison)
+# ---------------------------------------------------------------------------
+def _load_wb_values(path: str):
+    """Load workbook in data_only mode, return dict of {(sheet, row, col): value}."""
+    wb = openpyxl.load_workbook(path, data_only=True)
+    cells = {}
+    sheets = set()
+    for name in wb.sheetnames:
+        sheets.add(name)
+        ws = wb[name]
+        for row in ws.iter_rows():
+            for cell in row:
+                if cell.value is not None:
+                    cells[(name, cell.row, cell.column)] = cell.value
+    wb.close()
+    return cells, sheets
+def grade_xlsx(output_path: str, reference_path: str) -> float:
+    """Compare agent output xlsx with reference xlsx.  Returns 0.0–1.0.
+    Scoring breakdown:
+      - 30%  sheet-level: does the output have all reference sheets?
+      - 70%  cell-level:  fraction of reference cells matched (with tolerance for numbers)
+    """
+    try:
+        ref_cells, ref_sheets = _load_wb_values(reference_path)
+        out_cells, out_sheets = _load_wb_values(output_path)
+    except Exception:
+        return 0.0
+    # --- Sheet score (30%) ---
+    if ref_sheets:
+        sheet_score = len(ref_sheets & out_sheets) / len(ref_sheets)
+    else:
+        sheet_score = 1.0
+    # --- Cell score (70%) ---
+    if not ref_cells:
+        return round(0.3 * sheet_score + 0.7 * 1.0, 4)
+    matched = 0
+    total = len(ref_cells)
+    for key, ref_val in ref_cells.items():
+        out_val = out_cells.get(key)
+        if out_val is None:
+            continue
+        if ref_val == out_val:
+            matched += 1
+            continue
+        # Numeric tolerance
+        try:
+            rv = float(ref_val)
+            ov = float(out_val)
+            if _number_close(ov, rv, rel_tol=0.02):
+                matched += 1
+                continue
+        except (ValueError, TypeError):
+            pass
+        # String comparison (case-insensitive, whitespace-normalized)
+        try:
+            if str(ref_val).strip().lower() == str(out_val).strip().lower():
+                matched += 1
+        except Exception:
+            pass
+    cell_score = matched / total if total > 0 else 1.0
+    return round(0.3 * sheet_score + 0.7 * cell_score, 4)
+# ---------------------------------------------------------------------------
+# Dispatcher
+# ---------------------------------------------------------------------------
+def grade_task(task: Dict[str, Any], answer: str = "", output_path: str = "") -> float:
+    """Grade a task.  Returns 0.0–1.0.
+    For QA tasks:    uses *answer* (text) vs task["reference_answer"].
+    For MODIFY tasks: uses *output_path* (xlsx) vs task["reference_file"].
+    """
+    task_type = task.get("task_type", "QA")
+    if task_type == "QA":
+        ref = task.get("reference_answer", "")
+        return grade_qa(answer, ref)
+    elif task_type == "MODIFY":
+        ref_path = task.get("reference_file", "")
+        if not output_path or not ref_path:
+            return 0.0
+        if not Path(output_path).exists() or not Path(ref_path).exists():
+            return 0.0
+        return grade_xlsx(output_path, ref_path)
+    else:
+        return 0.0
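
Python's `re` module tries alternation branches left to right and keeps the first one that succeeds, which matters for the kind of number pattern `_extract_numbers` uses. The sketch below contrasts two locally defined patterns, named `LOOSE` and `STRICT` here purely for illustration: one whose comma-grouped branch allows zero groups, and one that requires at least one group before that branch applies.

```python
import re
from typing import List

# First branch allows zero ",ddd" groups, so it also matches plain 1-3 digit runs.
LOOSE = r"-?\d{1,3}(?:,\d{3})*(?:\.\d+)?|-?\d+(?:\.\d+)?"
# First branch requires at least one ",ddd" group; plain numbers fall to branch two.
STRICT = r"-?\d{1,3}(?:,\d{3})+(?:\.\d+)?|-?\d+(?:\.\d+)?"


def extract(pattern: str, text: str) -> List[float]:
    """Find number tokens with the given pattern and convert them to floats."""
    return [float(tok.replace(",", "")) for tok in re.findall(pattern, text)]


plain = "Total: 1234.5"
grouped = "Total: 1,234.50"

print(extract(LOOSE, plain))     # [123.0, 4.5]  (the number is split)
print(extract(STRICT, plain))    # [1234.5]
print(extract(LOOSE, grouped))   # [1234.5]
print(extract(STRICT, grouped))  # [1234.5]
```

The two forms agree on comma-formatted input and differ only on unformatted numbers of four or more digits, which is the common case for values read back from `openpyxl`.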

inference.py ADDED

	@@ -0,0 +1,325 @@

+#!/usr/bin/env python3
+"""Baseline inference script for the Financial Task Environment.
+Runs an LLM agent against the tasks listed in TASK_IDS (a 5-task subset of the
+10 defined tasks).  The agent generates Python code to read/modify Excel
+workbooks, then submits answers or modified files.
+Uses WebSocket for persistent sessions (HTTP endpoints are stateless).
+Environment variables
+─────────────────────
+  API_BASE_URL   LLM API endpoint  (default: https://router.huggingface.co/v1)
+  MODEL_NAME     Model identifier  (default: MiniMaxAI/MiniMax-M2.5)
+  HF_TOKEN       Hugging Face / API key  (required; falls back to API_KEY)
+  ENV_URL        Environment server URL (default: http://localhost:8000)
+"""
+from __future__ import annotations
+import asyncio
+import json
+import os
+import re
+import sys
+import textwrap
+from typing import Any, Dict, List, Optional
+from openai import OpenAI
+# ---------------------------------------------------------------------------
+# Configuration from environment
+# ---------------------------------------------------------------------------
+API_BASE_URL = os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME = os.environ.get("MODEL_NAME", "MiniMaxAI/MiniMax-M2.5")
+HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("API_KEY")
+ENV_URL = os.environ.get("ENV_URL", "http://localhost:8000")
+BENCHMARK = "financial_task_env"
+MAX_STEPS = 10
+TEMPERATURE = 0.0
+MAX_TOKENS = 12000
+TASK_IDS = [
+    "task_1", "task_2", "task_3",  # easy (QA)
+    "task_5", "task_8",              # medium + hard (MODIFY)
+]
+SYSTEM_PROMPT = textwrap.dedent("""\
+You are an expert financial analyst and Python programmer.
+You are working with a real Excel workbook. The file path is given to you.
+CRITICAL RULES:
+1. Do NOT call reset(). Just write plain Python code.
+2. Use the EXACT file path provided. Do not guess paths.
+3. Each code block runs in a FRESH subprocess — you must re-import and re-open
+   the workbook every time. Variables do NOT persist between steps.
+4. Use print() liberally to see data. Read the output carefully before your next step.
+5. You have limited steps. Be efficient — explore in step 1, solve in step 2-3, submit.
+RESPONSE FORMAT — use EXACTLY one of:
+To run Python code:
+```python
+your code here
+```
+To submit a text answer (QA tasks):
+SUBMIT_ANSWER: your answer here
+To submit a modified file (MODIFY tasks):
+SUBMIT_FILE: /path/to/saved.xlsx
+STRATEGY:
+- Step 1: Run code to explore the spreadsheet structure and data
+- Step 2-3: Run code to compute the answer or make modifications
+- Then SUBMIT immediately. Do not waste steps.
+For MODIFY tasks: load the workbook, make changes, save it back to the SAME path,
+then use SUBMIT_FILE with that path.
+""")
+# ---------------------------------------------------------------------------
+# Logging helpers (strict hackathon format)
+# ---------------------------------------------------------------------------
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+    done_val = str(done).lower()
+    error_val = str(error).lower() if error else "none"
+    short_action = action[:120].replace("\n", " ")
+    print(
+        f"[STEP] step={step} action={short_action} reward={reward:.2f} done={done_val} error={error_val}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)
+# ---------------------------------------------------------------------------
+# WebSocket environment interaction
+# ---------------------------------------------------------------------------
+async def ws_send_recv(ws, message: dict) -> dict:
+    """Send a message and receive a response over WebSocket."""
+    await ws.send(json.dumps(message))
+    resp = json.loads(await ws.recv())
+    if resp.get("type") == "error":
+        raise RuntimeError(f"Server error: {resp.get('data', {}).get('message', 'unknown')}")
+    return resp
+async def ws_reset(ws, task_id: str) -> dict:
+    """Reset the environment via WebSocket."""
+    resp = await ws_send_recv(ws, {"type": "reset", "data": {"task_id": task_id}})
+    data = resp.get("data", {})
+    obs = data.get("observation", data)
+    return {
+        "observation": obs,
+        "reward": data.get("reward", 0.0),
+        "done": data.get("done", False),
+    }
+async def ws_step(ws, action_type: str, content: str) -> dict:
+    """Execute a step via WebSocket."""
+    resp = await ws_send_recv(ws, {
+        "type": "step",
+        "data": {"action_type": action_type, "content": content},
+    })
+    data = resp.get("data", {})
+    obs = data.get("observation", data)
+    return {
+        "observation": obs,
+        "reward": data.get("reward", 0.0),
+        "done": data.get("done", False),
+    }
+# ---------------------------------------------------------------------------
+# LLM interaction
+# ---------------------------------------------------------------------------
+def get_model_response(client: OpenAI, messages: List[Dict[str, str]]) -> str:
+    try:
+        completion = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=messages,
+            temperature=TEMPERATURE,
+            max_tokens=MAX_TOKENS,
+            stream=False,
+        )
+        return (completion.choices[0].message.content or "").strip()
+    except Exception as exc:
+        print(f"[DEBUG] Model request failed: {exc}", flush=True)
+        return ""
+def extract_action(response: str):
+    """Parse model response into (action_type, content)."""
+    if "SUBMIT_ANSWER:" in response:
+        answer = response.split("SUBMIT_ANSWER:", 1)[1].strip()
+        return "submit", answer
+    if "SUBMIT_FILE:" in response:
+        path = response.split("SUBMIT_FILE:", 1)[1].strip()
+        return "submit_file", path
+    # Extract code block
+    m = re.search(r"```python\s*\n(.*?)```", response, re.DOTALL)
+    if m:
+        return "code", m.group(1).strip()
+    m = re.search(r"```\s*\n(.*?)```", response, re.DOTALL)
+    if m:
+        code = m.group(1).strip()
+        if "import" in code or "openpyxl" in code or "print" in code:
+            return "code", code
+    # Fallback: if it looks like code, treat as code
+    if response.strip().startswith("import ") or "openpyxl" in response:
+        return "code", response.strip()
+    # Otherwise treat as text answer
+    return "submit", response.strip()
+# ---------------------------------------------------------------------------
+# Main loop
+# ---------------------------------------------------------------------------
+def _to_ws_url(http_url: str) -> str:
+    """Convert http(s):// URL to ws(s):// URL."""
+    return http_url.replace("https://", "wss://").replace("http://", "ws://")
+async def run_task(client: OpenAI, ws_url: str, task_id: str) -> float:
+    import websockets
+    log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
+    rewards: List[float] = []
+    steps_taken = 0
+    final_score = 0.0
+    success = False
+    try:
+        async with websockets.connect(f"{ws_url}/ws", open_timeout=30, max_size=100 * 1024 * 1024) as ws:
+            # Reset
+            reset_data = await ws_reset(ws, task_id)
+            obs = reset_data["observation"]
+            task_desc = obs.get("task_description", "")
+            feedback = obs.get("feedback", "")
+            source_file = obs.get("source_file", "")
+            task_type = obs.get("task_type", "QA")
+            messages = [
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user", "content": (
+                    f"{task_desc}\n\n"
+                    f"Source file path: {source_file}\n"
+                    f"Task type: {task_type}\n\n"
+                    f"{feedback}"
+                )},
+            ]
+            for step_num in range(1, MAX_STEPS + 1):
+                response = get_model_response(client, messages)
+                if not response:
+                    break
+                action_type, content = extract_action(response)
+                messages.append({"role": "assistant", "content": response})
+                step_data = await ws_step(ws, action_type, content)
+                step_obs = step_data["observation"]
+                reward = float(step_data.get("reward") or 0)
+                done = step_data.get("done", False)
+                step_feedback = step_obs.get("feedback", "")
+                rewards.append(reward)
+                steps_taken = step_num
+                log_step(
+                    step=step_num,
+                    action=f"[{action_type}] {content[:80]}",
+                    reward=reward,
+                    done=done,
+                    error=None,
+                )
+                if done:
+                    final_score = reward
+                    success = final_score >= 0.5
+                    break
+                # Feed the execution result back to the LLM
+                remaining = MAX_STEPS - step_num
+                urgency = ""
+                if remaining <= 2:
+                    urgency = f"\n\n⚠ Only {remaining} step(s) remaining! You MUST submit now."
+                    if task_type == "QA":
+                        urgency += " Use: SUBMIT_ANSWER: <your answer>"
+                    else:
+                        urgency += f" Save the file and use: SUBMIT_FILE: {source_file}"
+                messages.append({"role": "user", "content": (
+                    f"Code execution result (step {step_num}/{MAX_STEPS}):\n"
+                    f"{step_feedback}\n\n"
+                    f"Source file: {source_file}{urgency}"
+                )})
+            # Send close
+            try:
+                await ws.send(json.dumps({"type": "close"}))
+            except Exception:
+                pass
+    except Exception as exc:
+        print(f"[DEBUG] Task {task_id} error: {exc}", flush=True)
+        log_step(step=steps_taken + 1, action="error", reward=0.0, done=True, error=str(exc))
+    log_end(success=success, steps=steps_taken, score=final_score, rewards=rewards)
+    return final_score
+async def async_main() -> None:
+    if not API_BASE_URL:
+        print("ERROR: API_BASE_URL not set.", file=sys.stderr)
+        sys.exit(1)
+    if not MODEL_NAME:
+        print("ERROR: MODEL_NAME not set.", file=sys.stderr)
+        sys.exit(1)
+    if not HF_TOKEN:
+        print("ERROR: HF_TOKEN environment variable not set.", file=sys.stderr)
+        sys.exit(1)
+    client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
+    ws_url = _to_ws_url(ENV_URL)
+    all_scores: List[float] = []
+    for task_id in TASK_IDS:
+        print(f"\n{'='*60}\nRunning {task_id}...\n{'='*60}", flush=True)
+        score = await run_task(client, ws_url, task_id)
+        all_scores.append(score)
+        print(f"  -> {task_id} score: {score:.3f}", flush=True)
+    avg = sum(all_scores) / len(all_scores) if all_scores else 0.0
+    print(
+        f"\n{'='*60}\nOVERALL AVERAGE SCORE: {avg:.3f}\n"
+        f"Per-task: {[f'{s:.3f}' for s in all_scores]}\n{'='*60}",
+        flush=True,
+    )
+def main() -> None:
+    asyncio.run(async_main())
+if __name__ == "__main__":
+    main()
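
The feedback loop in `run_task` above appends an urgency notice once two or fewer steps remain. A minimal sketch of that message-building logic as a pure function follows; the name `build_feedback_message` and the default of 15 steps are assumptions for illustration, not part of the commit.

```python
MAX_STEPS = 15  # assumed to match the environment's step limit


def build_feedback_message(step_num, step_feedback, task_type, source_file,
                           max_steps=MAX_STEPS):
    """Build the user message that reports a code result back to the model."""
    remaining = max_steps - step_num
    urgency = ""
    if remaining <= 2:
        # Near the step limit, tell the model which submit command to use
        urgency = f"\n\nOnly {remaining} step(s) remaining! You MUST submit now."
        if task_type == "QA":
            urgency += " Use: SUBMIT_ANSWER: <your answer>"
        else:
            urgency += f" Save the file and use: SUBMIT_FILE: {source_file}"
    return (
        f"Code execution result (step {step_num}/{max_steps}):\n"
        f"{step_feedback}\n\n"
        f"Source file: {source_file}{urgency}"
    )


early = build_feedback_message(3, "ok", "QA", "/tmp/a.xlsx")
late = build_feedback_message(14, "ok", "MODIFY", "/tmp/a.xlsx")
print("MUST submit" in early, "MUST submit" in late)
```

Keeping this logic in a function with no I/O makes the step-limit behaviour easy to unit-test without a running server or model.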

models.py ADDED

	@@ -0,0 +1,44 @@

+"""Typed Pydantic models for the Financial Task Environment."""
+from typing import Any, Dict
+from pydantic import Field
+from openenv.core.env_server.types import Action, Observation, State
+class FinancialAction(Action):
+    """Action model for the Financial Task Environment.
+    Agents interact by executing Python code to read/modify xlsx files,
+    or by submitting a text answer / file path.
+    """
+    action_type: str = Field(
+        description="Action type: 'code' to execute Python, 'submit' for text answer, 'submit_file' for xlsx"
+    )
+    content: str = Field(
+        description="Python code when action_type='code', text answer for 'submit', file path for 'submit_file'"
+    )
+class FinancialObservation(Observation):
+    """Observation model for the Financial Task Environment.
+    Contains the task description, financial data, and feedback from
+    the environment after each action.
+    """
+    task_id: str = Field(default="", description="Current task identifier")
+    task_description: str = Field(default="", description="Task instructions")
+    financial_data: str = Field(default="", description="Financial data / xlsx summary")
+    difficulty: str = Field(default="", description="Task difficulty: easy, medium, or hard")
+    feedback: str = Field(default="", description="Feedback on the last action taken")
+    current_step: int = Field(default=0, description="Current step number in the episode")
+    max_steps: int = Field(default=15, description="Maximum steps allowed per episode")
+    task_type: str = Field(default="", description="Type of financial task: QA or MODIFY")
+    source_file: str = Field(default="", description="Path to the working xlsx file")
+    available_tasks: str = Field(
+        default="",
+        description="Comma-separated list of available task IDs (shown on reset)",
+    )
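
To make the three action types concrete, here is a small sketch that builds one action of each kind. `ActionSketch` is a hypothetical dataclass stand-in for `FinancialAction` so the example runs without `openenv` or Pydantic installed; unlike the real environment, which answers an unknown `action_type` with feedback text, this stand-in raises an error.

```python
import json
from dataclasses import asdict, dataclass

VALID_ACTION_TYPES = ("code", "submit", "submit_file")


@dataclass
class ActionSketch:
    """Hypothetical stand-in for FinancialAction (not the real Pydantic model)."""

    action_type: str
    content: str

    def __post_init__(self):
        # Normalize the same way the environment's step() does
        self.action_type = self.action_type.strip().lower()
        if self.action_type not in VALID_ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")


actions = [
    ActionSketch("code", "import openpyxl; print('hello')"),  # run Python
    ActionSketch("submit", "85"),                             # QA answer
    ActionSketch("submit_file", "output.xlsx"),               # MODIFY result
]
payloads = [json.dumps(asdict(a)) for a in actions]
print(payloads[1])
```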

openenv.yaml ADDED

	@@ -0,0 +1,47 @@

+spec_version: 1
+name: financial_task_env
+type: space
+runtime: fastapi
+app: server.app:app
+port: 8000
+tasks:
+  - id: task_1
+    name: Count Plants in Spreadsheet
+    difficulty: easy
+    max_steps: 15
+    grader:
+      type: programmatic
+      description: "QA grading — extracts numbers from agent answer, compares against reference (85). Score 0.0–1.0 based on numeric match with 5% tolerance."
+  - id: task_2
+    name: Retrieve TW EOL Charge
+    difficulty: easy
+    max_steps: 15
+    grader:
+      type: programmatic
+      description: "QA grading — extracts numbers from agent answer, compares against reference (113291). Score 0.0–1.0 based on numeric match with 5% tolerance."
+  - id: task_3
+    name: Portfolio Mark-to-Market Change
+    difficulty: easy
+    max_steps: 15
+    grader:
+      type: programmatic
+      description: "QA grading — extracts numbers from agent answer, compares against reference values ($1,989,600 and 27.9%). Score 0.0–1.0 based on numeric match + keyword overlap."
+  - id: task_5
+    name: Audit and Correct Formula Errors
+    difficulty: medium
+    max_steps: 15
+    grader:
+      type: programmatic
+      description: "MODIFY grading — compares agent output xlsx cell-by-cell against reference workbook. 30% sheet-name match + 70% cell-level match (2% numeric tolerance). Score 0.0–1.0."
+  - id: task_8
+    name: Balance Sheet Validation and Indicators
+    difficulty: hard
+    max_steps: 15
+    grader:
+      type: programmatic
+      description: "MODIFY grading — compares agent output xlsx cell-by-cell against reference workbook. 30% sheet-name match + 70% cell-level match (2% numeric tolerance). Score 0.0–1.0."
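
The QA grader descriptions above mention extracting numbers from the answer and comparing them to the reference with a 5% tolerance. The following is a rough sketch of that idea only; `graders.py` is not shown in this part of the diff, so the function names, the regex, and the scoring rule (fraction of reference numbers matched) are assumptions rather than the actual grader.

```python
import re


def extract_numbers(text):
    """Pull numeric values out of free text, ignoring thousands separators."""
    cleaned = text.replace(",", "")
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", cleaned)]


def qa_numeric_score(answer, reference, tolerance=0.05):
    """Fraction of reference numbers matched by some number in the answer."""
    ref_nums = extract_numbers(reference)
    ans_nums = extract_numbers(answer)
    if not ref_nums:
        return 0.0
    matched = 0
    for ref in ref_nums:
        allowed = abs(ref) * tolerance  # relative tolerance around the reference
        if any(abs(a - ref) <= allowed for a in ans_nums):
            matched += 1
    return matched / len(ref_nums)


print(qa_numeric_score("There are 85 plants.", "85"))
print(qa_numeric_score("The charge is $113,291", "113291"))
print(qa_numeric_score("I count 120 plants.", "85"))
```

Note that stripping commas also merges comma-separated lists such as "85,90" into one number, which a production grader would need to handle more carefully.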

pyproject.toml ADDED

	@@ -0,0 +1,35 @@

+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[project]
+name = "financial-task-env"
+version = "0.1.0"
+description = "OpenEnv environment for real-world finance & accounting tasks"
+readme = "README.md"
+license = {text = "MIT"}
+requires-python = ">=3.10"
+dependencies = [
+    "openenv-core>=0.2.0",
+    "fastapi>=0.104.0",
+    "uvicorn>=0.24.0",
+    "pydantic>=2.0.0",
+    "websockets>=12.0",
+    "openpyxl>=3.1.0",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest",
+    "httpx",
+    "openai",
+]
+inference = [
+    "openai",
+]
+[project.scripts]
+server = "server.app:main"
+[tool.hatch.build.targets.wheel]
+packages = ["."]

server/Dockerfile ADDED

	@@ -0,0 +1,39 @@

+FROM python:3.11-slim
+WORKDIR /app/env
+# Install system dependencies
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git curl \
+    && rm -rf /var/lib/apt/lists/*
+# Install uv for fast dependency management
+RUN curl -LsSf https://astral.sh/uv/install.sh | sh && \
+    mv /root/.local/bin/uv /usr/local/bin/uv && \
+    mv /root/.local/bin/uvx /usr/local/bin/uvx
+# Copy dependency files first for better caching
+COPY pyproject.toml /app/env/
+# Install dependencies
+RUN uv pip install --system --no-cache-dir \
+    "openenv-core>=0.2.0" \
+    "fastapi>=0.104.0" \
+    "uvicorn>=0.24.0" \
+    "pydantic>=2.0.0" \
+    "websockets>=12.0" \
+    "openpyxl>=3.1.0"
+# Copy environment code
+COPY . /app/env/
+# Set PYTHONPATH so imports work
+ENV PYTHONPATH="/app/env:$PYTHONPATH"
+# Health check
+HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
+EXPOSE 8000
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
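
The `HEALTHCHECK` line above probes `http://localhost:8000/health` with `urllib`. When starting the container from a script, the same probe can be wrapped in a retry loop; the sketch below is one way to do that. The function name `wait_for_health` and its parameters are illustrative assumptions, not part of the commit.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url, retries=3, interval_s=1.0, timeout_s=3.0):
    """Return True once `url` answers HTTP 200, False after `retries` attempts."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet, connection refused, or non-2xx status
        if attempt < retries - 1:
            time.sleep(interval_s)
    return False


# Example (requires a running server):
# wait_for_health("http://localhost:8000/health", retries=10, interval_s=2.0)
```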

server/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Financial Task Environment — server-side implementation."""

server/app.py ADDED

	@@ -0,0 +1,24 @@

+"""FastAPI application for the Financial Task Environment."""
+from openenv.core.env_server.http_server import create_app
+from models import FinancialAction, FinancialObservation
+from server.financial_environment import FinancialEnvironment
+app = create_app(
+    FinancialEnvironment,
+    FinancialAction,
+    FinancialObservation,
+    env_name="financial_task_env",
+)
+def main() -> None:
+    """Entry point for direct execution."""
+    import uvicorn
+    uvicorn.run(app, host="0.0.0.0", port=8000)
+if __name__ == "__main__":
+    main()

server/financial_environment.py ADDED

	@@ -0,0 +1,297 @@

+"""Financial Task Environment — core environment logic.
+A code-execution environment where the agent writes Python code (using openpyxl)
+to read, analyze, and modify real Excel workbooks from enterprise finance workflows.
+For QA tasks: the agent reads the xlsx and submits a text answer.
+For MODIFY tasks: the agent writes code that modifies the xlsx, then the result
+is compared cell-by-cell against a reference workbook.
+"""
+from __future__ import annotations
+import os
+import shutil
+import subprocess
+import sys
+import tempfile
+from pathlib import Path
+from typing import Any, Optional
+from uuid import uuid4
+import openpyxl
+from openenv.core.env_server.interfaces import Environment
+from openenv.core.env_server.types import State
+from models import FinancialAction, FinancialObservation
+from tasks import TASK_IDS, get_task
+from graders import grade_task
+class FinancialEnvironment(Environment):
+    """OpenEnv environment for financial spreadsheet tasks with code execution.
+    Episode flow
+    ────────────
+    1. ``reset(task_id="task_1")`` → observation with task info + xlsx summary.
+    2. ``step(action_type="code", content="import openpyxl; ...")`` → execute code, get stdout.
+    3. ``step(action_type="submit", content="answer text")`` → grade and end episode.
+       *or* for MODIFY tasks:
+       ``step(action_type="submit_file", content="<path>")`` → grade xlsx and end.
+    The episode also ends when *max_steps* is reached.
+    """
+    MAX_STEPS = 15
+    def __init__(self) -> None:
+        super().__init__()
+        self._state = State(episode_id=str(uuid4()), step_count=0)
+        self._current_task: dict[str, Any] | None = None
+        self._done = False
+        self._cumulative_reward = 0.0
+        self._workdir: str | None = None
+    # ------------------------------------------------------------------
+    # reset
+    # ------------------------------------------------------------------
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        episode_id: Optional[str] = None,
+        **kwargs: Any,
+    ) -> FinancialObservation:
+        task_id: str = kwargs.get("task_id", "task_1")
+        self._current_task = get_task(task_id)
+        self._state = State(
+            episode_id=episode_id or str(uuid4()),
+            step_count=0,
+        )
+        self._done = False
+        self._cumulative_reward = 0.0
+        # Remove any working directory left over from a previous episode,
+        # then create a fresh one and copy the source xlsx into it
+        self.close()
+        self._workdir = tempfile.mkdtemp(prefix=f"financial_env_{task_id}_")
+        src = self._current_task.get("source_file", "")
+        if src and Path(src).exists():
+            shutil.copy2(src, self._workdir)
+            work_file = str(Path(self._workdir) / Path(src).name)
+        else:
+            work_file = ""
+        # Generate an xlsx summary to include in the observation
+        xlsx_summary = self._summarize_xlsx(work_file) if work_file else "No source file."
+        task = self._current_task
+        task_info = (
+            f"Task: {task['title']}\n"
+            f"Difficulty: {task['difficulty']}\n"
+            f"Type: {task['task_type']} ({task['category']})\n\n"
+            f"Instruction:\n{task['instruction']}\n"
+        )
+        if task.get("constraints"):
+            task_info += f"\nConstraints:\n{task['constraints']}\n"
+        task_info += (
+            f"\nSource file: {work_file}\n"
+            f"\nSpreadsheet Summary:\n{xlsx_summary}\n\n"
+            "Actions:\n"
+            "  action_type='code'    → Execute Python code (openpyxl available).\n"
+            "                          The working file path is in the source_file field.\n"
+            "  action_type='submit'  → Submit a text answer (QA tasks).\n"
+            "  action_type='submit_file' → Submit a modified xlsx path (MODIFY tasks).\n"
+        )
+        return FinancialObservation(
+            done=False,
+            reward=0.0,
+            task_id=task["id"],
+            task_description=task_info,
+            financial_data=xlsx_summary,
+            difficulty=task["difficulty"],
+            task_type=task["task_type"],
+            feedback="Environment reset. Read the spreadsheet and task instructions carefully.",
+            current_step=0,
+            max_steps=self.MAX_STEPS,
+            available_tasks=",".join(TASK_IDS),
+            source_file=work_file,
+        )
+    # ------------------------------------------------------------------
+    # step
+    # ------------------------------------------------------------------
+    def step(
+        self,
+        action: FinancialAction,
+        timeout_s: Optional[float] = None,
+        **kwargs: Any,
+    ) -> FinancialObservation:
+        self._state.step_count += 1
+        if self._current_task is None:
+            return self._obs(feedback="No task loaded. Call reset() first.", reward=0.0, done=True)
+        if self._done:
+            return self._obs(feedback="Episode already finished. Call reset().", reward=0.0, done=True)
+        action_type = action.action_type.strip().lower()
+        if action_type == "code":
+            return self._handle_code(action.content)
+        elif action_type == "submit":
+            return self._handle_submit_text(action.content)
+        elif action_type == "submit_file":
+            return self._handle_submit_file(action.content)
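+        else:
+            # Unknown actions still consume a step, so enforce the step limit here too
+            at_limit = self._state.step_count >= self.MAX_STEPS
+            if at_limit:
+                self._done = True
+            return self._obs(
+                feedback=f"Unknown action_type '{action.action_type}'. Use 'code', 'submit', or 'submit_file'.",
+                reward=0.0, done=at_limit,
+            )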
+    # ------------------------------------------------------------------
+    # state property
+    # ------------------------------------------------------------------
+    @property
+    def state(self) -> State:
+        return self._state
+    # ------------------------------------------------------------------
+    # Code execution
+    # ------------------------------------------------------------------
+    def _handle_code(self, code: str) -> FinancialObservation:
+        """Execute Python code in a subprocess and return stdout/stderr."""
+        if not self._workdir:
+            return self._obs(feedback="No working directory. Call reset() first.", reward=0.0, done=False)
+        # Small fixed reward for every code step, granted whether or not the code succeeds
+        reward = 0.02
+        self._cumulative_reward += reward
+        try:
+            result = subprocess.run(
+                [sys.executable, "-c", code],
+                capture_output=True,
+                text=True,
+                timeout=30,
+                cwd=self._workdir,
+                env={**os.environ, "PYTHONDONTWRITEBYTECODE": "1"},
+            )
+            stdout = result.stdout[:4000] if result.stdout else ""
+            stderr = result.stderr[:2000] if result.stderr else ""
+            if result.returncode == 0:
+                feedback = f"Code executed successfully.\n\nSTDOUT:\n{stdout}"
+                if stderr:
+                    feedback += f"\n\nSTDERR:\n{stderr}"
+            else:
+                feedback = f"Code execution failed (exit code {result.returncode}).\n\nSTDERR:\n{stderr}"
+                if stdout:
+                    feedback += f"\n\nSTDOUT:\n{stdout}"
+        except subprocess.TimeoutExpired:
+            feedback = "Code execution timed out (30s limit)."
+        except Exception as e:
+            feedback = f"Code execution error: {e}"
+        at_limit = self._state.step_count >= self.MAX_STEPS
+        if at_limit:
+            self._done = True
+            feedback += "\n\n⚠ Maximum steps reached — episode ending."
+        return self._obs(feedback=feedback, reward=reward, done=at_limit)
+    # ------------------------------------------------------------------
+    # Submit handlers
+    # ------------------------------------------------------------------
+    def _handle_submit_text(self, answer: str) -> FinancialObservation:
+        """Grade a text answer (for QA tasks)."""
+        task = self._current_task
+        assert task is not None
+        score = grade_task(task, answer=answer)
+        self._done = True
+        self._cumulative_reward += score
+        quality = "Excellent" if score >= 0.9 else "Good" if score >= 0.7 else "Partial" if score >= 0.4 else "Needs improvement"
+        return self._obs(
+            feedback=f"Answer graded. Score: {score:.2f}/1.00 — {quality}.\nCumulative reward: {self._cumulative_reward:.2f}",
+            reward=score, done=True,
+        )
+    def _handle_submit_file(self, file_path: str) -> FinancialObservation:
+        """Grade a modified xlsx file (for MODIFY tasks)."""
+        task = self._current_task
+        assert task is not None
+        # Resolve relative paths against workdir
+        p = Path(file_path)
+        if not p.is_absolute() and self._workdir:
+            p = Path(self._workdir) / p
+        if not p.exists():
+            self._done = True
+            return self._obs(
+                feedback=f"File not found: {p}. Score: 0.00",
+                reward=0.0, done=True,
+            )
+        score = grade_task(task, output_path=str(p))
+        self._done = True
+        self._cumulative_reward += score
+        quality = "Excellent" if score >= 0.9 else "Good" if score >= 0.7 else "Partial" if score >= 0.4 else "Needs improvement"
+        return self._obs(
+            feedback=f"File graded. Score: {score:.2f}/1.00 — {quality}.\nCumulative reward: {self._cumulative_reward:.2f}",
+            reward=score, done=True,
+        )
+    # ------------------------------------------------------------------
+    # Helpers
+    # ------------------------------------------------------------------
+    def _summarize_xlsx(self, path: str) -> str:
+        """Return a text summary of an xlsx file (sheet names, dimensions, sample data)."""
+        try:
+            wb = openpyxl.load_workbook(path, data_only=True, read_only=True)
+            lines = [f"Workbook: {Path(path).name}", f"Sheets: {wb.sheetnames}", ""]
+            for name in wb.sheetnames[:5]:  # Limit to 5 sheets
+                ws = wb[name]
+                lines.append(f"--- Sheet: {name} (rows≈{ws.max_row}, cols≈{ws.max_column}) ---")
+                # Show the first 8 rows, at most 12 columns, 30 characters per cell
+                for row in ws.iter_rows(max_row=8, values_only=True):
+                    vals = [str(v)[:30] if v is not None else "" for v in row[:12]]
+                    lines.append("  " + " | ".join(vals))
+                if ws.max_row and ws.max_row > 8:
+                    lines.append(f"  ... ({ws.max_row - 8} more rows)")
+                lines.append("")
+            wb.close()
+            return "\n".join(lines)
+        except Exception as e:
+            return f"Could not read xlsx: {e}"
+    def _obs(self, *, feedback: str, reward: float, done: bool) -> FinancialObservation:
+        task = self._current_task or {}
+        work_file = ""
+        if self._workdir and task.get("source_file"):
+            work_file = str(Path(self._workdir) / Path(task["source_file"]).name)
+        return FinancialObservation(
+            done=done,
+            reward=reward,
+            task_id=task.get("id", ""),
+            task_description=task.get("instruction", ""),
+            financial_data="",
+            difficulty=task.get("difficulty", ""),
+            task_type=task.get("task_type", ""),
+            feedback=feedback,
+            current_step=self._state.step_count,
+            max_steps=self.MAX_STEPS,
+            available_tasks=",".join(TASK_IDS),
+            source_file=work_file,
+        )
+    def close(self) -> None:
+        """Clean up the temporary working directory."""
+        if self._workdir and Path(self._workdir).exists():
+            shutil.rmtree(self._workdir, ignore_errors=True)
+        self._workdir = None
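
The `_handle_code` method above runs agent code in a subprocess with a timeout and truncated output. The standalone sketch below isolates that pattern so it can be tried outside the environment; `run_snippet` and its return shape are illustrative assumptions, not the environment's API. Note that `subprocess.run` alone provides no sandboxing: the snippet runs with the same privileges as the caller.

```python
import subprocess
import sys
import tempfile


def run_snippet(code, workdir, timeout_s=30, max_stdout=4000, max_stderr=2000):
    """Run `code` in a fresh Python subprocess and return (ok, feedback)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            cwd=workdir,
        )
    except subprocess.TimeoutExpired:
        return False, f"Code execution timed out ({timeout_s}s limit)."
    # Truncate output so a chatty script cannot flood the observation
    stdout = result.stdout[:max_stdout]
    stderr = result.stderr[:max_stderr]
    if result.returncode == 0:
        return True, f"STDOUT:\n{stdout}"
    return False, f"Exit code {result.returncode}.\nSTDERR:\n{stderr}"


with tempfile.TemporaryDirectory() as workdir:
    ok, feedback = run_snippet("print(1 + 1)", workdir)
    ok_err, feedback_err = run_snippet("raise ValueError('boom')", workdir)
print(ok, ok_err)
```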

tasks.py ADDED

	@@ -0,0 +1,284 @@

+"""Task definitions for the Financial Task Environment.
+Contains 10 tasks backed by real Excel workbooks covering diverse enterprise
+finance & accounting workflows (QA, calculation, validation, data entry,
+formatting, modeling, consolidation).  Each task ships a source .xlsx that
+the agent must read or modify via Python code execution.
+"""
+from __future__ import annotations
+import os
+from pathlib import Path
+from typing import Any, Dict, List
+# Base directory where xlsx files live (data/<orig_id>/, named after each task's original dataset ID)
+DATA_DIR = Path(os.environ.get("FINANCIAL_ENV_DATA_DIR", Path(__file__).parent / "data"))
+TASKS: Dict[str, Dict[str, Any]] = {}
+# ---------------------------------------------------------------------------
+# Helper to build source / reference paths
+# ---------------------------------------------------------------------------
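+def _paths(task_id: str, src: str, ref: str | None = None) -> Dict[str, Any]:
+    """Return dict with resolved source and optional reference paths.
+    Here ``task_id`` is the original dataset ID (the task's ``orig_id``),
+    which names the subdirectory under DATA_DIR.
+    """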
+    d: Dict[str, Any] = {
+        "source_file": str(DATA_DIR / task_id / src),
+    }
+    if ref:
+        d["reference_file"] = str(DATA_DIR / task_id / ref)
+    return d
+# ── EASY ──────────────────────────────────────────────────────────────────
+# Task 1 — QA: count rows (Calculation)
+TASKS["task_1"] = {
+    "id": "task_1",
+    "orig_id": "119",
+    "title": "Count Plants in Spreadsheet",
+    "difficulty": "easy",
+    "task_type": "QA",
+    "category": "Calculation",
+    "instruction": "How many plants are recorded in the spreadsheet?",
+    "constraints": "",
+    "reference_answer": "85",
+    **_paths("119", "119_src_0.xlsx"),
+}
+# Task 2 — QA: value retrieval (Cross-sheet Retrieval)
+TASKS["task_2"] = {
+    "id": "task_2",
+    "orig_id": "118",
+    "title": "Retrieve TW EOL Charge",
+    "difficulty": "easy",
+    "task_type": "QA",
+    "category": "Cross-sheet/file Retrieval",
+    "instruction": "What is the TW EOL charge for 2002? Please provide just the amount.",
+    "constraints": "",
+    "reference_answer": "113291",
+    **_paths("118", "118_src_0.xlsx"),
+}
+# Task 3 — QA: multi-step calculation (Calculation)
+TASKS["task_3"] = {
+    "id": "task_3",
+    "orig_id": "34",
+    "title": "Portfolio Mark-to-Market Change",
+    "difficulty": "easy",
+    "task_type": "QA",
+    "category": "Calculation",
+    "instruction": (
+        "Assume the following changes occur in the Jul\u2013Dec 2002 market: "
+        "Flat curve prices increase uniformly by $2/MWh; Peak 6x16 curve prices "
+        "increase uniformly by $5/MWh; monthly contract volumes (Flat and Peak "
+        "Total MWh) remain unchanged. Based on the 2002 table, calculate: "
+        "(1) the total added value (mark-to-market change) for the combined "
+        "Flat + Peak portfolio; and (2) what percentage of this added value "
+        "comes from the Peak 6x16 contracts rather than the Flat contracts."
+    ),
+    "constraints": "",
+    "reference_answer": (
+        "The total added value of the July\u2013December 2002 portfolio is "
+        "$1,989,600 (in absolute terms). Of this amount, approximately 27.9% "
+        "(about 28%) comes from the Peak 6x16 contracts, with the remaining "
+        "~72.1% coming from the Flat contracts."
+    ),
+    **_paths("34", "34_src_0.xlsx"),
+}
+# ── MEDIUM ────────────────────────────────────────────────────────────────
+# Task 4 — Modify: summarise imbalances (Calculation + modify)
+TASKS["task_4"] = {
+    "id": "task_4",
+    "orig_id": "35",
+    "title": "Summarize Pipeline Imbalances",
+    "difficulty": "medium",
+    "task_type": "MODIFY",
+    "category": "Calculation",
+    "instruction": (
+        "Summarize the volume and dollar imbalances that exist between the "
+        "various pipeline operators (Operators) and Transwestern."
+    ),
+    "constraints": (
+        "You will be given an Excel file as input. Perform all required "
+        "operations by modifying the existing workbook. You may add new sheets "
+        "if necessary, but you must preserve all original sheets and their "
+        "contents. Return the full updated workbook."
+    ),
+    **_paths("35", "35_src_0.xlsx", "35_ref_0.xlsx"),
+}
+# Task 5 — Modify: audit & fix formulas (Validation / Review)
+TASKS["task_5"] = {
+    "id": "task_5",
+    "orig_id": "40",
+    "title": "Audit and Correct Formula Errors",
+    "difficulty": "medium",
+    "task_type": "MODIFY",
+    "category": "Validation / Review, Calculation",
+    "instruction": (
+        "Audit the workbook and correct the formula errors in place so numbers "
+        "calculate properly."
+    ),
+    "constraints": (
+        "You will be given an Excel file as input. Perform all required "
+        "operations by modifying the existing workbook. You may add new sheets "
+        "if necessary, but you must preserve all original sheets and their "
+        "contents. Return the full updated workbook."
+    ),
+    **_paths("40", "40_src_0.xlsx", "40_ref_0.xlsx"),
+}
+# Task 6 — Modify: create table + filter (Structuring / Formatting)
+TASKS["task_6"] = {
+    "id": "task_6",
+    "orig_id": "60",
+    "title": "Create Table and Apply Filter",
+    "difficulty": "medium",
+    "task_type": "MODIFY",
+    "category": "Structuring / Formatting",
+    "instruction": (
+        "On the All Natural Gas sheet, create an Excel table and filter to "
+        "show only the COUNTERPARTY entries highlighted in red."
+    ),
+    "constraints": (
+        "You will be given an Excel file as input. Perform all required "
+        "operations by modifying the existing workbook. You may add new sheets "
+        "if necessary, but you must preserve all original sheets and their "
+        "contents. Return the full updated workbook."
+    ),
+    **_paths("60", "60_src_0.xlsx", "60_ref_0.xlsx"),
+}
+# Task 7 — Modify: data entry + formatting (Data Entry / Import)
+TASKS["task_7"] = {
+    "id": "task_7",
+    "orig_id": "21",
+    "title": "Add Weekday Row and Data Entry",
+    "difficulty": "medium",
+    "task_type": "MODIFY",
+    "category": "Data Entry / Import, Structuring / Formatting",
+    "instruction": (
+        "Add a weekday line directly below the date headers and update the "
+        "12/31/2001 (Mon) column. For that day, there are no \"Receipts\"; "
+        "record disbursements of $1,980,800 to Calpine (Power Purchases) and "
+        "$100,000 to an unspecified vendor (Gas Purchases). Under Enron Facility "
+        "Services, enter $3,101,855 for \"$2.5 per day\" and -$2,081,386 for "
+        "\"estimate receipt\"; in Personnel, EES is $584,500; leave all other "
+        "items as \"-\"."
+    ),
+    "constraints": (
+        "You will be given an Excel file as input. Perform all required "
+        "operations by modifying the existing workbook. You may add new sheets "
+        "if necessary, but you must preserve all original sheets and their "
+        "contents. Return the full updated workbook."
+    ),
+    **_paths("21", "21_src_0.xlsx", "21_ref_0.xlsx"),
+}
+# ── HARD ──────────────────────────────────────────────────────────────────
+# Task 8 — Modify: balance-sheet validation + indicator calcs
+TASKS["task_8"] = {
+    "id": "task_8",
+    "orig_id": "0",
+    "title": "Balance Sheet Validation and Indicators",
+    "difficulty": "hard",
+    "task_type": "MODIFY",
+    "category": "Validation / Review, Calculation, Structuring / Formatting",
+    "instruction": (
+        "Complete the validation and indicator calculations as follows: on the "
+        "Balance Sheet, add a control to ensure TOTAL ASSETS equals TOTAL "
+        "LIABILITIES AND EQUITY; on the Income Statement (Revenue & Expenses), "
+        "add an Equity Roll Forward Test to reconcile equity movement and "
+        "highlight any differences."
+    ),
+    "constraints": (
+        "You will be given an Excel file as input. Perform all required "
+        "operations by modifying the existing workbook. You may add new sheets "
+        "if necessary, but you must preserve all original sheets and their "
+        "contents. Return the full updated workbook."
+    ),
+    **_paths("0", "0_src_0.xlsx", "0_ref_0.xlsx"),
+}
+# Task 9 — Modify: add new sheet mirroring structure (Financial Modeling)
+TASKS["task_9"] = {
+    "id": "task_9",
+    "orig_id": "24",
+    "title": "Create Scenario3 Worksheet",
+    "difficulty": "hard",
+    "task_type": "MODIFY",
+    "category": "Structuring / Formatting, Financial Modeling",
+    "instruction": (
+        'Add a new worksheet named "Scenario3" to the same workbook, mirroring '
+        "the structure, row/column layout, monthly detail table, and chart area "
+        'of "Scenario1". For Scenario3, update the hedging assumptions to a '
+        "balanced allocation: 10-Yr 25%, 5-Yr 20%, 1-Yr 15%, May-Sep 20%, "
+        "Q3 15%. Keep the note \"Maximum Monthly Average Short Position to "
+        'Cover (July Peak) = 30,508 MW" unchanged; only the new sheet should '
+        "be added, and formulas may be used within it."
+    ),
+    "constraints": (
+        "You will be given an Excel file as input. Perform all required "
+        "operations by modifying the existing workbook. You may add new sheets "
+        "if necessary, but you must preserve all original sheets and their "
+        "contents. Return the full updated workbook."
+    ),
+    **_paths("24", "24_src_0.xlsx", "24_ref_0.xlsx"),
+}
+# Task 10 — Modify: cross-sheet consolidation (multi-type)
+TASKS["task_10"] = {
+    "id": "task_10",
+    "orig_id": "67",
+    "title": "Consolidate by Type and Area",
+    "difficulty": "hard",
+    "task_type": "MODIFY",
+    "category": "Structuring / Formatting, Calculation, Validation / Review, Cross-sheet Retrieval",
+    "instruction": (
+        "Create a new 'by type_area' worksheet based on the Summary and the "
+        "other tabs. It should present two separate tables summarized by "
+        "Imbal Type; within each table, consolidate by area, include Volume, "
+        "Value and Date, and calculate totals. Finally, confirm that the value "
+        "and volume totals tie to the totals shown on the Summary."
+    ),
+    "constraints": (
+        "You will be given an Excel file as input. Perform all required "
+        "operations by modifying the existing workbook. You may add new sheets "
+        "if necessary, but you must preserve all original sheets and their "
+        "contents. Return the full updated workbook."
+    ),
+    **_paths("67", "67_src_0.xlsx", "67_ref_0.xlsx"),
+}
+# ---------------------------------------------------------------------------
+# Helper accessors
+# ---------------------------------------------------------------------------
+TASK_IDS: List[str] = sorted(TASKS.keys(), key=lambda x: int(x.split("_")[1]))
+def get_task(task_id: str) -> Dict[str, Any]:
+    """Return a task dict by ID or raise KeyError."""
+    if task_id not in TASKS:
+        raise KeyError(
+            f"Unknown task_id '{task_id}'. Available: {', '.join(TASK_IDS)}"
+        )
+    return TASKS[task_id]
+def list_tasks() -> List[Dict[str, str]]:
+    """Return a summary list of all tasks."""
+    return [
+        {
+            "id": t["id"],
+            "title": t["title"],
+            "difficulty": t["difficulty"],
+            "task_type": t["task_type"],
+            "category": t["category"],
+        }
+        for t in (TASKS[tid] for tid in TASK_IDS)
+    ]
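
`TASK_IDS` above sorts on the integer suffix rather than on the raw string. A short illustration of why that matters once IDs reach two digits (the ID list here is a made-up subset):

```python
ids = ["task_10", "task_2", "task_1", "task_9"]

# Plain string sort compares character by character, so "task_10" < "task_2"
plain = sorted(ids)

# Sorting on the numeric suffix gives the intended task order
numeric = sorted(ids, key=lambda x: int(x.split("_")[1]))

print(plain)
print(numeric)
```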

uv.lock ADDED

The diff for this file is too large to render.