bpHigh committed
Commit bf77949 · 1 parent: 6db9bed

Update readme

Files changed (3)
  1. Dockerfile +1 -1
  2. README.md +99 -8
  3. inference.py +21 -3
Dockerfile CHANGED

```diff
@@ -31,4 +31,4 @@ HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
 
 EXPOSE 8000
 
-CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000", "--ws-ping-interval", "60", "--ws-ping-timeout", "60"]
```
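The new `--ws-ping-interval` / `--ws-ping-timeout` flags widen the WebSocket keepalive window so long model-generation pauses don't drop the connection. A rough sketch of the arithmetic, assuming `websockets`-style keepalive semantics where a silent peer is closed at most one ping interval plus one ping timeout after its last frame:

```python
def max_silent_window(ping_interval: float, ping_timeout: float) -> float:
    """Worst-case seconds a completely silent peer stays connected:
    up to ping_interval until the next ping is sent, then ping_timeout
    waiting for the pong before the connection is closed."""
    return ping_interval + ping_timeout

# The websockets library defaults to 20 s / 20 s; the values in this
# commit stretch the tolerated silence from roughly 40 s to 120 s.
print(max_silent_window(60, 60))
```

Two minutes comfortably covers a slow LLM generation step, which is the failure mode this commit is working around.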
README.md CHANGED

```diff
@@ -87,13 +87,24 @@ the same kind of work.
 
 | Action | Reward | Signal |
 |--------|--------|--------|
-| `code` | 0.02 | Small reward for active exploration |
-| `submit` / `submit_file` | 0.0–1.0 | Graded against reference |
+| `code` (failed) | 0.005 | Penalized — syntax/runtime error |
+| `code` (simple) | ~0.02 | Minimal — just imports and a print |
+| `code` (exploration) | ~0.05 | Good — reading data, producing output |
+| `code` (modification + save) | ~0.06–0.10 | Best — actively editing the workbook |
+| `submit` / `submit_file` | 0.001–0.999 | Full grading against reference |
 | Max steps (15) | Episode ends | |
 
+Code step rewards are computed from:
+- **Execution success** — failed code gets only 0.005
+- **Substantive lines** — lines beyond imports/comments earn +0.002 each (up to +0.03)
+- **Output produced** — printing data earns +0.001 per line (up to +0.02)
+- **Save operations** — calling `.save()` earns +0.03 (agent is modifying the workbook)
+
 **QA grading:** Numeric extraction with 5% tolerance + keyword overlap.
 **MODIFY grading:** 30% sheet-name match + 70% cell-level comparison (2% numeric tolerance).
 
+All scores are clamped to the open interval (0.001, 0.999).
+
 ## Setup & Usage
 
 ### Prerequisites
@@ -128,11 +139,91 @@ python inference.py
 
 ## Baseline Scores
 
-| Difficulty | Type | Expected Range |
-|------------|------|---------------|
-| Easy | QA | 0.60 – 1.00 |
-| Medium | MODIFY | 0.30 – 0.80 |
-| Hard | MODIFY | 0.10 – 0.60 |
+The environment includes 10 tasks, but the baseline inference runs 5 representative
+tasks (3 easy + 1 medium + 1 hard) to stay within the 20-minute runtime constraint.
+
+**Model:** `MiniMaxAI/MiniMax-M2.5` via HuggingFace Router
+
+| Task | Difficulty | Type | Score | Step Rewards |
+|------|------------|------|-------|-------------|
+| task_1 — Count Plants | Easy | QA | 0.001 | 0.05, 0.06, 0.06, 0.06, 0.00 |
+| task_2 — Retrieve EOL Charge | Easy | QA | 0.001 | 0.04, 0.01, 0.07, 0.06, 0.02, 0.00 |
+| task_3 — Portfolio MTM Change | Easy | QA | 0.367 | 0.06, 0.01, 0.07, ..., 0.37 |
+| task_5 — Audit Formulas | Medium | MODIFY | **0.958** | 0.07, 0.01, 0.07, ..., 0.96 |
+| task_8 — Balance Sheet Validation | Hard | MODIFY | 0.001 | 0.06, 0.01, 0.06, ..., 0.05 |
+| **Average** | | | **0.266** | |
+
+**Runtime:** 12 min 10 sec (limit: 20 min) · **Server memory:** ~40 MB (limit: 8 GB)
+
+Note: Step rewards vary based on code quality — failed code gets 0.005, exploration
+~0.05, modification+save ~0.06–0.10.
+
+### Run 2 — `google/gemma-4-26B-A4B-it`
+
+| Task | Difficulty | Type | Score |
+|------|------------|------|-------|
+| task_1 — Count Plants | Easy | QA | 0.001 |
+| task_2 — Retrieve EOL Charge | Easy | QA | **0.999** |
+| task_3 — Portfolio MTM Change | Easy | QA | 0.001 |
+| task_5 — Audit Formulas | Medium | MODIFY | 0.001 |
+| task_8 — Balance Sheet Validation | Hard | MODIFY | 0.001 |
+| **Average** | | | **0.201** |
+
+**Runtime:** 19 min 27 sec (limit: 20 min) · **Server memory:** ~40 MB
+
+Gemma 4 26B solved task_2 perfectly in just 2 steps but timed out on more
+complex tasks due to longer generation times.
+
+### Run 3 — `Qwen/Qwen3.5-122B-A10B`
+
+| Task | Difficulty | Type | Score |
+|------|------------|------|-------|
+| task_1 — Count Plants | Easy | QA | 0.001 |
+| task_2 — Retrieve EOL Charge | Easy | QA | **0.999** |
+| task_3 — Portfolio MTM Change | Easy | QA | 0.001 |
+| task_5 — Audit Formulas | Medium | MODIFY | 0.001 |
+| task_8 — Balance Sheet Validation | Hard | MODIFY | 0.001 |
+| **Average** | | | **0.201** |
+
+**Runtime:** 2 min 11 sec · Fast inference but hit per-task timeout on complex tasks.
+
+### Run 4 — `deepseek-ai/DeepSeek-R1`
+
+| Task | Difficulty | Type | Score |
+|------|------------|------|-------|
+| task_1 — Count Plants | Easy | QA | 0.001 |
+| task_2 — Retrieve EOL Charge | Easy | QA | 0.001 |
+| task_3 — Portfolio MTM Change | Easy | QA | 0.001 |
+| task_5 — Audit Formulas | Medium | MODIFY | 0.001 |
+| task_8 — Balance Sheet Validation | Hard | MODIFY | 0.001 |
+| **Average** | | | **0.001** |
+
+**Runtime:** 11 min 57 sec · DeepSeek-R1's long chain-of-thought reasoning consumed
+most of the output tokens, leaving answers that didn't parse correctly.
+
+### Run 5 — `MiniMaxAI/MiniMax-M2.1` (Best)
+
+| Task | Difficulty | Type | Score | Steps |
+|------|------------|------|-------|-------|
+| task_1 — Count Plants | Easy | QA | 0.001 | 5 |
+| task_2 — Retrieve EOL Charge | Easy | QA | **0.999** | 4 |
+| task_3 — Portfolio MTM Change | Easy | QA | 0.001 | 10 |
+| task_5 — Audit Formulas | Medium | MODIFY | **0.958** | 4 |
+| task_8 — Balance Sheet Validation | Hard | MODIFY | **0.733** | 10 |
+| **Average** | | | **0.538** | |
+
+**Runtime:** 3 min 18 sec · Best overall performance — solved 3/5 tasks with high
+scores including the hard MODIFY task (0.733). Fast and efficient.
+
+### Model Comparison Summary
+
+| Model | Avg Score | Runtime | Best Task |
+|-------|-----------|---------|-----------|
+| **MiniMax-M2.1** | **0.538** | **3m 18s** | task_5: 0.958, task_8: 0.733 |
+| MiniMax-M2.5 | 0.266 | 12m 10s | task_5: 0.958 |
+| Gemma 4 26B | 0.201 | 19m 27s | task_2: 0.999 |
+| Qwen 3.5 122B | 0.201 | 2m 11s | task_2: 0.999 |
+| DeepSeek-R1 | 0.001 | 11m 57s | — |
 
 ## Project Structure
 
@@ -178,7 +269,7 @@ This environment models real financial spreadsheet work:
 - **Consolidation** — aggregate data across sheets into summary views
 
 Each task uses a genuine enterprise Excel workbook. MODIFY tasks are graded
-by cell-level comparison against a reference workbook.
+by spreadsheet properties comparison against a reference workbook.
 
 ## Acknowledgments
 
```
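The code-step reward bullets in the README diff describe a simple additive heuristic. A minimal sketch of how such a scorer could look, assuming a flat base reward of about 0.02 for any successful code step (the base value, function name, and line-classification rules are illustrative assumptions, not the environment's actual implementation):

```python
def score_code_step(code: str, stdout_lines: int, failed: bool) -> float:
    """Hypothetical reconstruction of the code-step reward heuristic."""
    # Failed code gets only the floor-level reward.
    if failed:
        return 0.005
    reward = 0.02  # assumed base for a successful code step
    # Substantive lines: anything beyond blanks, comments, and imports.
    substantive = [
        ln for ln in code.splitlines()
        if ln.strip() and not ln.strip().startswith(("#", "import ", "from "))
    ]
    reward += min(0.002 * len(substantive), 0.03)  # +0.002 each, capped at +0.03
    reward += min(0.001 * stdout_lines, 0.02)      # +0.001 per output line, capped at +0.02
    if ".save(" in code:                           # workbook-modification bonus
        reward += 0.03
    return min(max(reward, 0.001), 0.999)          # clamp to (0.001, 0.999)
```

With these weights a pure-import script scores 0.02, exploratory reads with printed output land near 0.05, and a save-heavy edit tops out at 0.10, which matches the ranges in the action table above.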
inference.py CHANGED

```diff
@@ -31,7 +31,7 @@ from openai import OpenAI
 # ---------------------------------------------------------------------------
 
 API_BASE_URL = os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1")
-MODEL_NAME = os.environ.get("MODEL_NAME", "MiniMaxAI/MiniMax-M2.5")
+MODEL_NAME = os.environ.get("MODEL_NAME", "MiniMaxAI/MiniMax-M2.1")
 HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("API_KEY")
 ENV_URL = os.environ.get("ENV_URL", "http://localhost:8000")
 
@@ -91,7 +91,7 @@ def log_start(task: str, env: str, model: str) -> None:
 def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
     done_val = str(done).lower()
     error_val = str(error).lower() if error else "none"
-    short_action = action.replace("\n", " ")
+    short_action = action[:500].replace("\n", " ")
     print(
         f"[STEP] step={step} action={short_action} reward={reward:.2f} done={done_val} error={error_val}",
         flush=True,
@@ -204,8 +204,12 @@ def _to_ws_url(http_url: str) -> str:
     return http_url.replace("https://", "wss://").replace("http://", "ws://")
 
 
+TASK_TIMEOUT = 240  # 4 minutes per task (5 tasks × 4 min = 20 min max)
+
+
 async def run_task(client: OpenAI, ws_url: str, task_id: str) -> float:
     import websockets
+    import time
 
     log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
 
@@ -213,9 +217,17 @@ async def run_task(client: OpenAI, ws_url: str, task_id: str) -> float:
     steps_taken = 0
     final_score = 0.0
     success = False
+    task_start = time.time()
 
     try:
-        async with websockets.connect(f"{ws_url}/ws", open_timeout=30, max_size=100 * 1024 * 1024) as ws:
+        async with websockets.connect(
+            f"{ws_url}/ws",
+            open_timeout=30,
+            close_timeout=10,
+            max_size=100 * 1024 * 1024,
+            ping_interval=60,
+            ping_timeout=60,
+        ) as ws:
             # Reset
             reset_data = await ws_reset(ws, task_id)
             obs = reset_data["observation"]
@@ -235,6 +247,12 @@ async def run_task(client: OpenAI, ws_url: str, task_id: str) -> float:
             ]
 
             for step_num in range(1, MAX_STEPS + 1):
+                # Check per-task timeout
+                elapsed = time.time() - task_start
+                if elapsed > TASK_TIMEOUT:
+                    print(f"[DEBUG] Task {task_id} timeout after {elapsed:.0f}s (limit {TASK_TIMEOUT}s)", flush=True)
+                    break
+
                 response = get_model_response(client, messages)
                 if not response:
                     break
```