Update readme
- Dockerfile +1 -1
- README.md +99 -8
- inference.py +21 -3
Dockerfile
CHANGED
@@ -31,4 +31,4 @@ HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
 
 EXPOSE 8000
 
-CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000", "--ws-ping-interval", "60", "--ws-ping-timeout", "60"]
README.md
CHANGED
@@ -87,13 +87,24 @@ the same kind of work.
 
 | Action | Reward | Signal |
 |--------|--------|--------|
-| `code` | 0.
-| `
 | Max steps (15) | Episode ends | |
 
 **QA grading:** Numeric extraction with 5% tolerance + keyword overlap.
 **MODIFY grading:** 30% sheet-name match + 70% cell-level comparison (2% numeric tolerance).
 
 ## Setup & Usage
 
 ### Prerequisites
@@ -128,11 +139,91 @@ python inference.py
 
 ## Baseline Scores
 
-
-
-
-
-
 
 ## Project Structure
 
@@ -178,7 +269,7 @@ This environment models real financial spreadsheet work:
 - **Consolidation** – aggregate data across sheets into summary views
 
 Each task uses a genuine enterprise Excel workbook. MODIFY tasks are graded
-by
 
 ## Acknowledgments
 
 | Action | Reward | Signal |
 |--------|--------|--------|
+| `code` (failed) | 0.005 | Penalized – syntax/runtime error |
+| `code` (simple) | ~0.02 | Minimal – just imports and a print |
+| `code` (exploration) | ~0.05 | Good – reading data, producing output |
+| `code` (modification + save) | ~0.06–0.10 | Best – actively editing the workbook |
+| `submit` / `submit_file` | 0.001–0.999 | Full grading against reference |
 | Max steps (15) | Episode ends | |
 
+Code step rewards are computed from:
+- **Execution success** – failed code gets only 0.005
+- **Substantive lines** – lines beyond imports/comments earn +0.002 each (up to +0.03)
+- **Output produced** – printing data earns +0.001 per line (up to +0.02)
+- **Save operations** – calling `.save()` earns +0.03 (agent is modifying the workbook)
+
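Taken together, the bullet points above describe a simple additive heuristic. A rough sketch, reconstructed from this description rather than from the environment's actual grader (the base constant for trivially-running code is an assumption):

```python
def score_code_step(executed_ok: bool, code: str, output: str) -> float:
    """Illustrative sketch of the per-step reward heuristic described above."""
    if not executed_ok:
        return 0.005  # failed code gets a flat penalty score

    reward = 0.015  # assumed base for code that runs at all (~0.02 with a print)
    substantive = [
        ln for ln in code.splitlines()
        if ln.strip() and not ln.strip().startswith(("#", "import", "from"))
    ]
    reward += min(0.002 * len(substantive), 0.03)          # +0.002/line, capped at +0.03
    reward += min(0.001 * len(output.splitlines()), 0.02)  # +0.001/output line, capped at +0.02
    if ".save(" in code:
        reward += 0.03  # saving the workbook signals an actual modification
    return reward
```

Under these assumptions, a script that only prints lands near 0.02, exploration code near 0.05, and an edit that calls `.save()` approaches 0.10.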
 **QA grading:** Numeric extraction with 5% tolerance + keyword overlap.
 **MODIFY grading:** 30% sheet-name match + 70% cell-level comparison (2% numeric tolerance).
 
+All scores are clamped to the open interval (0.001, 0.999).
+
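Read literally, the tolerance rule and the clamp could look like this (an illustrative sketch only; the repo's actual grading code is not shown in this diff):

```python
def within_tolerance(pred: float, ref: float, tol: float = 0.05) -> bool:
    """Relative tolerance check: 5% for QA numeric answers, 2% for MODIFY cells."""
    if ref == 0:
        return abs(pred) <= tol  # fall back to an absolute check at zero
    return abs(pred - ref) / abs(ref) <= tol


def clamp_score(score: float) -> float:
    """Keep every final score inside the open interval (0.001, 0.999)."""
    return min(max(score, 0.001), 0.999)
```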
 ## Setup & Usage
 
 ### Prerequisites
+The environment includes 10 tasks, but the baseline inference runs 5 representative
+tasks (3 easy + 1 medium + 1 hard) to stay within the 20-minute runtime constraint.
+
+**Model:** `MiniMaxAI/MiniMax-M2.5` via HuggingFace Router
+
+| Task | Difficulty | Type | Score | Step Rewards |
+|------|------------|------|-------|-------------|
+| task_1 – Count Plants | Easy | QA | 0.001 | 0.05, 0.06, 0.06, 0.06, 0.00 |
+| task_2 – Retrieve EOL Charge | Easy | QA | 0.001 | 0.04, 0.01, 0.07, 0.06, 0.02, 0.00 |
+| task_3 – Portfolio MTM Change | Easy | QA | 0.367 | 0.06, 0.01, 0.07, ..., 0.37 |
+| task_5 – Audit Formulas | Medium | MODIFY | **0.958** | 0.07, 0.01, 0.07, ..., 0.96 |
+| task_8 – Balance Sheet Validation | Hard | MODIFY | 0.001 | 0.06, 0.01, 0.06, ..., 0.05 |
+| **Average** | | | **0.266** | |
+
+**Runtime:** 12 min 10 sec (limit: 20 min) · **Server memory:** ~40 MB (limit: 8 GB)
+
+Note: Step rewards vary based on code quality – failed code gets 0.005, exploration
+~0.05, modification+save ~0.06–0.10.
+
+### Run 2 – `google/gemma-4-26B-A4B-it`
+
+| Task | Difficulty | Type | Score |
+|------|------------|------|-------|
+| task_1 – Count Plants | Easy | QA | 0.001 |
+| task_2 – Retrieve EOL Charge | Easy | QA | **0.999** |
+| task_3 – Portfolio MTM Change | Easy | QA | 0.001 |
+| task_5 – Audit Formulas | Medium | MODIFY | 0.001 |
+| task_8 – Balance Sheet Validation | Hard | MODIFY | 0.001 |
+| **Average** | | | **0.201** |
+
+**Runtime:** 19 min 27 sec (limit: 20 min) · **Server memory:** ~40 MB
+
+Gemma 4 26B solved task_2 perfectly in just 2 steps but timed out on more
+complex tasks due to longer generation times.
+
+### Run 3 – `Qwen/Qwen3.5-122B-A10B`
+
+| Task | Difficulty | Type | Score |
+|------|------------|------|-------|
+| task_1 – Count Plants | Easy | QA | 0.001 |
+| task_2 – Retrieve EOL Charge | Easy | QA | **0.999** |
+| task_3 – Portfolio MTM Change | Easy | QA | 0.001 |
+| task_5 – Audit Formulas | Medium | MODIFY | 0.001 |
+| task_8 – Balance Sheet Validation | Hard | MODIFY | 0.001 |
+| **Average** | | | **0.201** |
+
+**Runtime:** 2 min 11 sec · Fast inference but hit per-task timeout on complex tasks.
+
+### Run 4 – `deepseek-ai/DeepSeek-R1`
+
+| Task | Difficulty | Type | Score |
+|------|------------|------|-------|
+| task_1 – Count Plants | Easy | QA | 0.001 |
+| task_2 – Retrieve EOL Charge | Easy | QA | 0.001 |
+| task_3 – Portfolio MTM Change | Easy | QA | 0.001 |
+| task_5 – Audit Formulas | Medium | MODIFY | 0.001 |
+| task_8 – Balance Sheet Validation | Hard | MODIFY | 0.001 |
+| **Average** | | | **0.001** |
+
+**Runtime:** 11 min 57 sec · DeepSeek-R1's long chain-of-thought reasoning consumed
+most of the output tokens, leaving answers that didn't parse correctly.
+
+### Run 5 – `MiniMaxAI/MiniMax-M2.1` (Best)
+
+| Task | Difficulty | Type | Score | Steps |
+|------|------------|------|-------|-------|
+| task_1 – Count Plants | Easy | QA | 0.001 | 5 |
+| task_2 – Retrieve EOL Charge | Easy | QA | **0.999** | 4 |
+| task_3 – Portfolio MTM Change | Easy | QA | 0.001 | 10 |
+| task_5 – Audit Formulas | Medium | MODIFY | **0.958** | 4 |
+| task_8 – Balance Sheet Validation | Hard | MODIFY | **0.733** | 10 |
+| **Average** | | | **0.538** | |
+
+**Runtime:** 3 min 18 sec · Best overall performance – solved 3/5 tasks with high
+scores including the hard MODIFY task (0.733). Fast and efficient.
+
+### Model Comparison Summary
+
+| Model | Avg Score | Runtime | Best Task |
+|-------|-----------|---------|-----------|
+| **MiniMax-M2.1** | **0.538** | **3m 18s** | task_5: 0.958, task_8: 0.733 |
+| MiniMax-M2.5 | 0.266 | 12m 10s | task_5: 0.958 |
+| Gemma 4 26B | 0.201 | 19m 27s | task_2: 0.999 |
+| Qwen 3.5 122B | 0.201 | 2m 11s | task_2: 0.999 |
+| DeepSeek-R1 | 0.001 | 11m 57s | – |
 
 ## Project Structure
 
 - **Consolidation** – aggregate data across sheets into summary views
 
 Each task uses a genuine enterprise Excel workbook. MODIFY tasks are graded
+by spreadsheet properties comparison against a reference workbook.
 
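A minimal sketch of what "spreadsheet properties comparison" can mean in practice, combining the 30%/70% split with the 2% numeric tolerance. This is a hypothetical helper for illustration, not the repo's grader, and workbooks are modeled here as plain dicts rather than Excel files:

```python
def grade_modify(submitted: dict, reference: dict) -> float:
    """Sketch: 30% sheet-name match + 70% cell-level match (2% numeric tolerance).
    Workbooks are modeled as {sheet_name: {cell_ref: value}} for illustration."""
    # 30%: fraction of reference sheet names present in the submission
    sheet_score = len(set(submitted) & set(reference)) / len(reference)

    # 70%: fraction of reference cells matched
    total = matched = 0
    for sheet, cells in reference.items():
        for cell_ref, ref_val in cells.items():
            total += 1
            got = submitted.get(sheet, {}).get(cell_ref)
            if isinstance(ref_val, (int, float)) and isinstance(got, (int, float)):
                matched += abs(got - ref_val) <= 0.02 * abs(ref_val)
            else:
                matched += got == ref_val
    cell_score = matched / total if total else 0.0
    return 0.3 * sheet_score + 0.7 * cell_score
```

For example, a submission with the right sheets but only half the reference cells correct would score 0.3 · 1.0 + 0.7 · 0.5 = 0.65 before clamping.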
 ## Acknowledgments
 
inference.py
CHANGED
@@ -31,7 +31,7 @@ from openai import OpenAI
 # ---------------------------------------------------------------------------
 
 API_BASE_URL = os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1")
-MODEL_NAME = os.environ.get("MODEL_NAME", "MiniMaxAI/MiniMax-M2.
 HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("API_KEY")
 ENV_URL = os.environ.get("ENV_URL", "http://localhost:8000")
 
@@ -91,7 +91,7 @@ def log_start(task: str, env: str, model: str) -> None:
 def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
     done_val = str(done).lower()
     error_val = str(error).lower() if error else "none"
-    short_action = action.replace("\n", " ")
     print(
         f"[STEP] step={step} action={short_action} reward={reward:.2f} done={done_val} error={error_val}",
         flush=True,
@@ -204,8 +204,12 @@ def _to_ws_url(http_url: str) -> str:
     return http_url.replace("https://", "wss://").replace("http://", "ws://")
 
 
 async def run_task(client: OpenAI, ws_url: str, task_id: str) -> float:
     import websockets
 
     log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
 
@@ -213,9 +217,17 @@ async def run_task(client: OpenAI, ws_url: str, task_id: str) -> float:
     steps_taken = 0
     final_score = 0.0
     success = False
 
     try:
-        async with websockets.connect(
             # Reset
             reset_data = await ws_reset(ws, task_id)
             obs = reset_data["observation"]
 
@@ -235,6 +247,12 @@ async def run_task(client: OpenAI, ws_url: str, task_id: str) -> float:
     ]
 
     for step_num in range(1, MAX_STEPS + 1):
         response = get_model_response(client, messages)
         if not response:
             break
 # ---------------------------------------------------------------------------
 
 API_BASE_URL = os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME = os.environ.get("MODEL_NAME", "MiniMaxAI/MiniMax-M2.1")
 HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("API_KEY")
 ENV_URL = os.environ.get("ENV_URL", "http://localhost:8000")
 
 def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
     done_val = str(done).lower()
     error_val = str(error).lower() if error else "none"
+    short_action = action[:500].replace("\n", " ")
     print(
         f"[STEP] step={step} action={short_action} reward={reward:.2f} done={done_val} error={error_val}",
         flush=True,
 
     return http_url.replace("https://", "wss://").replace("http://", "ws://")
 
 
+TASK_TIMEOUT = 240  # 4 minutes per task (5 tasks × 4 min = 20 min max)
+
+
 async def run_task(client: OpenAI, ws_url: str, task_id: str) -> float:
     import websockets
+    import time
 
     log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
 
     steps_taken = 0
     final_score = 0.0
     success = False
+    task_start = time.time()
 
     try:
+        async with websockets.connect(
+            f"{ws_url}/ws",
+            open_timeout=30,
+            close_timeout=10,
+            max_size=100 * 1024 * 1024,
+            ping_interval=60,
+            ping_timeout=60,
+        ) as ws:
             # Reset
             reset_data = await ws_reset(ws, task_id)
             obs = reset_data["observation"]
 
     ]
 
     for step_num in range(1, MAX_STEPS + 1):
+        # Check per-task timeout
+        elapsed = time.time() - task_start
+        if elapsed > TASK_TIMEOUT:
+            print(f"[DEBUG] Task {task_id} timeout after {elapsed:.0f}s (limit {TASK_TIMEOUT}s)", flush=True)
+            break
+
         response = get_model_response(client, messages)
         if not response:
             break