bpHigh committed
Commit bf77949 · 1 parent: 6db9bed

Update readme

Files changed (3)
  1. Dockerfile +1 -1
  2. README.md +99 -8
  3. inference.py +21 -3
Dockerfile CHANGED

```diff
@@ -31,4 +31,4 @@ HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
 
 EXPOSE 8000
 
-CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000", "--ws-ping-interval", "60", "--ws-ping-timeout", "60"]
```
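The new `--ws-ping-interval` / `--ws-ping-timeout` flags widen the WebSocket keepalive window so long model-generation pauses don't drop the connection. A rough sketch of the arithmetic, assuming `websockets`-style keepalive semantics where a silent peer is closed at most one ping interval plus one ping timeout after its last frame:

```python
def max_silent_window(ping_interval: float, ping_timeout: float) -> float:
    """Worst-case seconds a completely silent peer stays connected:
    up to ping_interval until the next ping is sent, then ping_timeout
    waiting for the pong before the connection is closed."""
    return ping_interval + ping_timeout

# The websockets library defaults to 20 s / 20 s; the values in this
# commit stretch the tolerated silence from roughly 40 s to 120 s.
print(max_silent_window(60, 60))
```

Two minutes comfortably covers a slow LLM generation step, which is the failure mode this commit is working around.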
README.md CHANGED

```diff
@@ -87,13 +87,24 @@ the same kind of work.
 
 | Action | Reward | Signal |
 |--------|--------|--------|
-| `code` | 0.02 | Small reward for active exploration |
-| `submit` / `submit_file` | 0.0–1.0 | Graded against reference |
+| `code` (failed) | 0.005 | Penalized — syntax/runtime error |
+| `code` (simple) | ~0.02 | Minimal — just imports and a print |
+| `code` (exploration) | ~0.05 | Good — reading data, producing output |
+| `code` (modification + save) | ~0.06–0.10 | Best — actively editing the workbook |
+| `submit` / `submit_file` | 0.001–0.999 | Full grading against reference |
 | Max steps (15) | Episode ends | |
 
+Code step rewards are computed from:
+- **Execution success** — failed code gets only 0.005
+- **Substantive lines** — lines beyond imports/comments earn +0.002 each (up to +0.03)
+- **Output produced** — printing data earns +0.001 per line (up to +0.02)
+- **Save operations** — calling `.save()` earns +0.03 (agent is modifying the workbook)
+
 **QA grading:** Numeric extraction with 5% tolerance + keyword overlap.
 **MODIFY grading:** 30% sheet-name match + 70% cell-level comparison (2% numeric tolerance).
 
+All scores are clamped to the open interval (0.001, 0.999).
+
 ## Setup & Usage
 
 ### Prerequisites
@@ -128,11 +139,91 @@ python inference.py
 
 ## Baseline Scores
 
-| Difficulty | Type | Expected Range |
-|------------|------|---------------|
-| Easy | QA | 0.60 – 1.00 |
-| Medium | MODIFY | 0.30 – 0.80 |
-| Hard | MODIFY | 0.10 – 0.60 |
+The environment includes 10 tasks, but the baseline inference runs 5 representative
+tasks (3 easy + 1 medium + 1 hard) to stay within the 20-minute runtime constraint.
+
+**Model:** `MiniMaxAI/MiniMax-M2.5` via HuggingFace Router
+
+| Task | Difficulty | Type | Score | Step Rewards |
+|------|------------|------|-------|-------------|
+| task_1 — Count Plants | Easy | QA | 0.001 | 0.05, 0.06, 0.06, 0.06, 0.00 |
+| task_2 — Retrieve EOL Charge | Easy | QA | 0.001 | 0.04, 0.01, 0.07, 0.06, 0.02, 0.00 |
+| task_3 — Portfolio MTM Change | Easy | QA | 0.367 | 0.06, 0.01, 0.07, ..., 0.37 |
+| task_5 — Audit Formulas | Medium | MODIFY | **0.958** | 0.07, 0.01, 0.07, ..., 0.96 |
+| task_8 — Balance Sheet Validation | Hard | MODIFY | 0.001 | 0.06, 0.01, 0.06, ..., 0.05 |
+| **Average** | | | **0.266** | |
+
+**Runtime:** 12 min 10 sec (limit: 20 min) · **Server memory:** ~40 MB (limit: 8 GB)
+
+Note: Step rewards vary based on code quality — failed code gets 0.005, exploration
+~0.05, modification+save ~0.06–0.10.
+
+### Run 2 — `google/gemma-4-26B-A4B-it`
+
+| Task | Difficulty | Type | Score |
+|------|------------|------|-------|
+| task_1 — Count Plants | Easy | QA | 0.001 |
+| task_2 — Retrieve EOL Charge | Easy | QA | **0.999** |
+| task_3 — Portfolio MTM Change | Easy | QA | 0.001 |
+| task_5 — Audit Formulas | Medium | MODIFY | 0.001 |
+| task_8 — Balance Sheet Validation | Hard | MODIFY | 0.001 |
+| **Average** | | | **0.201** |
+
+**Runtime:** 19 min 27 sec (limit: 20 min) · **Server memory:** ~40 MB
+
+Gemma 4 26B solved task_2 perfectly in just 2 steps but timed out on more
+complex tasks due to longer generation times.
+
+### Run 3 — `Qwen/Qwen3.5-122B-A10B`
+
+| Task | Difficulty | Type | Score |
+|------|------------|------|-------|
+| task_1 — Count Plants | Easy | QA | 0.001 |
+| task_2 — Retrieve EOL Charge | Easy | QA | **0.999** |
+| task_3 — Portfolio MTM Change | Easy | QA | 0.001 |
+| task_5 — Audit Formulas | Medium | MODIFY | 0.001 |
+| task_8 — Balance Sheet Validation | Hard | MODIFY | 0.001 |
+| **Average** | | | **0.201** |
+
+**Runtime:** 2 min 11 sec · Fast inference but hit per-task timeout on complex tasks.
+
+### Run 4 — `deepseek-ai/DeepSeek-R1`
+
+| Task | Difficulty | Type | Score |
+|------|------------|------|-------|
+| task_1 — Count Plants | Easy | QA | 0.001 |
+| task_2 — Retrieve EOL Charge | Easy | QA | 0.001 |
+| task_3 — Portfolio MTM Change | Easy | QA | 0.001 |
+| task_5 — Audit Formulas | Medium | MODIFY | 0.001 |
+| task_8 — Balance Sheet Validation | Hard | MODIFY | 0.001 |
+| **Average** | | | **0.001** |
+
+**Runtime:** 11 min 57 sec · DeepSeek-R1's long chain-of-thought reasoning consumed
+most of the output tokens, leaving answers that didn't parse correctly.
+
+### Run 5 — `MiniMaxAI/MiniMax-M2.1` (Best)
+
+| Task | Difficulty | Type | Score | Steps |
+|------|------------|------|-------|-------|
+| task_1 — Count Plants | Easy | QA | 0.001 | 5 |
+| task_2 — Retrieve EOL Charge | Easy | QA | **0.999** | 4 |
+| task_3 — Portfolio MTM Change | Easy | QA | 0.001 | 10 |
+| task_5 — Audit Formulas | Medium | MODIFY | **0.958** | 4 |
+| task_8 — Balance Sheet Validation | Hard | MODIFY | **0.733** | 10 |
+| **Average** | | | **0.538** | |
+
+**Runtime:** 3 min 18 sec · Best overall performance — solved 3/5 tasks with high
+scores including the hard MODIFY task (0.733). Fast and efficient.
+
+### Model Comparison Summary
+
+| Model | Avg Score | Runtime | Best Task |
+|-------|-----------|---------|-----------|
+| **MiniMax-M2.1** | **0.538** | **3m 18s** | task_5: 0.958, task_8: 0.733 |
+| MiniMax-M2.5 | 0.266 | 12m 10s | task_5: 0.958 |
+| Gemma 4 26B | 0.201 | 19m 27s | task_2: 0.999 |
+| Qwen 3.5 122B | 0.201 | 2m 11s | task_2: 0.999 |
+| DeepSeek-R1 | 0.001 | 11m 57s | — |
 
 ## Project Structure
 
@@ -178,7 +269,7 @@ This environment models real financial spreadsheet work:
 - **Consolidation** — aggregate data across sheets into summary views
 
 Each task uses a genuine enterprise Excel workbook. MODIFY tasks are graded
-by cell-level comparison against a reference workbook.
+by spreadsheet properties comparison against a reference workbook.
 
 ## Acknowledgments
 
```
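The code-step reward bullets in the README diff describe a simple additive heuristic. A minimal sketch of how such a scorer could look, assuming a flat base reward of about 0.02 for any successful code step (the base value, function name, and line-classification rules are illustrative assumptions, not the environment's actual implementation):

```python
def score_code_step(code: str, stdout_lines: int, failed: bool) -> float:
    """Hypothetical reconstruction of the code-step reward heuristic."""
    # Failed code gets only the floor-level reward.
    if failed:
        return 0.005
    reward = 0.02  # assumed base for a successful code step
    # Substantive lines: anything beyond blanks, comments, and imports.
    substantive = [
        ln for ln in code.splitlines()
        if ln.strip() and not ln.strip().startswith(("#", "import ", "from "))
    ]
    reward += min(0.002 * len(substantive), 0.03)  # +0.002 each, capped at +0.03
    reward += min(0.001 * stdout_lines, 0.02)      # +0.001 per output line, capped at +0.02
    if ".save(" in code:                           # workbook-modification bonus
        reward += 0.03
    return min(max(reward, 0.001), 0.999)          # clamp to (0.001, 0.999)
```

With these weights a pure-import script scores 0.02, exploratory reads with printed output land near 0.05, and a save-heavy edit tops out at 0.10, which matches the ranges in the action table above.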
inference.py CHANGED

```diff
@@ -31,7 +31,7 @@ from openai import OpenAI
 # ---------------------------------------------------------------------------
 
 API_BASE_URL = os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1")
-MODEL_NAME = os.environ.get("MODEL_NAME", "MiniMaxAI/MiniMax-M2.5")
+MODEL_NAME = os.environ.get("MODEL_NAME", "MiniMaxAI/MiniMax-M2.1")
 HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("API_KEY")
 ENV_URL = os.environ.get("ENV_URL", "http://localhost:8000")
 
@@ -91,7 +91,7 @@ def log_start(task: str, env: str, model: str) -> None:
 def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
     done_val = str(done).lower()
     error_val = str(error).lower() if error else "none"
-    short_action = action.replace("\n", " ")
+    short_action = action[:500].replace("\n", " ")
     print(
         f"[STEP] step={step} action={short_action} reward={reward:.2f} done={done_val} error={error_val}",
         flush=True,
@@ -204,8 +204,12 @@ def _to_ws_url(http_url: str) -> str:
     return http_url.replace("https://", "wss://").replace("http://", "ws://")
 
 
+TASK_TIMEOUT = 240  # 4 minutes per task (5 tasks × 4 min = 20 min max)
+
+
 async def run_task(client: OpenAI, ws_url: str, task_id: str) -> float:
     import websockets
+    import time
 
     log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
 
@@ -213,9 +217,17 @@ async def run_task(client: OpenAI, ws_url: str, task_id: str) -> float:
     steps_taken = 0
     final_score = 0.0
     success = False
+    task_start = time.time()
 
     try:
-        async with websockets.connect(f"{ws_url}/ws", open_timeout=30, max_size=100 * 1024 * 1024) as ws:
+        async with websockets.connect(
+            f"{ws_url}/ws",
+            open_timeout=30,
+            close_timeout=10,
+            max_size=100 * 1024 * 1024,
+            ping_interval=60,
+            ping_timeout=60,
+        ) as ws:
             # Reset
             reset_data = await ws_reset(ws, task_id)
             obs = reset_data["observation"]
@@ -235,6 +247,12 @@ async def run_task(client: OpenAI, ws_url: str, task_id: str) -> float:
             ]
 
             for step_num in range(1, MAX_STEPS + 1):
+                # Check per-task timeout
+                elapsed = time.time() - task_start
+                if elapsed > TASK_TIMEOUT:
+                    print(f"[DEBUG] Task {task_id} timeout after {elapsed:.0f}s (limit {TASK_TIMEOUT}s)", flush=True)
+                    break
+
                 response = get_model_response(client, messages)
                 if not response:
                     break
```