yonghongzhang committed on
Commit
e8700e9
·
1 Parent(s): e42edee

Rewrite blog: reposition as reliable tool-use benchmark

Files changed (1)
  1. README.md +92 -329
README.md CHANGED
@@ -1,413 +1,176 @@
1
- ---
2
- title: "ComtradeBench: An Adversarial Tool-Use Benchmark for Agentic RL"
3
- emoji: 📊
4
- colorFrom: indigo
5
- colorTo: gray
6
- tags:
7
- - openenv
8
- - rl-environment
9
- - agentbeats
10
- - grpo
11
- - llm-agent
12
- - mcp
13
- - adversarial
14
- - tool-use
15
- - competition
16
- license: mit
17
- ---
18
-
19
- # ComtradeBench: An Adversarial Tool-Use Benchmark for Agentic RL
20
 
21
  **AgentBeats Phase 2 — OpenEnv Challenge Submission**
22
- Author: MateFin | [GitHub](https://github.com/yonghongzhang-io/comtrade-openenv) | [HF Space](https://huggingface.co/spaces/yonghongzhang/comtrade-env) | [Blog](https://huggingface.co/yonghongzhang/ComtradeBench-Blog)
23
 
24
  ---
25
 
26
- ## Motivation
27
-
28
- The next frontier in LLM post-training is **agentic tool-use under adversarial conditions**. Today's agents can call APIs in clean sandboxes — but real-world APIs fight back. They paginate unpredictably. They rate-limit aggressively. They return duplicate data across page boundaries. They inject misleading summary rows. They reorder results non-deterministically between identical calls.
29
 
30
- These are not edge cases — they are the **default behavior** of production APIs at scale (AWS, Stripe, Bloomberg, UN Comtrade). Yet no existing RL benchmark systematically tests whether an LLM agent can handle them.
 
31
 
32
- **ComtradeBench fills this gap.** We built a 10-task OpenEnv environment with **adaptive fault injection**: the first RL benchmark where the environment dynamically escalates difficulty based on the agent's own performance. This creates a fundamentally different training signal than static benchmarks: the agent cannot memorize a fixed policy; it must continuously adapt.
33
 
34
- We demonstrate this with a full GRPO training pipeline, real LLM evaluations (Moonshot V1-8K), and a Green Agent wrapper for community evaluation — all deployed live on HuggingFace Spaces.
35
 
36
- ### Why This Matters for RL Research
37
-
38
- 1. **Distribution shift within episodes**: T9 (Adaptive Adversary) changes fault intensity mid-episode. This is the first OpenEnv benchmark to test **non-stationary environment dynamics** — a critical open problem in RL.
39
- 2. **Multi-dimensional reward**: 6 scoring dimensions force the agent to balance competing objectives (correctness vs efficiency vs observability), unlike binary success/fail benchmarks.
40
- 3. **Reproducible and concurrent**: Seeded RNG + episode isolation enables deterministic, parallel GRPO training — directly compatible with TRL and torchforge.
41
- 4. **Community-reusable**: Any researcher can deploy ComtradeBench and evaluate their own agent against our Green Agent via A2A protocol.
42
 
43
  ---
44
 
45
- ## Environment Design
46
-
47
- ### The Task
48
-
49
- The agent is given a trade data query (reporter country, partner country, trade flow, HS product code, year). It must:
50
-
51
- 1. Discover pagination bounds via the API
52
- 2. Fetch all pages until `has_more=False`
53
- 3. Deduplicate records by primary key `(year, reporter, partner, flow, hs, record_id)`
54
- 4. Drop summary rows (`is_total=true`)
55
- 5. Submit a JSONL file with clean data + metadata + execution log
56
-
57
- The agent has a budget of 100 requests per episode.
58
-
59
- ### Three MCP Tools
60
-
61
- The environment exposes exactly three tools via the Model Context Protocol (MCP):
62
-
63
- ```
64
- get_task_info()
65
- → Returns task parameters, mock service URL, and request budget.
66
-
67
- fetch_page(page: int, page_size: int = 500)
68
- → Fetches one page. Returns {rows, page, total_pages, has_more}.
69
- On fault: {status: 429|500, retry: true}
70
-
71
- submit_results(data_jsonl, metadata_json, run_log)
72
- → Scores the submission. Returns {reward, score, breakdown, errors}.
73
- ```
74
 
75
- This minimal interface mirrors how real API agents are constrained: the agent cannot inspect internal state, cannot bypass pagination, and cannot retry with a fresh session.
76
 
77
- ### Eight Tasks — Progressive Difficulty
 
 
 
 
 
78
 
79
- | Task | Fault Injected | Key Challenge | Difficulty |
80
- |------|---------------|---------------|------------|
81
- | T1 | None | Schema validation, baseline correctness | Easy |
82
- | T2 | Pagination only | Multi-page merge (2,345 rows across 5+ pages) | Easy |
83
- | T3 | 8% within-page + 3% cross-page duplicates | Primary-key deduplication | Medium |
84
- | T4 | HTTP 429 on page 2 | Backoff + retry without data loss | Medium |
85
- | T5 | HTTP 500 on page 2 | Transient error handling | Medium |
86
- | T6 | Non-deterministic page ordering | Canonicalization + dedup under drift | Hard |
87
- | T7 | `is_total=true` summary rows mixed in | Totals-trap filtering | Hard |
88
- | T8 | 429 rate-limit + cross-page duplicates | Both retry AND dedup simultaneously | Hard |
89
 
90
- Tasks T1–T8 are drawn from real UN Comtrade API behaviors: pagination drift, duplicate records, and totals rows are documented failure modes that production ETL pipelines routinely encounter.
91
-
92
- ### Novel Tasks — Beyond Static Benchmarks
93
-
94
- ComtradeBench goes beyond static fault injection with two novel task types that no existing RL benchmark offers:
95
-
96
- **T9: Adaptive Adversary** — The environment observes the agent's progress and *dynamically escalates* fault intensity mid-episode. Initial pages have 5% duplicate rate; each successful fetch increases it by 3%. After page 3, the environment starts injecting HTTP 429 errors. After page 4, totals rows appear. This creates a **distribution shift within a single episode** — the agent must continuously adapt its strategy rather than relying on a fixed policy. This models real-world API degradation where services throttle heavy consumers progressively.
97
-
98
- **T10: Constrained Budget Stress** — A single agent runs under a halved request budget (50 instead of 100). It must avoid redundant fetches while still achieving complete page coverage and clean deduplication. This keeps the benchmark stable for the current single-agent training stack while preserving strong pressure on efficiency.
99
-
100
- These novel tasks transform ComtradeBench from a static benchmark into a **dynamic, adaptive training environment** that challenges both single-agent robustness (T9) and constrained-budget policy quality (T10).
101
-
102
- ### Mock Service Architecture
103
-
104
- The embedded mock service is a FastAPI application with per-task fault injection:
105
-
106
- ```
107
- comtrade_env/
108
- ├── server/
109
- │ ├── comtrade_env_environment.py ← MCPEnvironment (3 MCP tools)
110
- │ ├── tasks.py ← Task definitions T1-T10
111
- │ ├── judge.py ← Scoring engine
112
- │ └── mock_service/
113
- │ └── app.py ← Stateless /api/data with fault injection
114
- ```
115
-
116
- The mock service is **stateless**: each request reconstructs the response from task configuration + request parameters. This makes the environment reproducible and concurrent-safe — multiple agents can run simultaneously without shared state corruption.
117
-
118
- ### Scoring (0–100 → reward 0.0–1.0)
119
-
120
- The judge evaluates six dimensions:
121
-
122
- | Dimension | Weight | What it measures |
123
- |-----------|--------|-----------------|
124
- | Correctness | 30 | Row-level accuracy (content + count) |
125
- | Completeness | 15 | Zero missing records |
126
- | Robustness | 15 | Correct fault handling (429/500 retry) |
127
- | Efficiency | 15 | Request count vs. task baseline |
128
- | Data Quality | 15 | No duplicates leaked, no totals rows |
129
- | Observability | 10 | Log contains `task_id=`, `page=`, `request=`, `complete=` |
130
-
131
- **Governance rules prevent gaming:**
132
- - Efficiency and Observability points are capped at 50% if Correctness < 70%
133
- - Efficiency points require 100% Completeness — you cannot skip pages and claim efficiency
134
- - Execution time > 45s incurs a penalty (max 3 points)
135
 
136
  ---
137
 
138
- ## LLM Agent Design
139
-
140
- ### Agentic Loop
141
 
142
- The agent (`llm_agent/agent.py`) runs a standard tool-use loop:
143
 
144
- ```
145
- SYSTEM_PROMPT + task description
146
-
147
- LLM generates <tool_call>{...}</tool_call>
148
-
149
- Environment executes tool
150
-
151
- <tool_result>{...}</tool_result> appended to context
152
-
153
- repeat until submit_results called
154
- ```
155
 
156
- Tool calls use a lightweight XML format that works with any instruction-tuned model:
 
 
 
 
 
157
 
158
- ```xml
159
- <tool_call>{"name": "fetch_page", "arguments": {"page": 1}}</tool_call>
160
- ```
161
 
162
- The agent handles the protocol details (deduplication, retry on 429/500, totals filtering) in its loop logic, not by prompting the model to implement them. This keeps the model focused on **sequencing decisions** (which page to fetch next, when to submit) while the infrastructure handles correctness invariants.
163
 
164
- ### Fault Handling
165
 
166
- ```python
167
- # Retry on transient faults
168
- if tool_result.get("status") in (429, 500) or tool_result.get("retry"):
169
- wait = 2 * (retry_count + 1)
170
- time.sleep(wait)
171
- tool_result = self.env.call_tool(tool_name, tool_args)
172
 
173
- # Dedup + totals filter on every fetch_page
174
- for row in tool_result["rows"]:
175
- if row.get("is_total"):
176
- continue
177
- pk = "|".join(str(row.get(k, "")) for k in
178
- ("year", "reporter", "partner", "flow", "hs", "record_id"))
179
- collected_rows[pk] = row # dict assignment = automatic dedup
180
- ```
181
 
182
- ### Backend Flexibility
183
 
184
- The `LLMBackend` class supports two modes:
185
 
186
- ```python
187
- # Local HuggingFace model
188
- backend = LLMBackend.from_hf("Qwen/Qwen2.5-7B-Instruct")
 
 
 
189
 
190
- # OpenAI-compatible API (vLLM, Ollama, Together, etc.)
191
- backend = LLMBackend.from_api("http://localhost:11434/v1", "qwen2.5:7b")
192
- ```
193
 
194
  ---
195
 
196
- ## GRPO Training
197
-
198
- We implement **Group Relative Policy Optimization** (GRPO, from DeepSeekMath) to train the agent purely from environment reward signals — no human-labeled data needed.
199
 
200
- ### Why GRPO for Agentic Tasks
201
 
202
- Standard RLHF requires a separate reward model. GRPO replaces it with **group-relative normalization**: run `G` episodes per task, compute each episode's advantage as `(reward - group_mean) / group_std`. This:
203
 
204
- - Eliminates reward model training overhead
205
- - Naturally handles sparse rewards (most steps get reward only at episode end)
206
- - Scales to long multi-turn trajectories without value function estimation
207
 
208
- ### Implementation (`llm_agent/train_grpo.py`)
209
 
210
- ```python
211
- def grpo_loss(log_probs, old_log_probs, ref_log_probs, advantages,
212
- clip_eps=0.2, kl_coeff=0.04):
213
- """Clipped surrogate + reverse-KL penalty (DeepSeekMath)."""
214
- # Policy ratio: r_t = π_new / π_old
215
- ratio = torch.exp(log_probs - old_log_probs)
216
- clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
217
- surrogate = torch.min(ratio * advantages, clipped * advantages).mean()
218
 
219
- # Reverse KL: D_KL(π_new || π_ref) = E[exp(x) - 1 - x] where x = log(π_new/π_ref)
220
- log_ratio_ref = log_probs - ref_log_probs
221
- kl = (torch.exp(log_ratio_ref) - 1 - log_ratio_ref).mean()
222
 
223
- return -(surrogate - kl_coeff * kl)
224
- ```
 
 
 
 
 
 
225
 
226
- Training loop:
227
- 1. **Rollout phase**: run `G=4` episodes per task using current policy
228
- 2. **Advantage computation**: `A_i = (r_i - mean_group) / (std_group + 1e-8)`
229
- 3. **Policy update**: minimize GRPO loss over all trajectory tokens
230
- 4. **Checkpoint**: save every 50 iterations; monitor per-task reward
231
 
232
- ### Key Hyperparameters
233
 
234
- | Parameter | Value | Rationale |
235
- |-----------|-------|-----------|
236
- | `clip_eps` | 0.2 | Standard PPO clip; prevents large policy jumps |
237
- | `kl_coeff` | 0.04 | Light KL penalty; allows exploration |
238
- | `group_size` | 4 | 4 rollouts per task per iteration |
239
- | `lr` | 1e-5 | Conservative for fine-tuning |
240
- | `max_steps` | 30 | Sufficient for all T1-T10 tasks |
241
 
242
  ---
243
 
244
- ## Results
245
 
246
- ### Rule-Based Baseline (no LLM)
247
 
248
- The deterministic baseline agent in `smoke_test.py` achieves high scores on all tasks, validating the environment and scoring machinery end-to-end:
249
-
250
- | Task | Score | Reward | Breakdown |
251
- |------|-------|--------|-----------|
252
- | T1 single page | 95.0 | 0.9500 | corr=30 comp=15 robu=12 effi=15 data=15 obs=8 |
253
- | T2 multi-page | 98.0 | 0.9800 | corr=30 comp=15 robu=15 effi=15 data=15 obs=8 |
254
- | T3 duplicates | 98.0 | 0.9800 | corr=30 comp=15 robu=15 effi=15 data=15 obs=8 |
255
- | T4 rate-limit 429 | 83.0 | 0.8300 | corr=30 comp=15 robu=0 effi=15 data=15 obs=8 |
256
- | T5 server error 500 | 83.7 | 0.8370 | corr=30 comp=15 robu=0 effi=15 data=15 obs=8.7 |
257
- | T6 page drift | 94.3 | 0.9430 | corr=26.3 comp=15 robu=15 effi=15 data=15 obs=8 |
258
- | T7 totals trap | 96.0 | 0.9600 | corr=28 comp=15 robu=15 effi=15 data=15 obs=8 |
259
- | **Average** | **92.6** | **0.9257** | |
260
-
261
- All scores from `inference.py --mode rule-based` (deterministic, no LLM, reproducible). Full breakdown available in `inference_results_baseline.json`.
262
-
263
- ### LLM Agent Results
264
-
265
- We evaluated two LLM backends via the agentic loop described above: LLM decides tool sequencing, while the infrastructure handles dedup, retry, and submission.
266
-
267
- **Moonshot V1-8K (Kimi) — full agentic loop, all 8 tasks:**
268
-
269
- | Task | Score | Reward | Steps | vs Baseline |
270
- |------|-------|--------|-------|-------------|
271
- | T1 Single page | 98.7 | 0.987 | 3 | +3.7 |
272
- | T2 Multi-page | 98.7 | 0.987 | 7 | +0.7 |
273
- | T3 Duplicates | 98.7 | 0.987 | 5 | +0.7 |
274
- | T4 Rate limit 429 | 83.7 | 0.837 | 5 | +0.7 |
275
- | T5 Server error 500 | 84.3 | 0.843 | 5 | +0.6 |
276
- | T6 Page drift | 94.7 | 0.947 | 5 | +0.4 |
277
- | T7 Totals trap | 98.7 | 0.987 | 5 | +2.7 |
278
- | T8 Mixed faults | 97.3 | 0.973 | 5 | +0.9 |
279
- | **Average** | **94.4** | **0.944** | **5.0** | **+1.3** |
280
 
281
  ![Benchmark Results](benchmark_results.png)
282
 
283
- ### GRPO Rollout Training Curve (8 iterations, Moonshot V1-8K)
284
-
285
- We ran 8 iterations of GRPO-style rollouts with group_size=2, sampling 2 random tasks per iteration. Each rollout is a full agentic episode with real LLM tool-calling decisions.
286
 
287
  ![Training Curve](training_curve.png)
288
 
289
- The left chart shows reward across iterations with min-max range and rolling average. The right chart shows per-task mean reward across all iterations where that task appeared. The orange dotted line marks the rule-based baseline (0.930).
290
-
291
- Key observations:
292
- - **Mean reward consistently above baseline** (0.930) in 6/8 iterations
293
- - **Iterations with fault tasks (T4/T5) pull the mean down** — these are genuinely harder and require the agent to handle 429/500 errors gracefully
294
- - **T8 mixed faults achieves 0.973** — demonstrating the LLM can handle combined rate-limit + dedup challenges
295
- - **Per-task variance is low** (small error bars) — the agent's behavior is consistent across rollouts
296
-
297
- Key findings:
298
- - **LLM agent outperforms rule-based baseline on 8/8 tasks** — the LLM generates better structured logs (Observability +2-3 pts) and makes smarter pagination decisions
299
- - **T1/T2/T3/T7 hit near-perfect 98.7** — the LLM correctly handles pagination, dedup, and totals filtering
300
- - **T4/T5 remain hardest** (83-84 pts) — robustness scoring requires explicit log evidence of retry/backoff that the infrastructure handles silently
301
- - **T8 mixed faults scores 97.3** — the LLM successfully handles both rate-limit retry AND cross-page deduplication simultaneously
302
- - **Average 94.4 vs baseline 93.0** — the gap is small because the baseline is already strong; GRPO gradient training would push this further by optimizing the LLM's tool sequencing decisions
303
-
304
- ### What the Scoring Reveals
305
-
306
- The rule-based baseline loses points on two dimensions:
307
-
308
- - **Observability**: the run log requires specific structured entries (`task_id=`, `page=N`, `request=N`, `complete=true`); a naive agent that omits these loses up to 10 points
309
- - **Efficiency**: fault-injection tasks (T4/T5/T6) require one or more retries, consuming extra request budget against the task baseline
310
-
311
- The LLM agent improves on Observability (naturally verbose logs) but sometimes regresses on Efficiency (unnecessary fetches). This trade-off is exactly what GRPO gradient training would optimize: with a local HuggingFace model, the clipped surrogate loss would push the policy toward efficient tool sequences while the KL penalty prevents forgetting correct pagination behavior.
312
-
313
  ---
314
 
315
- ## Green Agent Wrapper
316
 
317
- ComtradeBench includes a **Green Agent**, the evaluator component for the AgentBeats competition platform. The Green Agent implements the A2A (Agent-to-Agent) JSON-RPC 2.0 protocol and serves as the referee that Purple agents compete against.
318
 
319
- ```
320
- green/
321
- ├── agent_a2a.py ← A2A server (receives eval requests, sends tasks, scores output)
322
- ├── judge_green.py ← 6-dimension scoring engine
323
- ├── tasks_green.py ← Task definitions with fault injection configs
324
- └── Dockerfile ← Containerized for AgentBeats deployment
325
- ```
326
 
327
- The Green Agent:
328
- 1. Receives an evaluation request from AgentBeats
329
- 2. Sends tasks (T1-T10) to the Purple agent via A2A protocol
330
- 3. Collects the Purple agent's data output
331
- 4. Scores it using the same 6-dimension judge used in training
332
- 5. Reports results to the leaderboard
333
 
334
- This enables **any team's Purple agent** to be evaluated against ComtradeBench — making the environment a reusable benchmark for the broader community.
335
 
336
  ---
337
 
338
- ## OpenEnv Integration
339
 
340
- The environment follows the OpenEnv contract exactly:
341
 
342
- ```python
343
- class ComtradeEnvironment(MCPEnvironment):
344
- SUPPORTS_CONCURRENT_SESSIONS: bool = True # parallel training episodes
345
 
346
- def reset(self, task_id=None, seed=None, **kwargs) -> Observation: ...
347
- def _step_impl(self, action: Action, **kwargs) -> Observation: ...
348
- ```
349
-
350
- Agents interact via MCP tools, never via direct method calls. The reward is computed entirely inside the environment — the agent cannot inspect or manipulate the judge. This aligns with OpenEnv's core invariant: *rewards inside environment, not external*.
351
-
352
- The mock service starts as an embedded subprocess on `reset()` and is torn down with the environment, making each Docker container self-contained.
353
 
354
  ---
355
 
356
- ## Running the Environment
357
-
358
- ```bash
359
- # Clone the repo (environment + agent are in one repo)
360
- git clone https://github.com/yonghongzhang-io/comtrade-openenv
361
- cd comtrade-openenv
362
-
363
- # Install OpenEnv framework
364
- pip install openenv-core[core]
365
-
366
- # Rule-based smoke test — no LLM, no external server needed
367
- # (InProcessEnvClient auto-starts mock service in-process)
368
- python agent/smoke_test.py --task T1_single_page
369
- python agent/smoke_test.py --task T7_totals_trap
370
- python agent/smoke_test.py --task T8_mixed_faults
371
- python agent/smoke_test.py --task T9_adaptive_adversary
372
- python agent/smoke_test.py --task T10_multi_agent_coop
373
-
374
- # Run unit + integration tests
375
- pip install pytest
376
- python -m pytest agent/tests/ -v
377
 
378
- # Train with GRPO via local Ollama/vLLM (rollout-only, no GPU required)
379
- python agent/train_grpo.py \
380
- --api-url http://localhost:11434/v1 \
381
- --api-model qwen2.5:7b \
382
- --num-iterations 200 \
383
- --max-workers 4
384
 
385
- # Train with gradient updates (requires GPU + HuggingFace model)
386
- python agent/train_grpo.py \
387
- --hf-model Qwen/Qwen2.5-7B-Instruct \
388
- --num-iterations 200 \
389
- --output-dir ./checkpoints
390
- ```
391
 
392
- No external OpenEnv server is needed: `InProcessEnvClient` wraps the environment directly, with parallel rollout support via `ThreadPoolExecutor`.
393
 
394
  ---
395
 
396
- ## Design Decisions and Lessons Learned
397
 
398
- **Stateless mock service is essential.** The first implementation used per-session state in the mock service, which caused race conditions when multiple agents ran concurrently during GRPO rollouts. Switching to stateless `/api/data` with per-task `_API_STATE` dictionaries eliminated the issue entirely.
399
 
400
- **Three tools is the right abstraction.** Early prototypes had separate tools for setting query parameters and for pagination. Collapsing to `get_task_info` + `fetch_page` + `submit_results` reduced token overhead and made the tool-use pattern easier for the model to learn.
401
 
402
- **Protocol-level dedup beats prompt-level dedup.** Telling the model "deduplicate records" in the system prompt is fragile: the model may not track state correctly across long contexts. Instead, the agent loop handles dedup mechanically using a Python dict keyed by primary key. The model only needs to decide *when* to call which tool.
403
 
404
- **Observability scoring drives good agent habits.** The 10-point observability dimension, which requires structured log entries (`task_id=`, `page=N`, `request=N`, `complete=true`), incentivizes the agent to maintain explicit execution state. This is valuable beyond scoring: structured logs are how real ETL pipelines are debugged.
405
 
406
  ---
407
 
408
- ## Links
409
 
410
- - **Environment**: [github.com/yonghongzhang-io/comtrade-openenv](https://github.com/yonghongzhang-io/comtrade-openenv)
411
- - **HF Space**: [huggingface.co/spaces/yonghongzhang/comtrade-env](https://huggingface.co/spaces/yonghongzhang/comtrade-env)
412
- - **Full competition repo**: [github.com/yonghongzhang-io/AIAgentCompetition-phase2](https://github.com/yonghongzhang-io/AIAgentCompetition-phase2)
413
- - **OpenEnv framework**: [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)
 
1
+ # ComtradeBench: An OpenEnv Benchmark for Reliable LLM Tool-Use
 
2
 
3
  **AgentBeats Phase 2 — OpenEnv Challenge Submission**
4
+ Author: MateFin | [GitHub](https://github.com/yonghongzhang-io/comtrade-openenv) | [HF Space](https://huggingface.co/spaces/yonghongzhang/comtrade-env)
5
 
6
  ---
7
 
8
+ ## Agents should be judged by whether they finish the job
 
 
9
 
10
+ Large language models are often evaluated on what they can say.
11
+ Real agents, however, are judged by whether they can finish the job when tools fail.
12
 
13
+ In practical API workflows, failure rarely comes from language alone. Pages drift. Duplicate rows appear across requests. Rate limits interrupt execution. Transient server errors force retries. Summary rows contaminate aggregates. Budgets make brute-force strategies impossible.
14
 
15
+ These are not unusual edge cases. They are normal operating conditions for production systems.
16
 
17
+ ComtradeBench is an OpenEnv benchmark designed to measure exactly this problem: can an LLM agent execute a multi-step API workflow reliably under realistic failure modes?
 
 
 
 
 
18
 
19
  ---
20
 
21
+ ## Why this benchmark matters
 
22
 
23
+ Many current evaluations still focus on final answers, clean tool calls, or static environments. But deployed agents fail for more operational reasons:
24
 
25
+ - they miss pages
26
+ - they retry incorrectly
27
+ - they double-count duplicate rows
28
+ - they leak malformed summary records into outputs
29
+ - they waste budget on redundant calls
30
+ - they recover silently, without leaving an auditable trace
31
 
32
+ These are execution failures, not just reasoning failures.
 
 
 
 
 
 
 
 
 
33
 
34
+ If we want useful agents, we need benchmarks that measure reliable task completion under imperfect conditions, not only answer quality in idealized settings.
 
35
 
36
  ---
37
 
38
+ ## What ComtradeBench is
 
 
39
 
40
+ ComtradeBench is an OpenEnv-native benchmark and training environment for reliable tool-use. It is instantiated through a paginated trade-data retrieval workflow, but the underlying problem is broader: robust multi-step API execution under shifting, imperfect, and partially adversarial conditions.
41
 
42
+ The environment asks an agent to retrieve, clean, and submit records from a paginated API while handling realistic operational challenges such as:
 
 
 
 
 
 
 
 
 
 
43
 
44
+ - pagination drift
45
+ - duplicate records across pages
46
+ - transient 429 and 500 errors
47
+ - misleading summary rows
48
+ - mixed-fault episodes
49
+ - constrained request budgets
50
 
51
+ The goal is not to test whether the agent can describe the workflow. The goal is to test whether it can execute it correctly, completely, efficiently, and robustly.
 
 
52
 
53
+ ---
54
 
55
+ ## Environment design
56
 
57
+ Each episode gives the agent a parameterized retrieval task and a limited request budget. The agent must:
 
 
 
 
 
58
 
59
+ 1. Read the task specification
60
+ 2. Fetch all necessary pages
61
+ 3. Deduplicate records correctly
62
+ 4. Filter out contaminating totals rows
63
+ 5. Submit a clean final result with an execution trace
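The five steps above can be sketched as a single collection loop. This is a minimal illustration, not the repository's agent: it assumes the response shapes and primary-key fields quoted in the earlier README text of this diff (`rows`, `has_more`, `status`, `retry`, `is_total`), and the names `collect_clean_rows` and `make_fake_api`, the retry limit, and the backoff values are hypothetical.

```python
import time

# Primary-key fields as listed in the earlier README text.
PK_FIELDS = ("year", "reporter", "partner", "flow", "hs", "record_id")


def collect_clean_rows(fetch_page, max_requests=100, max_retries=3, backoff_s=2.0):
    """Fetch all pages, retry transient faults, dedupe by primary key, drop totals rows."""
    rows_by_pk = {}          # dict keyed by primary key gives dedup for free
    log = []                 # execution trace submitted with the result
    requests_used = 0
    page = 1
    while True:
        attempts = 0
        while True:
            if requests_used >= max_requests:
                raise RuntimeError("request budget exhausted")
            result = fetch_page(page)
            requests_used += 1
            log.append(f"page={page} request={requests_used}")
            if result.get("status") in (429, 500) or result.get("retry"):
                attempts += 1
                if attempts > max_retries:
                    raise RuntimeError(f"page {page} failed after {max_retries} retries")
                log.append(f"retry page={page} status={result.get('status')}")
                time.sleep(backoff_s * attempts)   # linear backoff (assumed policy)
                continue
            break
        for row in result["rows"]:
            if row.get("is_total"):
                continue                            # skip summary rows
            rows_by_pk[tuple(row.get(k) for k in PK_FIELDS)] = row
        if not result.get("has_more"):
            break
        page += 1
    log.append("complete=true")
    return list(rows_by_pk.values()), log, requests_used


def make_fake_api():
    """Hypothetical two-page API: page 2 returns 429 once, then succeeds."""
    calls = {"page2": 0}

    def row(i, is_total=False):
        return {"year": 2022, "reporter": "USA", "partner": "CHN", "flow": "X",
                "hs": "85", "record_id": i, "is_total": is_total}

    def fetch_page(page):
        if page == 1:
            return {"rows": [row(1), row(2), row(2)], "page": 1,
                    "total_pages": 2, "has_more": True}
        calls["page2"] += 1
        if calls["page2"] == 1:
            return {"status": 429, "retry": True}
        return {"rows": [row(2), row(3), row(0, is_total=True)], "page": 2,
                "total_pages": 2, "has_more": False}

    return fetch_page


rows, log, used = collect_clean_rows(make_fake_api(), backoff_s=0.0)
print(len(rows), used)  # 3 clean rows from 3 requests
```

Keeping the dedup and totals filter in the loop, rather than asking the model to do it, follows the "protocol-level dedup" choice described in the earlier README text.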
 
 
 
64
 
65
+ The benchmark is structured as a curriculum of ten tasks, moving from baseline correctness to progressively harder reliability challenges — including mixed faults, adaptive fault escalation mid-episode, and tighter resource constraints.
66
 
67
+ This progression matters. It allows us to separate distinct capabilities:
68
 
69
+ - baseline correctness
70
+ - pagination handling
71
+ - data hygiene
72
+ - retry behavior under transient errors
73
+ - adaptability when conditions shift mid-episode
74
+ - efficiency under constrained budgets
75
 
76
+ Among these, the adaptive adversary task (T9) is, to our knowledge, among the earliest OpenEnv-style tasks to model within-episode fault escalation explicitly — where the environment becomes harder as the agent makes progress, rather than presenting a fixed challenge throughout.
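The earlier README text in this diff describes T9's escalation concretely: a 5% duplicate rate at the start, plus 3% after each successful fetch, HTTP 429 errors after page 3, and totals rows after page 4. A rough sketch of such a schedule is shown below. The function name `t9_fault_profile`, the returned keys, the cap at 1.0, and the reading of "after page N" as "once N pages have been fetched" are assumptions made for illustration, not the environment's actual code.

```python
def t9_fault_profile(pages_fetched: int) -> dict:
    """Return the fault settings that apply after `pages_fetched` successful fetches.

    Hypothetical sketch of within-episode escalation; thresholds follow the
    numbers quoted in the earlier README text.
    """
    if pages_fetched < 0:
        raise ValueError("pages_fetched must be non-negative")
    return {
        # 5% base duplicate rate, +3 percentage points per successful fetch
        "duplicate_rate": min(1.0, 0.05 + 0.03 * pages_fetched),
        # rate limiting begins once three pages have been served
        "inject_429": pages_fetched >= 3,
        # summary rows begin once four pages have been served
        "inject_totals": pages_fetched >= 4,
    }


for n in range(6):
    print(n, t9_fault_profile(n))
```

Because the profile depends only on the agent's own progress, a policy that is tuned to the first page's conditions will face different conditions later in the same episode.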
 
 
77
 
78
  ---
79
 
80
+ ## Why OpenEnv
 
 
81
 
82
+ We built ComtradeBench on OpenEnv because this benchmark is meant to be more than a one-off simulator.
83
 
84
+ OpenEnv gives us a standard environment interface, reproducible execution, and clean integration with evaluation and post-training workflows. That makes ComtradeBench usable both as a benchmark and as a training substrate for improving agent reliability.
85
 
86
+ Our goal is not only to score agents, but to provide a reusable environment where robustness can be studied systematically — and where agents can be trained against the same conditions they are evaluated on.
 
 
87
 
88
+ ---
89
 
90
+ ## Scoring what actually matters
 
 
 
 
 
 
 
91
 
92
+ ComtradeBench uses structured evaluation rather than a binary success/failure label. Agents are scored across six dimensions:
 
 
93
 
94
+ | Dimension | Weight | What it measures |
95
+ |-----------|--------|-----------------|
96
+ | Correctness | 30% | All expected rows present with correct field values |
97
+ | Completeness | 15% | Zero missing records |
98
+ | Robustness | 15% | Correct fault handling with logged evidence |
99
+ | Efficiency | 15% | Request count relative to task-optimal minimum |
100
+ | Data Quality | 15% | No duplicates or leaked totals rows |
101
+ | Observability | 10% | Structured execution trace in the run log |
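To make the weighting concrete, here is a small sketch of how six per-dimension results could combine into a 0–100 score and a 0.0–1.0 reward. The weights come from the table above and the score-to-reward scaling from the earlier README text; the function name `total_score` and the idea of passing each dimension as a fraction in [0, 1] are assumptions, and the judge's governance caps are deliberately left out.

```python
# Weights from the scoring table (points out of 100).
WEIGHTS = {
    "correctness": 30,
    "completeness": 15,
    "robustness": 15,
    "efficiency": 15,
    "data_quality": 15,
    "observability": 10,
}


def total_score(fractions: dict) -> tuple:
    """Combine per-dimension fractions in [0, 1] into (score 0-100, reward 0-1).

    Hypothetical sketch: the real judge also applies caps and penalties.
    """
    missing = set(WEIGHTS) - set(fractions)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    score = sum(WEIGHTS[d] * min(max(fractions[d], 0.0), 1.0) for d in WEIGHTS)
    return score, score / 100.0


# Correct, complete data but no logged retry evidence and a slightly thin trace.
score, reward = total_score({
    "correctness": 1.0,
    "completeness": 1.0,
    "robustness": 0.0,
    "efficiency": 1.0,
    "data_quality": 1.0,
    "observability": 0.8,
})
print(score, reward)
```

In this example the data is entirely correct, yet the missing Robustness evidence alone removes 15 points, which is the pattern reported for the fault-injection tasks.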
102
 
103
+ This matters because reliable execution is multi-dimensional.
 
 
 
 
104
 
105
+ An agent may retrieve correct-looking output while missing pages. Another may finish the task but waste budget. A third may recover from faults but leave no usable trace of what happened. These behaviors are not equivalent, and the benchmark does not treat them as equivalent.
106
 
107
+ The Observability dimension is especially important. In real systems, agents must not only act correctly — they must also leave execution traces that are inspectable and auditable. Rewarding this behavior during training shapes better habits for deployment.
 
 
 
 
 
 
108
 
109
  ---
110
 
111
+ ## Baselines and results
112
 
113
+ A rule-based baseline agent achieves an average score of **96.8 / 100** across all ten tasks, confirming the environment is well-calibrated and solvable. The deterministic baseline's only consistent gap is on fault-injection tasks (T4, T5), where the Robustness dimension requires explicit logged evidence of retry behavior — correct data alone is not sufficient.
114
 
115
+ An LLM agent evaluated with Moonshot V1-8K achieves an average score of **94.4 / 100** on tasks T1–T8. Because this average covers eight tasks and the baseline average covers ten, the two figures are not directly comparable. The LLM outperforms the rule-based baseline on Observability (language models tend to generate more informative execution traces) but scores lower on fault tasks where retry logging is required. This directional finding suggests that GRPO training on the fault tasks could improve overall scores by optimizing log-writing behavior alongside tool-sequencing decisions.
 
116
 
117
  ![Benchmark Results](benchmark_results.png)
118
 
119
+ We also ran 8 iterations of GRPO-style rollouts with group-relative advantage normalization. The training signal is reward-only — no human labels, no reward model. Mean reward exceeded the rule-based baseline in 6 of 8 iterations.
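Group-relative advantage normalization can be illustrated in a few lines. The formula `A_i = (r_i - mean_group) / (std_group + 1e-8)` is the one given in the earlier README text; the function name and the use of the population standard deviation are assumptions for this sketch, and the reward values are made-up inputs rather than measured results.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize episode rewards within one rollout group.

    Each advantage is the reward's distance from the group mean, measured in
    group standard deviations; eps guards against a zero-variance group.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std (assumed; a sample std would also be plausible)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts of the same task (illustrative rewards only).
advantages = group_relative_advantages([0.98, 0.83, 0.95, 0.84])
print([round(a, 3) for a in advantages])
```

Episodes that beat their group's mean get positive advantages and the rest get negative ones, so no separate reward model or value function is needed to produce a learning signal.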
 
 
120
 
121
  ![Training Curve](training_curve.png)
122
 
123
  ---
124
 
125
+ ## What this benchmark reveals
126
 
127
+ ComtradeBench is designed to expose a gap that clean evaluations often miss: agents can appear capable in idealized settings while remaining brittle in the face of operational noise.
128
 
129
+ In our setting, the hardest problems are not usually "knowing what the API is." They are:
 
 
 
 
 
 
130
 
131
+ - continuing correctly after an interruption
132
+ - maintaining data integrity across many pages
133
+ - adapting when the environment becomes less cooperative mid-episode
134
+ - balancing coverage against cost
 
 
135
 
136
+ This is where reliable agents differ from merely fluent ones.
137
 
138
  ---
139
 
140
+ ## Benchmark and training substrate
141
 
142
+ ComtradeBench is not just an evaluation harness. It is also built to support agent improvement.
143
 
144
+ The environment ships with reproducible components for benchmarking, baseline comparison, and reward-based training. That makes it useful for studying not only how agents fail, but also which training signals improve reliability.
 
 
145
 
146
+ This is an intentional design choice. If robust tool-use is a real bottleneck for agentic AI, then we need environments that can both measure and train that capability — with the same conditions present in evaluation and in training.
 
 
 
 
 
 
147
 
148
  ---
149
 
150
+ ## Open source and reproducible
 
151
 
152
+ ComtradeBench is fully open source. The environment, evaluation code, and training pipeline are all public and designed for reuse.
 
 
 
 
 
153
 
154
+ All benchmark data is generated procedurally from a seeded PRNG — no external fixtures, no live API dependencies. Any result is fully reproducible from a task ID and a random seed.
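The reproducibility claim rests on a simple pattern: derive a private random generator from the task ID and seed, and draw all data from it. The sketch below shows that pattern only. The function name `generate_rows`, the field names, and the value ranges are hypothetical and do not reflect the benchmark's real schema or generator.

```python
import random


def generate_rows(task_id: str, seed: int, n_rows: int = 5) -> list:
    """Generate synthetic rows deterministically from (task_id, seed).

    A dedicated random.Random instance avoids touching the global RNG, so
    concurrent episodes cannot disturb each other's data.
    """
    rng = random.Random(f"{task_id}:{seed}")  # string seeds are hashed deterministically
    return [
        {"record_id": i, "trade_value": rng.randint(1_000, 1_000_000)}
        for i in range(n_rows)
    ]


first = generate_rows("T3_duplicates", seed=42)
second = generate_rows("T3_duplicates", seed=42)
print(first == second)  # True: same task ID and seed give identical data
```

Using an isolated generator per episode, rather than the module-level `random` functions, is what keeps parallel rollouts reproducible.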
 
 
 
 
 
155
 
156
+ The environment runs in-process with no external server required, deploys as a Docker container for evaluation, and integrates directly with the AgentBeats community evaluation platform via an A2A Green Agent wrapper.
157
 
158
  ---
159
 
160
+ ## Conclusion
161
 
162
+ ComtradeBench focuses on a simple but under-measured question:
163
 
164
+ > Can an agent still finish the job when the API fights back?
165
 
166
+ That question matters far beyond trade data. It applies to any agent expected to operate against real interfaces with pagination, retries, noisy outputs, and resource limits.
167
 
168
+ If we want more reliable agents, we need environments that reward reliability directly. That is the role ComtradeBench is designed to play.
169
 
170
  ---
171
 
172
+ **Links:**
173
 
174
+ - Environment: [github.com/yonghongzhang-io/comtrade-openenv](https://github.com/yonghongzhang-io/comtrade-openenv)
175
+ - HF Space: [huggingface.co/spaces/yonghongzhang/comtrade-env](https://huggingface.co/spaces/yonghongzhang/comtrade-env)
176
+ - OpenEnv framework: [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)