Commit e8700e9
Parent(s): e42edee
Rewrite blog: reposition as reliable tool-use benchmark
README.md
CHANGED
|
@@ -1,413 +1,176 @@
|
|
| 1 |
-
-
|
| 2 |
-
title: "ComtradeBench: An Adversarial Tool-Use Benchmark for Agentic RL"
|
| 3 |
-
emoji: 📊
|
| 4 |
-
colorFrom: indigo
|
| 5 |
-
colorTo: gray
|
| 6 |
-
tags:
|
| 7 |
-
- openenv
|
| 8 |
-
- rl-environment
|
| 9 |
-
- agentbeats
|
| 10 |
-
- grpo
|
| 11 |
-
- llm-agent
|
| 12 |
-
- mcp
|
| 13 |
-
- adversarial
|
| 14 |
-
- tool-use
|
| 15 |
-
- competition
|
| 16 |
-
license: mit
|
| 17 |
-
---
|
| 18 |
-
|
| 19 |
-
# ComtradeBench: An Adversarial Tool-Use Benchmark for Agentic RL
|
| 20 |
|
| 21 |
**AgentBeats Phase 2 — OpenEnv Challenge Submission**
|
| 22 |
-
Author: MateFin | [GitHub](https://github.com/yonghongzhang-io/comtrade-openenv) | [HF Space](https://huggingface.co/spaces/yonghongzhang/comtrade-env)
|
| 23 |
|
| 24 |
---
|
| 25 |
|
| 26 |
-
##
|
| 27 |
-
|
| 28 |
-
The next frontier in LLM post-training is **agentic tool-use under adversarial conditions**. Today's agents can call APIs in clean sandboxes — but real-world APIs fight back. They paginate unpredictably. They rate-limit aggressively. They return duplicate data across page boundaries. They inject misleading summary rows. They reorder results non-deterministically between identical calls.
|
| 29 |
|
| 30 |
-
|
| 31 |
|
| 32 |
-
|
| 33 |
|
| 34 |
-
|
| 35 |
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
1. **Distribution shift within episodes**: T9 (Adaptive Adversary) changes fault intensity mid-episode. This is the first OpenEnv benchmark to test **non-stationary environment dynamics** — a critical open problem in RL.
|
| 39 |
-
2. **Multi-dimensional reward**: 6 scoring dimensions force the agent to balance competing objectives (correctness vs efficiency vs observability), unlike binary success/fail benchmarks.
|
| 40 |
-
3. **Reproducible and concurrent**: Seeded RNG + episode isolation enables deterministic, parallel GRPO training — directly compatible with TRL and torchforge.
|
| 41 |
-
4. **Community-reusable**: Any researcher can deploy ComtradeBench and evaluate their own agent against our Green Agent via A2A protocol.
|
| 42 |
|
| 43 |
---
|
| 44 |
|
| 45 |
-
##
|
| 46 |
-
|
| 47 |
-
### The Task
|
| 48 |
-
|
| 49 |
-
The agent is given a trade data query (reporter country, partner country, trade flow, HS product code, year). It must:
|
| 50 |
-
|
| 51 |
-
1. Discover pagination bounds via the API
|
| 52 |
-
2. Fetch all pages until `has_more=False`
|
| 53 |
-
3. Deduplicate records by primary key `(year, reporter, partner, flow, hs, record_id)`
|
| 54 |
-
4. Drop summary rows (`is_total=true`)
|
| 55 |
-
5. Submit a JSONL file with clean data + metadata + execution log
|
| 56 |
-
|
| 57 |
-
The agent has a budget of 100 requests per episode.
|
| 58 |
-
|
| 59 |
-
### Three MCP Tools
|
| 60 |
-
|
| 61 |
-
The environment exposes exactly three tools via the Model Context Protocol (MCP):
|
| 62 |
-
|
| 63 |
-
```
|
| 64 |
-
get_task_info()
|
| 65 |
-
→ Returns task parameters, mock service URL, and request budget.
|
| 66 |
-
|
| 67 |
-
fetch_page(page: int, page_size: int = 500)
|
| 68 |
-
→ Fetches one page. Returns {rows, page, total_pages, has_more}.
|
| 69 |
-
On fault: {status: 429|500, retry: true}
|
| 70 |
-
|
| 71 |
-
submit_results(data_jsonl, metadata_json, run_log)
|
| 72 |
-
→ Scores the submission. Returns {reward, score, breakdown, errors}.
|
| 73 |
-
```
|
| 74 |
|
| 75 |
-
|
| 76 |
|
| 77 |
-
|
| 78 |
|
| 79 |
-
|
| 80 |
-
|------|---------------|---------------|------------|
|
| 81 |
-
| T1 | None | Schema validation, baseline correctness | Easy |
|
| 82 |
-
| T2 | Pagination only | Multi-page merge (2,345 rows across 5+ pages) | Easy |
|
| 83 |
-
| T3 | 8% within-page + 3% cross-page duplicates | Primary-key deduplication | Medium |
|
| 84 |
-
| T4 | HTTP 429 on page 2 | Backoff + retry without data loss | Medium |
|
| 85 |
-
| T5 | HTTP 500 on page 2 | Transient error handling | Medium |
|
| 86 |
-
| T6 | Non-deterministic page ordering | Canonicalization + dedup under drift | Hard |
|
| 87 |
-
| T7 | `is_total=true` summary rows mixed in | Totals-trap filtering | Hard |
|
| 88 |
-
| T8 | 429 rate-limit + cross-page duplicates | Both retry AND dedup simultaneously | Hard |
|
| 89 |
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
### Novel Tasks — Beyond Static Benchmarks
|
| 93 |
-
|
| 94 |
-
ComtradeBench goes beyond static fault injection with two novel task types that no existing RL benchmark offers:
|
| 95 |
-
|
| 96 |
-
**T9: Adaptive Adversary** — The environment observes the agent's progress and *dynamically escalates* fault intensity mid-episode. Initial pages have 5% duplicate rate; each successful fetch increases it by 3%. After page 3, the environment starts injecting HTTP 429 errors. After page 4, totals rows appear. This creates a **distribution shift within a single episode** — the agent must continuously adapt its strategy rather than relying on a fixed policy. This models real-world API degradation where services throttle heavy consumers progressively.
|
| 97 |
-
|
| 98 |
-
**T10: Constrained Budget Stress** — A single agent runs under a halved request budget (50 instead of 100). It must avoid redundant fetches while still achieving complete page coverage and clean deduplication. This keeps the benchmark stable for the current single-agent training stack while preserving strong pressure on efficiency.
|
| 99 |
-
|
| 100 |
-
These novel tasks transform ComtradeBench from a static benchmark into a **dynamic, adaptive training environment** that challenges both single-agent robustness (T9) and constrained-budget policy quality (T10).
|
| 101 |
-
|
| 102 |
-
### Mock Service Architecture
|
| 103 |
-
|
| 104 |
-
The embedded mock service is a FastAPI application with per-task fault injection:
|
| 105 |
-
|
| 106 |
-
```
|
| 107 |
-
comtrade_env/
|
| 108 |
-
├── server/
|
| 109 |
-
│ ├── comtrade_env_environment.py ← MCPEnvironment (3 MCP tools)
|
| 110 |
-
│ ├── tasks.py ← Task definitions T1-T10
|
| 111 |
-
│ ├── judge.py ← Scoring engine
|
| 112 |
-
│ └── mock_service/
|
| 113 |
-
│ └── app.py ← Stateless /api/data with fault injection
|
| 114 |
-
```
|
| 115 |
-
|
| 116 |
-
The mock service is **stateless**: each request reconstructs the response from task configuration + request parameters. This makes the environment reproducible and concurrent-safe — multiple agents can run simultaneously without shared state corruption.
|
| 117 |
-
|
| 118 |
-
### Scoring (0–100 → reward 0.0–1.0)
|
| 119 |
-
|
| 120 |
-
The judge evaluates six dimensions:
|
| 121 |
-
|
| 122 |
-
| Dimension | Weight | What it measures |
|
| 123 |
-
|-----------|--------|-----------------|
|
| 124 |
-
| Correctness | 30 | Row-level accuracy (content + count) |
|
| 125 |
-
| Completeness | 15 | Zero missing records |
|
| 126 |
-
| Robustness | 15 | Correct fault handling (429/500 retry) |
|
| 127 |
-
| Efficiency | 15 | Request count vs. task baseline |
|
| 128 |
-
| Data Quality | 15 | No duplicates leaked, no totals rows |
|
| 129 |
-
| Observability | 10 | Log contains `task_id=`, `page=`, `request=`, `complete=` |
|
| 130 |
-
|
| 131 |
-
**Governance rules prevent gaming:**
|
| 132 |
-
- Efficiency and Observability points are capped at 50% if Correctness < 70%
|
| 133 |
-
- Efficiency points require 100% Completeness — you cannot skip pages and claim efficiency
|
| 134 |
-
- Execution time > 45s incurs a penalty (max 3 points)
|
| 135 |
|
| 136 |
---
|
| 137 |
|
| 138 |
-
##
|
| 139 |
-
|
| 140 |
-
### Agentic Loop
|
| 141 |
|
| 142 |
-
|
| 143 |
|
| 144 |
-
|
| 145 |
-
SYSTEM_PROMPT + task description
|
| 146 |
-
↓
|
| 147 |
-
LLM generates <tool_call>{...}</tool_call>
|
| 148 |
-
↓
|
| 149 |
-
Environment executes tool
|
| 150 |
-
↓
|
| 151 |
-
<tool_result>{...}</tool_result> appended to context
|
| 152 |
-
↓
|
| 153 |
-
repeat until submit_results called
|
| 154 |
-
```
|
| 155 |
|
| 156 |
-
|
| 157 |
|
| 158 |
-
|
| 159 |
-
<tool_call>{"name": "fetch_page", "arguments": {"page": 1}}</tool_call>
|
| 160 |
-
```
|
| 161 |
|
| 162 |
-
|
| 163 |
|
| 164 |
-
##
|
| 165 |
|
| 166 |
-
|
| 167 |
-
# Retry on transient faults
|
| 168 |
-
if tool_result.get("status") in (429, 500) or tool_result.get("retry"):
|
| 169 |
-
wait = 2 * (retry_count + 1)
|
| 170 |
-
time.sleep(wait)
|
| 171 |
-
tool_result = self.env.call_tool(tool_name, tool_args)
|
| 172 |
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
("year", "reporter", "partner", "flow", "hs", "record_id"))
|
| 179 |
-
collected_rows[pk] = row # dict assignment = automatic dedup
|
| 180 |
-
```
|
| 181 |
|
| 182 |
-
|
| 183 |
|
| 184 |
-
|
| 185 |
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
|
| 190 |
-
|
| 191 |
-
backend = LLMBackend.from_api("http://localhost:11434/v1", "qwen2.5:7b")
|
| 192 |
-
```
|
| 193 |
|
| 194 |
---
|
| 195 |
|
| 196 |
-
##
|
| 197 |
-
|
| 198 |
-
We implement **Group Relative Policy Optimization** (GRPO, from DeepSeekMath) to train the agent purely from environment reward signals — no human-labeled data needed.
|
| 199 |
|
| 200 |
-
|
| 201 |
|
| 202 |
-
|
| 203 |
|
| 204 |
-
|
| 205 |
-
- Naturally handles sparse rewards (most steps get reward only at episode end)
|
| 206 |
-
- Scales to long multi-turn trajectories without value function estimation
|
| 207 |
|
| 208 |
-
|
| 209 |
|
| 210 |
-
|
| 211 |
-
def grpo_loss(log_probs, old_log_probs, ref_log_probs, advantages,
|
| 212 |
-
clip_eps=0.2, kl_coeff=0.04):
|
| 213 |
-
"""Clipped surrogate + reverse-KL penalty (DeepSeekMath)."""
|
| 214 |
-
# Policy ratio: r_t = π_new / π_old
|
| 215 |
-
ratio = torch.exp(log_probs - old_log_probs)
|
| 216 |
-
clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
|
| 217 |
-
surrogate = torch.min(ratio * advantages, clipped * advantages).mean()
|
| 218 |
|
| 219 |
-
|
| 220 |
-
log_ratio_ref = log_probs - ref_log_probs
|
| 221 |
-
kl = (torch.exp(log_ratio_ref) - 1 - log_ratio_ref).mean()
|
| 222 |
|
| 223 |
-
|
| 224 |
-
|
| 225 |
|
| 226 |
-
|
| 227 |
-
1. **Rollout phase**: run `G=4` episodes per task using current policy
|
| 228 |
-
2. **Advantage computation**: `A_i = (r_i - mean_group) / (std_group + 1e-8)`
|
| 229 |
-
3. **Policy update**: minimize GRPO loss over all trajectory tokens
|
| 230 |
-
4. **Checkpoint**: save every 50 iterations; monitor per-task reward
|
| 231 |
|
| 232 |
-
|
| 233 |
|
| 234 |
-
|
| 235 |
-
|-----------|-------|-----------|
|
| 236 |
-
| `clip_eps` | 0.2 | Standard PPO clip; prevents large policy jumps |
|
| 237 |
-
| `kl_coeff` | 0.04 | Light KL penalty; allows exploration |
|
| 238 |
-
| `group_size` | 4 | 4 rollouts per task per iteration |
|
| 239 |
-
| `lr` | 1e-5 | Conservative for fine-tuning |
|
| 240 |
-
| `max_steps` | 30 | Sufficient for all T1-T10 tasks |
|
| 241 |
|
| 242 |
---
|
| 243 |
|
| 244 |
-
##
|
| 245 |
|
| 246 |
-
|
| 247 |
|
| 248 |
-
The
|
| 249 |
-
|
| 250 |
-
| Task | Score | Reward | Breakdown |
|
| 251 |
-
|------|-------|--------|-----------|
|
| 252 |
-
| T1 single page | 95.0 | 0.9500 | corr=30 comp=15 robu=12 effi=15 data=15 obs=8 |
|
| 253 |
-
| T2 multi-page | 98.0 | 0.9800 | corr=30 comp=15 robu=15 effi=15 data=15 obs=8 |
|
| 254 |
-
| T3 duplicates | 98.0 | 0.9800 | corr=30 comp=15 robu=15 effi=15 data=15 obs=8 |
|
| 255 |
-
| T4 rate-limit 429 | 83.0 | 0.8300 | corr=30 comp=15 robu=0 effi=15 data=15 obs=8 |
|
| 256 |
-
| T5 server error 500 | 83.7 | 0.8370 | corr=30 comp=15 robu=0 effi=15 data=15 obs=8.7 |
|
| 257 |
-
| T6 page drift | 94.3 | 0.9430 | corr=26.3 comp=15 robu=15 effi=15 data=15 obs=8 |
|
| 258 |
-
| T7 totals trap | 96.0 | 0.9600 | corr=28 comp=15 robu=15 effi=15 data=15 obs=8 |
|
| 259 |
-
| **Average** | **92.6** | **0.9257** | |
|
| 260 |
-
|
| 261 |
-
All scores from `inference.py --mode rule-based` (deterministic, no LLM, reproducible). Full breakdown available in `inference_results_baseline.json`.
|
| 262 |
-
|
| 263 |
-
### LLM Agent Results
|
| 264 |
-
|
| 265 |
-
We evaluated two LLM backends via the agentic loop described above: LLM decides tool sequencing, while the infrastructure handles dedup, retry, and submission.
|
| 266 |
-
|
| 267 |
-
**Moonshot V1-8K (Kimi) — full agentic loop, all 8 tasks:**
|
| 268 |
-
|
| 269 |
-
| Task | Score | Reward | Steps | vs Baseline |
|
| 270 |
-
|------|-------|--------|-------|-------------|
|
| 271 |
-
| T1 Single page | 98.7 | 0.987 | 3 | +3.7 |
|
| 272 |
-
| T2 Multi-page | 98.7 | 0.987 | 7 | +0.7 |
|
| 273 |
-
| T3 Duplicates | 98.7 | 0.987 | 5 | +0.7 |
|
| 274 |
-
| T4 Rate limit 429 | 83.7 | 0.837 | 5 | +0.7 |
|
| 275 |
-
| T5 Server error 500 | 84.3 | 0.843 | 5 | +0.6 |
|
| 276 |
-
| T6 Page drift | 94.7 | 0.947 | 5 | +0.4 |
|
| 277 |
-
| T7 Totals trap | 98.7 | 0.987 | 5 | +2.7 |
|
| 278 |
-
| T8 Mixed faults | 97.3 | 0.973 | 5 | +0.9 |
|
| 279 |
-
| **Average** | **94.4** | **0.944** | **5.0** | **+1.3** |
|
| 280 |
|
| 281 |

|
| 282 |
|
| 283 |
-
|
| 284 |
-
|
| 285 |
-
We ran 8 iterations of GRPO-style rollouts with group_size=2, sampling 2 random tasks per iteration. Each rollout is a full agentic episode with real LLM tool-calling decisions.
|
| 286 |
|
| 287 |

|
| 288 |
|
| 289 |
-
The left chart shows reward across iterations with min-max range and rolling average. The right chart shows per-task mean reward across all iterations where that task appeared. The orange dotted line marks the rule-based baseline (0.930).
|
| 290 |
-
|
| 291 |
-
Key observations:
|
| 292 |
-
- **Mean reward consistently above baseline** (0.930) in 6/8 iterations
|
| 293 |
-
- **Iterations with fault tasks (T4/T5) pull the mean down** — these are genuinely harder and require the agent to handle 429/500 errors gracefully
|
| 294 |
-
- **T8 mixed faults achieves 0.973** — demonstrating the LLM can handle combined rate-limit + dedup challenges
|
| 295 |
-
- **Per-task variance is low** (small error bars) — the agent's behavior is consistent across rollouts
|
| 296 |
-
|
| 297 |
-
Key findings:
|
| 298 |
-
- **LLM agent outperforms rule-based baseline on 8/8 tasks** — the LLM generates better structured logs (Observability +2-3 pts) and makes smarter pagination decisions
|
| 299 |
-
- **T1/T2/T3/T7 hit near-perfect 98.7** — the LLM correctly handles pagination, dedup, and totals filtering
|
| 300 |
-
- **T4/T5 remain hardest** (83-84 pts) — robustness scoring requires explicit log evidence of retry/backoff that the infrastructure handles silently
|
| 301 |
-
- **T8 mixed faults scores 97.3** — the LLM successfully handles both rate-limit retry AND cross-page deduplication simultaneously
|
| 302 |
-
- **Average 94.4 vs baseline 93.0** — the gap is small because the baseline is already strong; GRPO gradient training would push this further by optimizing the LLM's tool sequencing decisions
|
| 303 |
-
|
| 304 |
-
### What the Scoring Reveals
|
| 305 |
-
|
| 306 |
-
The rule-based baseline loses points on two dimensions:
|
| 307 |
-
|
| 308 |
-
- **Observability**: the run log requires specific structured entries (`task_id=`, `page=N`, `request=N`, `complete=true`); a naive agent that omits these loses up to 10 points
|
| 309 |
-
- **Efficiency**: fault-injection tasks (T4/T5/T6) require one or more retries, consuming extra request budget against the task baseline
|
| 310 |
-
|
| 311 |
-
The LLM agent improves on Observability (naturally verbose logs) but sometimes regresses on Efficiency (unnecessary fetches). This trade-off is exactly what GRPO gradient training would optimize: with a local HuggingFace model, the clipped surrogate loss would push the policy toward efficient tool sequences while the KL penalty prevents forgetting correct pagination behavior.
|
| 312 |
-
|
| 313 |
---
|
| 314 |
|
| 315 |
-
##
|
| 316 |
|
| 317 |
-
ComtradeBench
|
| 318 |
|
| 319 |
-
|
| 320 |
-
green/
|
| 321 |
-
├── agent_a2a.py ← A2A server (receives eval requests, sends tasks, scores output)
|
| 322 |
-
├── judge_green.py ← 6-dimension scoring engine
|
| 323 |
-
├── tasks_green.py ← Task definitions with fault injection configs
|
| 324 |
-
└── Dockerfile ← Containerized for AgentBeats deployment
|
| 325 |
-
```
|
| 326 |
|
| 327 |
-
|
| 328 |
-
|
| 329 |
-
|
| 330 |
-
|
| 331 |
-
4. Scores it using the same 6-dimension judge used in training
|
| 332 |
-
5. Reports results to the leaderboard
|
| 333 |
|
| 334 |
-
This
|
| 335 |
|
| 336 |
---
|
| 337 |
|
| 338 |
-
##
|
| 339 |
|
| 340 |
-
|
| 341 |
|
| 342 |
-
|
| 343 |
-
class ComtradeEnvironment(MCPEnvironment):
|
| 344 |
-
SUPPORTS_CONCURRENT_SESSIONS: bool = True # parallel training episodes
|
| 345 |
|
| 346 |
-
|
| 347 |
-
def _step_impl(self, action: Action, **kwargs) -> Observation: ...
|
| 348 |
-
```
|
| 349 |
-
|
| 350 |
-
Agents interact via MCP tools, never via direct method calls. The reward is computed entirely inside the environment — the agent cannot inspect or manipulate the judge. This aligns with OpenEnv's core invariant: *rewards inside environment, not external*.
|
| 351 |
-
|
| 352 |
-
The mock service starts as an embedded subprocess on `reset()` and is torn down with the environment, making each Docker container self-contained.
|
| 353 |
|
| 354 |
---
|
| 355 |
|
| 356 |
-
##
|
| 357 |
-
|
| 358 |
-
```bash
|
| 359 |
-
# Clone the repo (environment + agent are in one repo)
|
| 360 |
-
git clone https://github.com/yonghongzhang-io/comtrade-openenv
|
| 361 |
-
cd comtrade-openenv
|
| 362 |
-
|
| 363 |
-
# Install OpenEnv framework
|
| 364 |
-
pip install openenv-core[core]
|
| 365 |
-
|
| 366 |
-
# Rule-based smoke test — no LLM, no external server needed
|
| 367 |
-
# (InProcessEnvClient auto-starts mock service in-process)
|
| 368 |
-
python agent/smoke_test.py --task T1_single_page
|
| 369 |
-
python agent/smoke_test.py --task T7_totals_trap
|
| 370 |
-
python agent/smoke_test.py --task T8_mixed_faults
|
| 371 |
-
python agent/smoke_test.py --task T9_adaptive_adversary
|
| 372 |
-
python agent/smoke_test.py --task T10_multi_agent_coop
|
| 373 |
-
|
| 374 |
-
# Run unit + integration tests
|
| 375 |
-
pip install pytest
|
| 376 |
-
python -m pytest agent/tests/ -v
|
| 377 |
|
| 378 |
-
|
| 379 |
-
python agent/train_grpo.py \
|
| 380 |
-
--api-url http://localhost:11434/v1 \
|
| 381 |
-
--api-model qwen2.5:7b \
|
| 382 |
-
--num-iterations 200 \
|
| 383 |
-
--max-workers 4
|
| 384 |
|
| 385 |
-
|
| 386 |
-
python agent/train_grpo.py \
|
| 387 |
-
--hf-model Qwen/Qwen2.5-7B-Instruct \
|
| 388 |
-
--num-iterations 200 \
|
| 389 |
-
--output-dir ./checkpoints
|
| 390 |
-
```
|
| 391 |
|
| 392 |
-
|
| 393 |
|
| 394 |
---
|
| 395 |
|
| 396 |
-
##
|
| 397 |
|
| 398 |
-
|
| 399 |
|
| 400 |
-
|
| 401 |
|
| 402 |
-
|
| 403 |
|
| 404 |
-
|
| 405 |
|
| 406 |
---
|
| 407 |
|
| 408 |
-
|
| 409 |
|
| 410 |
-
-
|
| 411 |
-
-
|
| 412 |
-
-
|
| 413 |
-
- **OpenEnv framework**: [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)
|
| 1 |
+
# ComtradeBench: An OpenEnv Benchmark for Reliable LLM Tool-Use
|
| 2 |
|
| 3 |
**AgentBeats Phase 2 — OpenEnv Challenge Submission**
|
| 4 |
+
Author: MateFin | [GitHub](https://github.com/yonghongzhang-io/comtrade-openenv) | [HF Space](https://huggingface.co/spaces/yonghongzhang/comtrade-env)
|
| 5 |
|
| 6 |
---
|
| 7 |
|
| 8 |
+
## Agents should be judged by whether they finish the job
|
| 9 |
|
| 10 |
+
Large language models are often evaluated on what they can say.
|
| 11 |
+
Real agents, however, are judged by whether they can finish the job when tools fail.
|
| 12 |
|
| 13 |
+
In practical API workflows, failure rarely comes from language alone. Pages drift. Duplicate rows appear across requests. Rate limits interrupt execution. Transient server errors force retries. Summary rows contaminate aggregates. Budgets make brute-force strategies impossible.
|
| 14 |
|
| 15 |
+
These are not unusual edge cases. They are normal operating conditions for production systems.
|
| 16 |
|
| 17 |
+
ComtradeBench is an OpenEnv benchmark designed to measure exactly this problem: can an LLM agent execute a multi-step API workflow reliably under realistic failure modes?
|
| 18 |
|
| 19 |
---
|
| 20 |
|
| 21 |
+
## Why this benchmark matters
|
| 22 |
|
| 23 |
+
Many current evaluations still focus on final answers, clean tool calls, or static environments. But deployed agents fail for more operational reasons:
|
| 24 |
|
| 25 |
+
- they miss pages
|
| 26 |
+
- they retry incorrectly
|
| 27 |
+
- they double-count duplicate rows
|
| 28 |
+
- they leak malformed summary records into outputs
|
| 29 |
+
- they waste budget on redundant calls
|
| 30 |
+
- they recover silently, without leaving an auditable trace
|
| 31 |
|
| 32 |
+
These are execution failures, not just reasoning failures.
|
| 33 |
|
| 34 |
+
If we want useful agents, we need benchmarks that measure reliable task completion under imperfect conditions — not only answer quality in idealized settings.
|
| 35 |
|
| 36 |
---
|
| 37 |
|
| 38 |
+
## What ComtradeBench is
|
| 39 |
|
| 40 |
+
ComtradeBench is an OpenEnv-native benchmark and training environment for reliable tool-use. It is instantiated through a paginated trade-data retrieval workflow, but the underlying problem is broader: robust multi-step API execution under shifting, imperfect, and partially adversarial conditions.
|
| 41 |
|
| 42 |
+
The environment asks an agent to retrieve, clean, and submit records from a paginated API while handling realistic operational challenges such as:
|
| 43 |
|
| 44 |
+
- pagination drift
|
| 45 |
+
- duplicate records across pages
|
| 46 |
+
- transient 429 and 500 errors
|
| 47 |
+
- misleading summary rows
|
| 48 |
+
- mixed-fault episodes
|
| 49 |
+
- constrained request budgets
|
| 50 |
|
| 51 |
+
The goal is not to test whether the agent can describe the workflow. The goal is to test whether it can execute it correctly, completely, efficiently, and robustly.
|
| 52 |
|
| 53 |
+
---
|
| 54 |
|
| 55 |
+
## Environment design
|
| 56 |
|
| 57 |
+
Each episode gives the agent a parameterized retrieval task and a limited request budget. The agent must:
|
| 58 |
|
| 59 |
+
1. Read the task specification
|
| 60 |
+
2. Fetch all necessary pages
|
| 61 |
+
3. Deduplicate records correctly
|
| 62 |
+
4. Filter out contaminating totals rows
|
| 63 |
+
5. Submit a clean final result with an execution trace
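
The five steps above can be sketched as a single collection loop. This is a minimal illustration, not the benchmark's reference agent: the response shape (`rows`, `has_more`, `status`, `retry`), the primary-key fields, and the `is_total` flag follow the tool description in the previous README revision, while `collect_clean_rows` and `make_fake_api` are hypothetical names introduced here. A real agent would also wait between retries; the sleep is omitted to keep the example fast.

```python
from typing import Callable

# Primary-key fields used for deduplication (taken from the earlier README text).
PK_FIELDS = ("year", "reporter", "partner", "flow", "hs", "record_id")


def collect_clean_rows(fetch_page: Callable[[int], dict],
                       budget: int = 100,
                       max_retries: int = 3):
    """Fetch every page, retry transient faults, dedupe, and drop totals rows."""
    rows_by_pk = {}
    log = []
    requests_used = 0
    page = 1
    while True:
        for _attempt in range(max_retries + 1):
            if requests_used >= budget:
                raise RuntimeError("request budget exhausted")
            requests_used += 1
            result = fetch_page(page)
            log.append(f"page={page} request={requests_used} "
                       f"status={result.get('status', 200)}")
            if not result.get("retry"):
                break  # page fetched successfully
        else:
            raise RuntimeError(f"page {page} still failing after retries")

        for row in result["rows"]:
            if row.get("is_total"):
                continue  # summary rows would contaminate aggregates
            # Dict assignment keyed by primary key removes duplicates.
            rows_by_pk[tuple(row[f] for f in PK_FIELDS)] = row

        if not result["has_more"]:
            break
        page += 1
    log.append("complete=true")
    return list(rows_by_pk.values()), log


def make_fake_api():
    """Hypothetical two-page API: one 429 on page 2, one duplicate, one totals row."""
    def row(rid, is_total=False):
        return {"year": 2022, "reporter": "A", "partner": "B", "flow": "M",
                "hs": "01", "record_id": rid, "is_total": is_total}

    pages = {1: [row(1), row(2)],
             2: [row(2), row(3), row(0, is_total=True)]}
    already_failed = set()

    def fetch_page(page):
        if page == 2 and page not in already_failed:
            already_failed.add(page)
            return {"status": 429, "retry": True}
        return {"rows": pages[page], "page": page,
                "total_pages": 2, "has_more": page < 2}

    return fetch_page


rows, log = collect_clean_rows(make_fake_api())
print(sorted(r["record_id"] for r in rows))
print(log)
```

Keying the collected rows by primary key makes deduplication a side effect of storage, and logging every request (including the failed one) is what leaves the auditable trace the benchmark rewards.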
|
| 64 |
|
| 65 |
+
The benchmark is structured as a curriculum of ten tasks, moving from baseline correctness to progressively harder reliability challenges — including mixed faults, adaptive fault escalation mid-episode, and tighter resource constraints.
|
| 66 |
|
| 67 |
+
This progression matters. It allows us to separate distinct capabilities:
|
| 68 |
|
| 69 |
+
- baseline correctness
|
| 70 |
+
- pagination handling
|
| 71 |
+
- data hygiene
|
| 72 |
+
- retry behavior under transient errors
|
| 73 |
+
- adaptability when conditions shift mid-episode
|
| 74 |
+
- efficiency under constrained budgets
|
| 75 |
|
| 76 |
+
The adaptive adversary task (T9) is, to our knowledge, one of the earliest OpenEnv-style tasks to model within-episode fault escalation explicitly: the environment becomes harder as the agent makes progress, rather than presenting a fixed challenge throughout.
|
| 77 |
|
| 78 |
---
|
| 79 |
|
| 80 |
+
## Why OpenEnv
|
| 81 |
|
| 82 |
+
We built ComtradeBench on OpenEnv because this benchmark is meant to be more than a one-off simulator.
|
| 83 |
|
| 84 |
+
OpenEnv gives us a standard environment interface, reproducible execution, and clean integration with evaluation and post-training workflows. That makes ComtradeBench usable both as a benchmark and as a training substrate for improving agent reliability.
|
| 85 |
|
| 86 |
+
Our goal is not only to score agents, but to provide a reusable environment where robustness can be studied systematically — and where agents can be trained against the same conditions they are evaluated on.
|
| 87 |
|
| 88 |
+
---
|
| 89 |
|
| 90 |
+
## Scoring what actually matters
|
| 91 |
|
| 92 |
+
ComtradeBench uses structured evaluation rather than a binary success/failure label. Agents are scored across six dimensions:
|
| 93 |
|
| 94 |
+
| Dimension | Weight | What it measures |
|
| 95 |
+
|-----------|--------|-----------------|
|
| 96 |
+
| Correctness | 30% | All expected rows present with correct field values |
|
| 97 |
+
| Completeness | 15% | Zero missing records |
|
| 98 |
+
| Robustness | 15% | Correct fault handling with logged evidence |
|
| 99 |
+
| Efficiency | 15% | Request count relative to task-optimal minimum |
|
| 100 |
+
| Data Quality | 15% | No duplicates or leaked totals rows |
|
| 101 |
+
| Observability | 10% | Structured execution trace in the run log |
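
As a rough sketch of how the weights in this table could combine into one score, assume each dimension is first graded as a fraction between 0 and 1 and that the reward is the 0 to 100 score divided by 100 (the previous README revision describes that mapping). The function name `total_score` and the dimension keys are hypothetical, and any caps or penalties the real judge applies are left out.

```python
# Weights copied from the scoring table; they sum to 100.
WEIGHTS = {
    "correctness": 30,
    "completeness": 15,
    "robustness": 15,
    "efficiency": 15,
    "data_quality": 15,
    "observability": 10,
}


def total_score(fractions: dict) -> tuple:
    """Combine per-dimension fractions (0.0 to 1.0) into (score, reward)."""
    missing = set(WEIGHTS) - set(fractions)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    # Clamp each fraction so a buggy grader cannot push a dimension out of range.
    score = sum(WEIGHTS[d] * min(max(fractions[d], 0.0), 1.0) for d in WEIGHTS)
    return score, score / 100.0


# Example: correct and complete data, but no logged retry evidence
# and a partially structured run log.
score, reward = total_score({
    "correctness": 1.0,
    "completeness": 1.0,
    "robustness": 0.0,
    "efficiency": 1.0,
    "data_quality": 1.0,
    "observability": 0.8,
})
print(round(score, 1), round(reward, 3))
```

In this example an agent with perfect data still ends up near 83 because Robustness and part of Observability are scored separately from the output itself.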
|
| 102 |
|
| 103 |
+
This matters because reliable execution is multi-dimensional.
|
| 104 |
|
| 105 |
+
An agent may retrieve correct-looking output while missing pages. Another may finish the task but waste budget. A third may recover from faults but leave no usable trace of what happened. These behaviors are not equivalent, and the benchmark does not treat them as equivalent.
|
| 106 |
|
| 107 |
+
The Observability dimension is especially important. In real systems, agents must not only act correctly — they must also leave execution traces that are inspectable and auditable. Rewarding this behavior during training shapes better habits for deployment.
|
| 108 |
|
| 109 |
---
|
| 110 |
|
| 111 |
+
## Baselines and results
|
| 112 |
|
| 113 |
+
A rule-based baseline agent achieves an average score of **96.8 / 100** across all ten tasks, confirming the environment is well-calibrated and solvable. The deterministic baseline's only consistent gap is on fault-injection tasks (T4, T5), where the Robustness dimension requires explicit logged evidence of retry behavior — correct data alone is not sufficient.
|
| 114 |
|
| 115 |
+
An LLM agent evaluated with Moonshot V1-8K achieves an average score of **94.4 / 100** on tasks T1–T8. The LLM outperforms the rule-based baseline on Observability, since language models tend to generate more informative execution traces, but its scores drop on the fault tasks where retry logging is required. This directional finding suggests that GRPO training on the fault tasks could improve overall scores by optimizing log-writing behavior alongside tool-sequencing decisions.
|
| 116 |
|
| 117 |

|
| 118 |
|
| 119 |
+
We also ran 8 iterations of GRPO-style rollouts with group-relative advantage normalization. The training signal is reward-only — no human labels, no reward model. Mean reward exceeded the rule-based baseline in 6 of 8 iterations.
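
Group-relative advantage normalization can be illustrated in a few lines. The formula `A_i = (r_i - mean_group) / (std_group + 1e-8)` comes from the previous README revision; using the population standard deviation and the helper name `group_advantages` are assumptions made for this sketch, not a description of the training script.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize episode rewards within one rollout group.

    Each rollout is compared only with the other rollouts of the same
    task, so no learned value function or reward model is required.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption in this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two rollouts of the same task: the better one gets a positive advantage.
advantages = group_advantages([0.9, 0.7])
print(advantages)

# Identical rewards carry no learning signal: every advantage is zero.
flat = group_advantages([0.5, 0.5])
print(flat)
```

The `eps` term keeps the division well defined when every rollout in a group earns the same reward.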
|
| 120 |
|
| 121 |

|
| 122 |
|
| 123 |
---
|
| 124 |
|
| 125 |
+
## What this benchmark reveals
|
| 126 |
|
| 127 |
+
ComtradeBench is designed to expose a gap that clean evaluations often miss: agents can appear capable in idealized settings while remaining brittle in the face of operational noise.
|
| 128 |
|
| 129 |
+
In our setting, the hardest problems are not usually "knowing what the API is." They are:
|
| 130 |
|
| 131 |
+
- continuing correctly after an interruption
|
| 132 |
+
- maintaining data integrity across many pages
|
| 133 |
+
- adapting when the environment becomes less cooperative mid-episode
|
| 134 |
+
- balancing coverage against cost
|
| 135 |
|
| 136 |
+
This is where reliable agents differ from merely fluent ones.
|
| 137 |
|
| 138 |
---
|
| 139 |
|
| 140 |
+
## Benchmark and training substrate
|
| 141 |
|
| 142 |
+
ComtradeBench is not just an evaluation harness. It is also built to support agent improvement.
|
| 143 |
|
| 144 |
+
The environment ships with reproducible components for benchmarking, baseline comparison, and reward-based training. That makes it useful for studying not only how agents fail, but also which training signals improve reliability.
|
| 145 |
|
| 146 |
+
This is an intentional design choice. If robust tool-use is a real bottleneck for agentic AI, then we need environments that can both measure and train that capability — with the same conditions present in evaluation and in training.
|
| 147 |
|
| 148 |
---
|
| 149 |
|
| 150 |
+
## Open source and reproducible
|
| 151 |
|
| 152 |
+
ComtradeBench is fully open source. The environment, evaluation code, and training pipeline are all public and designed for reuse.
|
| 153 |
|
| 154 |
+
All benchmark data is generated procedurally from a seeded PRNG — no external fixtures, no live API dependencies. Any result is fully reproducible from a task ID and a random seed.
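
To show what "reproducible from a task ID and a random seed" can look like in practice, here is a small sketch of seeded procedural generation. It is an illustration under assumptions, not the benchmark's generator: `generate_rows`, the row fields, and the way the two inputs are hashed into one seed are all hypothetical.

```python
import hashlib
import random


def generate_rows(task_id: str, seed: int, n_rows: int = 5) -> list:
    """Derive a private PRNG from (task_id, seed) and emit synthetic rows."""
    # Hash the pair into a stable integer so the result does not depend on
    # Python's per-process string hashing.
    digest = hashlib.sha256(f"{task_id}:{seed}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # A dedicated Random instance avoids touching global random state,
    # which keeps concurrent episodes isolated from each other.
    return [
        {"record_id": i, "value": rng.randint(1, 1_000_000)}
        for i in range(n_rows)
    ]


first = generate_rows("T1_single_page", seed=42)
second = generate_rows("T1_single_page", seed=42)
other = generate_rows("T1_single_page", seed=43)
print(first == second)  # same task ID and seed give identical data
print(first == other)
```

Because the generator depends only on its two inputs, a reported score can be re-checked by anyone who knows the task ID and seed.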
|
| 155 |
|
| 156 |
+
The environment runs in-process with no external server required, deploys as a Docker container for evaluation, and integrates directly with the AgentBeats community evaluation platform via an A2A Green Agent wrapper.
|
| 157 |
|
| 158 |
---
|
| 159 |
|
| 160 |
+
## Conclusion
|
| 161 |
|
| 162 |
+
ComtradeBench focuses on a simple but under-measured question:
|
| 163 |
|
| 164 |
+
> Can an agent still finish the job when the API fights back?
|
| 165 |
|
| 166 |
+
That question matters far beyond trade data. It applies to any agent expected to operate against real interfaces with pagination, retries, noisy outputs, and resource limits.
|
| 167 |
|
| 168 |
+
If we want more reliable agents, we need environments that reward reliability directly. That is the role ComtradeBench is designed to play.
|
| 169 |
|
| 170 |
---
|
| 171 |
|
| 172 |
+
**Links:**
|
| 173 |
|
| 174 |
+
- Environment: [github.com/yonghongzhang-io/comtrade-openenv](https://github.com/yonghongzhang-io/comtrade-openenv)
|
| 175 |
+
- HF Space: [huggingface.co/spaces/yonghongzhang/comtrade-env](https://huggingface.co/spaces/yonghongzhang/comtrade-env)
|
| 176 |
+
- OpenEnv framework: [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)
|