---
tags:
  - benchmark
  - tool-use
  - openenv
  - rl-environment
  - adversarial
  - grpo
language:
  - en
---

# ComtradeBench: An OpenEnv Benchmark for Reliable LLM Tool-Use

GitHub   HF Space   OpenEnv   10 Tasks   GRPO

*AgentBeats Phase 2 - OpenEnv Challenge Submission  |  Author: MateFin*


## Agents should be judged by whether they finish the job

Large language models are often evaluated on what they can say.
Real agents, however, are judged by whether they can finish the job when tools fail.

In practical API workflows, failure rarely comes from language alone. Pages drift. Duplicate rows appear across requests. Rate limits interrupt execution. Transient server errors force retries. Summary rows contaminate aggregates. Budgets make brute-force strategies impossible.

These are not unusual edge cases. They are normal operating conditions for production systems.

ComtradeBench is an OpenEnv benchmark designed to measure exactly this problem: can an LLM agent execute a multi-step API workflow reliably under realistic failure modes?


## Why this benchmark matters

Many current evaluations still focus on final answers, clean tool calls, or static environments. But deployed agents fail for more operational reasons:

| Failure | What goes wrong |
|---|---|
| Miss pages | Incomplete data submitted as complete |
| Retry incorrectly | Page skipped after an error; silent data gap |
| Double-count duplicates | Overcounted rows, inflated aggregates |
| Leak summary rows | Contaminated totals corrupt downstream analysis |
| Waste budget | Redundant fetches exhaust the request limit |
| Recover silently | No auditable trace; failure invisible in production |

These are execution failures, not just reasoning failures.

If we want useful agents, we need benchmarks that measure reliable task completion under imperfect conditions, not only answer quality in idealized settings.


## What ComtradeBench is

ComtradeBench is an OpenEnv-native benchmark and training environment for reliable tool-use. The domain is trade-data retrieval; the problem is broader: robust multi-step API execution under shifting, imperfect, and partially adversarial conditions.

The environment asks an agent to retrieve, clean, and submit records from a paginated API while handling:

- **Pagination drift**: page ordering randomized between calls
- **Duplicate records**: within-page (8%) and cross-page (3%) overlap
- **Transient errors**: HTTP 429 rate limits and HTTP 500 server faults
- **Totals trap**: synthetic summary rows mixed into real data
- **Mixed faults**: rate-limit retry and deduplication required simultaneously
- **Constrained budget**: halved request limit, no room for waste
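To make the transient-error conditions concrete, here is a minimal retry sketch with exponential backoff of the kind T4 and T5 demand. The `fetch` callable and the response shapes are hypothetical stand-ins for illustration, not the benchmark's actual client.

```python
import time

def fetch_with_retry(fetch, page, max_retries=5, base_delay=0.5):
    """Retry a page fetch on transient faults (429/500) with exponential backoff.

    `fetch` is a hypothetical callable returning either
    {"rows": [...], "has_more": bool} or {"status": 429|500, "retry": True}.
    """
    for attempt in range(max_retries):
        resp = fetch(page)
        if resp.get("status") in (429, 500) and resp.get("retry"):
            time.sleep(base_delay * (2 ** attempt))  # back off, then retry the SAME page
            continue
        return resp  # success: the page is never silently skipped
    raise RuntimeError(f"page {page} still failing after {max_retries} attempts")
```

The key invariant is that a faulted page is retried rather than skipped; skipping is exactly the "silent data gap" failure mode scored by the benchmark.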

The goal is not to test whether the agent can describe the workflow.
The goal is to test whether it can execute it: correctly, completely, efficiently, and robustly.


## Environment design

Each episode gives the agent a parameterized retrieval task and a limited request budget. The agent interacts through three MCP tools only:

```
get_task_info()         →  task parameters + request budget
fetch_page(page, size)  →  {rows, has_more}  or  {status: 429|500, retry: true}
submit_results(...)     →  {reward, score, breakdown}
```
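A rule-following episode over these three tools might look like the sketch below. The functions are hypothetical Python stand-ins for the MCP calls, and the field names (`id`, `is_total`) are illustrative assumptions, not the environment's exact schema.

```python
def run_episode(fetch_page, submit_results, page_size=100):
    """Sketch of a robust retrieval loop: paginate to exhaustion,
    deduplicate by primary key, and drop summary rows before submitting."""
    seen, clean_rows = set(), []
    page = 0
    while True:
        resp = fetch_page(page, page_size)
        for row in resp["rows"]:
            if row.get("is_total"):   # totals trap: never keep summary rows
                continue
            if row["id"] in seen:     # dedup within and across pages
                continue
            seen.add(row["id"])
            clean_rows.append(row)
        if not resp["has_more"]:
            break
        page += 1
    return submit_results(clean_rows)
```

The point of the sketch is that correctness lives in the loop invariants (no skipped pages, no duplicate keys, no summary rows), not in any single tool call.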

The benchmark is structured as a curriculum of ten tasks:

| # | Task | Core challenge |
|---|---|---|
| T1 | Single page | Baseline correctness |
| T2 | Multi-page pagination | Merge 2,345+ rows across pages |
| T3 | Duplicates | Primary-key deduplication |
| T4 | HTTP 429 | Backoff + retry without data loss |
| T5 | HTTP 500 | Transient error recovery |
| T6 | Page drift | Canonicalize under non-deterministic ordering |
| T7 | Totals trap | Filter `is_total=true` rows |
| T8 | Mixed faults | Retry and dedup simultaneously |
| T9 | Adaptive adversary | Fault intensity escalates mid-episode |
| T10 | Constrained budget | 50 requests instead of 100 |

T9 is, to our knowledge, among the earliest OpenEnv-style tasks to model within-episode fault escalation, where the environment becomes harder as the agent makes progress.


## Why OpenEnv

We built ComtradeBench on OpenEnv because this benchmark is meant to be more than a one-off simulator.

OpenEnv gives us a standard environment interface, reproducible execution, and clean integration with evaluation and post-training workflows. The same environment code runs both in-process during GRPO training and as a deployed Docker service during evaluation, with no divergence.

Our goal is not only to score agents, but to provide a reusable environment where robustness can be studied and trained systematically.


## Scoring what actually matters

ComtradeBench uses structured evaluation across six dimensions rather than a binary pass/fail:

| Dimension | Weight | What it measures |
|---|---|---|
| Correctness | 30% | All expected rows present with correct field values |
| Completeness | 15% | Zero missing records |
| Robustness | 15% | Correct fault handling with logged evidence |
| Efficiency | 15% | Request count vs. task-optimal minimum |
| Data Quality | 15% | No duplicates or leaked totals rows |
| Observability | 10% | Structured execution trace in the run log |

**Why multi-dimensional scoring matters:**
An agent that retrieves correct data but skips retry logging loses 15 points on Robustness. An agent that skips pages to save budget loses Completeness and all Efficiency credit. These behaviors are not equivalent, and the benchmark does not score them as equivalent.

The Observability dimension deserves special note: requiring structured log entries incentivizes the agent to maintain explicit execution state. This is not artificial; structured logs are how production ETL pipelines are monitored and debugged.
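For concreteness, the weighted total implied by the table above can be computed as follows. The dimension keys and example scores here are illustrative, not benchmark output.

```python
# Weights from the scoring table (fractions of the 100-point total).
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.15,
    "robustness": 0.15,
    "efficiency": 0.15,
    "data_quality": 0.15,
    "observability": 0.10,
}

def total_score(dimension_scores):
    """Combine per-dimension scores (each 0-100) into one weighted total."""
    assert set(dimension_scores) == set(WEIGHTS)
    return sum(WEIGHTS[d] * s for d, s in dimension_scores.items())
```

Under this scheme an otherwise perfect run that scores 0 on Robustness loses exactly the 15 points that dimension carries, which is the "correct data, no retry evidence" case described above.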


## Baselines and results

### Rule-based baseline (no LLM)

A deterministic rule-based agent achieves 96.8 / 100 average across all ten tasks, confirming the environment is well-calibrated and solvable.

| Task | Score | Reward |
|---|---|---|
| T1 Single page | 98.0 | 0.980 |
| T2 Multi-page | 98.0 | 0.980 |
| T3 Duplicates | 98.0 | 0.980 |
| T4 Rate limit (429) | 95.0 | 0.950 |
| T5 Server error (500) | 95.7 | 0.957 |
| T6 Page drift | 94.0 | 0.940 |
| T7 Totals trap | 98.0 | 0.980 |
| T8 Mixed faults | 96.4 | 0.964 |
| T9 Adaptive adversary | 96.9 | 0.969 |
| T10 Constrained budget | 98.0 | 0.980 |
| **Average** | **96.8** | **0.968** |

### LLM agent: Moonshot V1-8K (Kimi API)

| Task | Score | Reward |
|---|---|---|
| T1 Single page | 98.7 | 0.987 |
| T2 Multi-page | 98.7 | 0.987 |
| T3 Duplicates | 98.7 | 0.987 |
| T4 Rate limit | 83.7 | 0.837 |
| T5 Server error | 84.3 | 0.843 |
| T6 Page drift | 94.7 | 0.947 |
| T7 Totals trap | 98.7 | 0.987 |
| T8 Mixed faults | 97.3 | 0.973 |
| **Average** | **94.4** | **0.944** |

The LLM outperforms the rule-based baseline on Observability: natural language models generate more informative execution traces. The gap on T4/T5 reflects that the Robustness dimension requires explicit logged evidence of retry behavior, not just correct output.

### GRPO training curve

We ran 8 iterations of GRPO-style rollouts with group-relative advantage normalization. The training signal is reward-only: no human labels, no reward model. Mean reward exceeded the rule-based baseline in 6 of 8 iterations.

*[Figure: GRPO training curve]*
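The group-relative advantage normalization mentioned above can be sketched as follows. This is a generic GRPO-style formulation for illustration, not the repository's exact implementation.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one group of rollout rewards against its own statistics.

    In GRPO-style training, each prompt's group of rollouts is compared
    to the group mean, so no learned value function (critic) is needed.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts that beat their group's mean reward get positive advantages and are reinforced; below-mean rollouts are suppressed, which is why a scalar environment reward is sufficient as the sole training signal.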


## What this benchmark reveals

ComtradeBench is designed to expose a gap that clean evaluations often miss: agents can appear capable in idealized settings while remaining brittle under operational noise.

The hardest problems are not "knowing what the API is." They are:

- continuing correctly after an interruption
- maintaining data integrity across many pages
- adapting when conditions shift mid-episode
- balancing coverage against cost

This is where reliable agents differ from merely fluent ones.


## Benchmark and training substrate

ComtradeBench is not just an evaluation harness; it is built to support agent improvement.

The environment ships with a full GRPO training pipeline: reproducible rollouts, group-relative advantage normalization, and reward-only optimization. No human labels needed. No separate reward model.

This is an intentional design choice: if robust tool-use is a real bottleneck for agentic AI, we need environments that can both measure and train that capability, with identical conditions in evaluation and training.


## Quick start

```shell
# No LLM, no GPU, no API key required
git clone https://github.com/yonghongzhang-io/comtrade-openenv
pip install openenv-core[core]
python agent/smoke_test.py --task T1_single_page
python agent/smoke_test.py --task T9_adaptive_adversary

# GRPO training via local Ollama (CPU-capable)
python agent/train_grpo.py \
    --api-url http://localhost:11434/v1 \
    --api-model qwen2.5:7b \
    --num-iterations 200 --group-size 4
```

All benchmark data is generated procedurally from a seeded PRNG: no external fixtures, no live API dependency. Every result is fully reproducible from a task ID and a random seed.
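The reproducibility claim boils down to deriving every episode's data deterministically from a (task ID, seed) pair. A minimal sketch of that pattern, with hypothetical field names:

```python
import random

def generate_rows(task_id: str, seed: int, n: int = 5):
    """Derive episode data deterministically from a task ID and seed.

    The same (task_id, seed) pair always yields identical rows, so any
    reported result can be re-checked without fixtures or a live API.
    """
    rng = random.Random(f"{task_id}:{seed}")  # one dedicated PRNG per episode
    return [{"id": i, "value": rng.randint(0, 10_000)} for i in range(n)]
```

Using a dedicated `random.Random` instance (rather than the module-level global) keeps episode generation independent of any other randomness in the process.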


## Conclusion


> 💬 Can an agent still finish the job when the API fights back?


That question matters far beyond trade data. It applies to any agent expected to operate against real interfaces with pagination, retries, noisy outputs, and resource limits.

If we want more reliable agents, we need environments that reward reliability directly.
That is the role ComtradeBench is designed to play.


GitHub  ·  HF Space  ·  OpenEnv Framework