language:
  - en
---
 
<p align="center">
  <img src="benchmark_results.png" width="80%" alt="ComtradeBench Benchmark Results"/>
</p>

<h1 align="center">ComtradeBench</h1>
<h3 align="center">An OpenEnv Benchmark for Reliable LLM Tool-Use</h3>

<p align="center">
  <a href="https://github.com/yonghongzhang-io/comtrade-openenv">
    <img src="https://img.shields.io/badge/GitHub-Repository-181717?logo=github" alt="GitHub"/>
  </a>
  &nbsp;
  <a href="https://huggingface.co/spaces/yonghongzhang/comtrade-env">
    <img src="https://img.shields.io/badge/HF%20Space-Live%20Demo-FFD21E?logo=huggingface&logoColor=black" alt="HF Space"/>
  </a>
  &nbsp;
  <img src="https://img.shields.io/badge/OpenEnv-Native-4B8BBE" alt="OpenEnv"/>
  &nbsp;
  <img src="https://img.shields.io/badge/Tasks-10-brightgreen" alt="10 Tasks"/>
  &nbsp;
  <img src="https://img.shields.io/badge/Training-GRPO-orange" alt="GRPO"/>
</p>

<p align="center"><em>AgentBeats Phase 2 · OpenEnv Challenge Submission &nbsp;|&nbsp; Author: MateFin</em></p>
 
---

## Agents should be judged by whether they finish the job

Large language models are often evaluated on what they can **say**.
Real agents, however, are judged by whether they can **finish the job** when tools fail.

In practical API workflows, failure rarely comes from language alone. Pages drift. Duplicate rows appear across requests. Rate limits interrupt execution. Transient server errors force retries. Summary rows contaminate aggregates. Budgets make brute-force strategies impossible.

These are not unusual edge cases. **They are normal operating conditions for production systems.**

ComtradeBench is an OpenEnv benchmark designed to measure exactly this problem: can an LLM agent execute a multi-step API workflow reliably under realistic failure modes?
 
 
---

Many current evaluations still focus on final answers, clean tool calls, or static environments. But deployed agents fail for more operational reasons:

| Failure | What goes wrong |
|---------|-----------------|
| Miss pages | Incomplete data submitted as complete |
| Retry incorrectly | Page skipped after an error, leaving a silent data gap |
| Double-count duplicates | Overcounted rows and inflated aggregates |
| Leak summary rows | Contaminated totals corrupt downstream analysis |
| Waste budget | Redundant fetches exhaust the request limit |
| Recover silently | No auditable trace; the failure is invisible in production |

These are **execution failures**, not just reasoning failures.

If we want useful agents, we need benchmarks that measure reliable task completion under imperfect conditions, not only answer quality in idealized settings.
 
 
---

## What ComtradeBench is

> ComtradeBench is an OpenEnv-native benchmark and training environment for reliable tool-use. The domain is trade-data retrieval; the problem is broader: robust multi-step API execution under shifting, imperfect, and partially adversarial conditions.

The environment asks an agent to retrieve, clean, and submit records from a paginated API while handling:

- **Pagination drift**: page ordering randomized between calls
- **Duplicate records**: within-page (8%) and cross-page (3%) overlap
- **Transient errors**: HTTP 429 rate limits and HTTP 500 server faults
- **Totals trap**: synthetic summary rows mixed into real data
- **Mixed faults**: rate-limit retry and deduplication simultaneously
- **Constrained budget**: a halved request limit with no room for waste

The goal is not to test whether the agent can *describe* the workflow.
The goal is to test whether it can *execute* it: correctly, completely, efficiently, and robustly.
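
Cleaning duplicated pages reduces, at its core, to first-occurrence deduplication by primary key. A minimal sketch, where the `record_id` field name and row shape are illustrative assumptions rather than the benchmark's actual schema:

```python
def dedup_records(pages):
    """Merge paginated rows, keeping the first occurrence of each primary key."""
    seen = set()
    clean = []
    for rows in pages:
        for row in rows:
            key = row["record_id"]  # hypothetical primary-key field
            if key in seen:
                continue  # drops within-page and cross-page duplicates alike
            seen.add(key)
            clean.append(row)
    return clean

pages = [
    [{"record_id": 1, "value": 10}, {"record_id": 2, "value": 20}],
    [{"record_id": 2, "value": 20}, {"record_id": 3, "value": 30}],  # cross-page dup
]
print(len(dedup_records(pages)))  # 3 unique records
```

Keeping the *first* occurrence matters under pagination drift: the same key may reappear on a later, reordered page, and a set-based membership check keeps the pass O(n) over all fetched rows.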
 
---

## Environment design

Each episode gives the agent a parameterized retrieval task and a limited request budget. The agent interacts through **three MCP tools only**:

```
get_task_info()        → task parameters + request budget
fetch_page(page, size) → {rows, has_more} or {status: 429|500, retry: true}
submit_results(...)    → {reward, score, breakdown}
```

The benchmark is structured as a **curriculum of ten tasks**:

| # | Task | Core challenge |
|---|------|----------------|
| T1 | Single page | Baseline correctness |
| T2 | Multi-page pagination | Merge 2,345+ rows across pages |
| T3 | Duplicates | Primary-key deduplication |
| T4 | HTTP 429 | Backoff + retry without data loss |
| T5 | HTTP 500 | Transient error recovery |
| T6 | Page drift | Canonicalize under non-deterministic ordering |
| T7 | Totals trap | Filter `is_total=true` rows |
| T8 | Mixed faults | Retry AND dedup simultaneously |
| **T9** | **Adaptive adversary** | **Fault intensity escalates mid-episode** |
| **T10** | **Constrained budget** | **50 requests instead of 100** |

T9 is, to our knowledge, among the earliest OpenEnv-style tasks to model **within-episode fault escalation**, where the environment becomes harder as the agent makes progress.
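
The fetch loop an agent must run against these tools can be sketched end to end. Everything below is a hypothetical stand-in: `make_stub_env` simulates the environment, and the return shapes follow the three-tool summary above rather than the real MCP payloads.

```python
import random

def make_stub_env(total_pages=3, fault_rate=0.3, seed=7):
    """Hypothetical stand-in for fetch_page: sometimes faults, else returns rows."""
    rng = random.Random(seed)
    def fetch_page(page, size=100):
        if rng.random() < fault_rate:
            return {"status": 429, "retry": True}  # transient fault
        rows = [{"record_id": page * size + i} for i in range(size)]
        return {"rows": rows, "has_more": page < total_pages - 1}
    return fetch_page

def fetch_all(fetch_page, budget=100):
    """Fetch every page within budget, retrying faults and logging each step."""
    log, rows, page, requests = [], [], 0, 0
    while requests < budget:
        resp = fetch_page(page)
        requests += 1
        if resp.get("retry"):  # 429/500: retry the SAME page, leave evidence
            log.append(f"fault status={resp['status']} page={page} -> retry")
            continue
        rows.extend(resp["rows"])
        log.append(f"page={page} rows={len(resp['rows'])}")
        if not resp["has_more"]:
            break
        page += 1
    return rows, log

rows, log = fetch_all(make_stub_env())
```

The design point the curriculum probes is visible even in this toy: retrying the *same* page after a fault (rather than advancing) is exactly the behavior whose absence produces the "retry incorrectly" silent data gap.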
 
---

## Why OpenEnv

We built ComtradeBench on OpenEnv because this benchmark is meant to be more than a one-off simulator.

OpenEnv gives us a standard environment interface, reproducible execution, and clean integration with evaluation and post-training workflows. The same environment code runs both in-process during GRPO training and as a deployed Docker service during evaluation, with no divergence.

Our goal is not only to score agents, but to provide a **reusable environment where robustness can be studied and trained systematically**.
 
---

## Scoring what actually matters

ComtradeBench uses structured evaluation across **six dimensions**, not a binary pass/fail:

| Dimension | Weight | What it measures |
|-----------|:------:|------------------|
| Correctness | **30%** | All expected rows present with correct field values |
| Completeness | 15% | Zero missing records |
| Robustness | 15% | Correct fault handling with logged evidence |
| Efficiency | 15% | Request count vs. the task-optimal minimum |
| Data Quality | 15% | No duplicates or leaked totals rows |
| Observability | 10% | Structured execution trace in the run log |

**Why multi-dimensional scoring matters:** an agent that retrieves correct data but skips retry logging loses 15 points on Robustness. An agent that skips pages to save budget loses Completeness and all Efficiency credit. These behaviors are not equivalent, and the benchmark does not treat them as equivalent.

The **Observability** dimension deserves special note: requiring structured log entries incentivizes the agent to maintain explicit execution state. This is not artificial; structured logs are how production ETL pipelines are monitored and debugged.
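
The weighted combination implied by the table can be sketched as follows. Only the weights come from the table above; the dictionary keys and the example per-dimension scores are illustrative assumptions.

```python
# Weights from the scoring table; they must sum to 1.0.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.15,
    "robustness": 0.15,
    "efficiency": 0.15,
    "data_quality": 0.15,
    "observability": 0.10,
}

def overall_score(dim_scores):
    """Weighted sum of per-dimension scores, each on a 0-100 scale."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[d] * dim_scores[d] for d in WEIGHTS)

# Illustrative run: correct data but no retry evidence, so Robustness is 0.
scores = {"correctness": 100, "completeness": 100, "robustness": 0,
          "efficiency": 100, "data_quality": 100, "observability": 100}
print(round(overall_score(scores), 1))  # 85.0: the 15-point penalty described above
```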
 
---

## Baselines and results

### Rule-based baseline (no LLM)

A deterministic rule-based agent achieves an average of **96.8 / 100** across all ten tasks, confirming the environment is well-calibrated and solvable.

| Task | Score | Reward |
|------|------:|-------:|
| T1 Single page | 98.0 | 0.980 |
| T2 Multi-page | 98.0 | 0.980 |
| T3 Duplicates | 98.0 | 0.980 |
| T4 Rate limit (429) | 95.0 | 0.950 |
| T5 Server error (500) | 95.7 | 0.957 |
| T6 Page drift | 94.0 | 0.940 |
| T7 Totals trap | 98.0 | 0.980 |
| T8 Mixed faults | 96.4 | 0.964 |
| T9 Adaptive adversary | 96.9 | 0.969 |
| T10 Constrained budget | 98.0 | 0.980 |
| **Average** | **96.8** | **0.968** |
 
### LLM agent: Moonshot V1-8K (Kimi API)

| Task | Score | Reward |
|------|------:|-------:|
| T1 Single page | 98.7 | 0.987 |
| T2 Multi-page | 98.7 | 0.987 |
| T3 Duplicates | 98.7 | 0.987 |
| T4 Rate limit | 83.7 | 0.837 |
| T5 Server error | 84.3 | 0.843 |
| T6 Page drift | 94.7 | 0.947 |
| T7 Totals trap | 98.7 | 0.987 |
| T8 Mixed faults | 97.3 | 0.973 |
| **Average** | **94.4** | **0.944** |

The LLM outperforms the rule-based baseline on Observability: natural-language models generate more informative execution traces. The gap on T4/T5 reflects that the Robustness dimension requires **explicit logged evidence** of retry behavior, not just correct output.
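
The kind of logged retry evidence this gap points at can be sketched as exponential backoff with a structured JSON-lines trace. The log schema and helper names here are assumptions for illustration, not the benchmark's required format.

```python
import json
import time

def log_event(log, **fields):
    """Append one structured JSON-lines entry to the run log."""
    fields["ts"] = round(time.time(), 3)
    log.append(json.dumps(fields))

def fetch_with_backoff(fetch_page, page, log, max_retries=4, base_delay=0.01):
    """Retry transient 429/500 responses with exponential backoff, logging each attempt."""
    for attempt in range(max_retries + 1):
        resp = fetch_page(page)
        if not resp.get("retry"):
            log_event(log, event="fetch_ok", page=page, attempt=attempt)
            return resp
        delay = base_delay * (2 ** attempt)
        log_event(log, event="retry", page=page, status=resp["status"],
                  attempt=attempt, backoff_s=delay)
        time.sleep(delay)
    raise RuntimeError(f"page {page} still failing after {max_retries} retries")

# A flaky stub: fails twice with 429, then succeeds.
responses = iter([{"status": 429, "retry": True},
                  {"status": 429, "retry": True},
                  {"rows": [1, 2, 3], "has_more": False}])
log = []
resp = fetch_with_backoff(lambda page: next(responses), page=0, log=log)
```

After the run, `log` contains two `retry` entries and one `fetch_ok` entry, which is precisely the auditable evidence that distinguishes a robust recovery from a silent one.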
 
### GRPO training curve

We ran 8 iterations of GRPO-style rollouts with group-relative advantage normalization. The training signal is reward-only: no human labels, no reward model. Mean reward exceeded the rule-based baseline in **6 of 8 iterations**.

<p align="center">
  <img src="training_curve.png" width="80%" alt="GRPO Training Curve"/>
</p>
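
Group-relative advantage normalization, as named above, fits in a few lines: each rollout's advantage is its reward minus the group mean, scaled by the group standard deviation. The reward values below are illustrative, not results from the runs reported here.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same task."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of the same task: the best gets a positive advantage,
# the worst a negative one, with no reward model or human label needed.
advantages = group_relative_advantages([0.96, 0.84, 0.99, 0.91])
```

Because the baseline is the group's own mean, the signal is relative: it rewards being better than sibling rollouts on the same task, which is what makes reward-only training viable here.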
 
---

## What this benchmark reveals

ComtradeBench is designed to expose a gap that clean evaluations often miss: agents can appear capable in idealized settings while remaining brittle under operational noise.

The hardest problems are not "knowing what the API is." They are:

- continuing correctly **after an interruption**
- maintaining data integrity **across many pages**
- adapting when **conditions shift mid-episode**
- balancing **coverage against cost**

This is where reliable agents differ from merely fluent ones.
 
 
---

## Benchmark and training substrate

ComtradeBench is not just an evaluation harness; it is also built to support agent improvement.

The environment ships with a full **GRPO training pipeline**: reproducible rollouts, group-relative advantage normalization, and reward-only optimization. No human labels needed. No separate reward model.

This is an intentional design choice: if robust tool-use is a real bottleneck for agentic AI, we need environments that can **both measure and train** that capability, with identical conditions in evaluation and training.
 
---

## Quick start

```bash
# No LLM, no GPU, no API key required
git clone https://github.com/yonghongzhang-io/comtrade-openenv
pip install openenv-core[core]
python agent/smoke_test.py --task T1_single_page
python agent/smoke_test.py --task T9_adaptive_adversary

# GRPO training via local Ollama (CPU-capable)
python agent/train_grpo.py \
  --api-url http://localhost:11434/v1 \
  --api-model qwen2.5:7b \
  --num-iterations 200 --group-size 4
```

All benchmark data is generated procedurally from a seeded PRNG: no external fixtures, no live API dependency. Every result is fully reproducible from a task ID and a random seed.
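
The reproducibility claim can be illustrated with a toy generator. The row schema and seed-string format below are assumptions; the point is only that a (task ID, seed) pair fully determines every row.

```python
import random

def generate_rows(task_id: str, seed: int, n: int = 5):
    """Procedurally generate deterministic rows from (task_id, seed)."""
    # String seeds are hashed deterministically by random.Random,
    # so the same pair always yields the same stream.
    rng = random.Random(f"{task_id}:{seed}")
    return [{"record_id": i, "value": rng.randint(0, 999)} for i in range(n)]

# Same task + seed: identical data, so any reported result can be replayed.
a = generate_rows("T1_single_page", seed=42)
b = generate_rows("T1_single_page", seed=42)
c = generate_rows("T1_single_page", seed=43)
assert a == b and a != c
```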
 
---

## Conclusion

> **Can an agent still finish the job when the API fights back?**

That question matters far beyond trade data. It applies to any agent expected to operate against real interfaces with pagination, retries, noisy outputs, and resource limits.

If we want more reliable agents, we need environments that reward reliability directly.
That is the role ComtradeBench is designed to play.

---

<p align="center">
  <a href="https://github.com/yonghongzhang-io/comtrade-openenv">GitHub</a>
  &nbsp;·&nbsp;
  <a href="https://huggingface.co/spaces/yonghongzhang/comtrade-env">HF Space</a>
  &nbsp;·&nbsp;
  <a href="https://github.com/meta-pytorch/OpenEnv">OpenEnv Framework</a>
</p>