language:
  - en
---
 
<p align="center">
  <img src="benchmark_results.png" width="80%" alt="ComtradeBench Benchmark Results"/>
</p>

<h1 align="center">ComtradeBench</h1>
<h3 align="center">An OpenEnv Benchmark for Reliable LLM Tool-Use</h3>

<p align="center">
  <a href="https://github.com/yonghongzhang-io/comtrade-openenv">
    <img src="https://img.shields.io/badge/GitHub-Repository-181717?logo=github" alt="GitHub"/>
  </a>
  &nbsp;
  <a href="https://huggingface.co/spaces/yonghongzhang/comtrade-env">
    <img src="https://img.shields.io/badge/HF%20Space-Live%20Demo-FFD21E?logo=huggingface&logoColor=black" alt="HF Space"/>
  </a>
  &nbsp;
  <img src="https://img.shields.io/badge/OpenEnv-Native-4B8BBE" alt="OpenEnv"/>
  &nbsp;
  <img src="https://img.shields.io/badge/Tasks-10-brightgreen" alt="10 Tasks"/>
  &nbsp;
  <img src="https://img.shields.io/badge/Training-GRPO-orange" alt="GRPO"/>
</p>

<p align="center"><em>AgentBeats Phase 2 · OpenEnv Challenge Submission &nbsp;|&nbsp; Author: MateFin</em></p>
 
---

## Agents should be judged by whether they finish the job

Large language models are often evaluated on what they can **say**.
Real agents, however, are judged by whether they can **finish the job** when tools fail.

In practical API workflows, failure rarely comes from language alone. Pages drift. Duplicate rows appear across requests. Rate limits interrupt execution. Transient server errors force retries. Summary rows contaminate aggregates. Budgets make brute-force strategies impossible.

These are not unusual edge cases. **They are normal operating conditions for production systems.**

ComtradeBench is an OpenEnv benchmark designed to measure exactly this problem: can an LLM agent execute a multi-step API workflow reliably under realistic failure modes?
 
 
---

Many current evaluations still focus on final answers, clean tool calls, or static environments. But deployed agents fail for more operational reasons:

| Failure | What goes wrong |
|---------|-----------------|
| Miss pages | Incomplete data submitted as complete |
| Retry incorrectly | Page skipped after an error, leaving a silent data gap |
| Double-count duplicates | Overcounted rows and inflated aggregates |
| Leak summary rows | Contaminated totals corrupt downstream analysis |
| Waste budget | Redundant fetches exhaust the request limit |
| Recover silently | No auditable trace; the failure is invisible in production |

These are **execution failures**, not just reasoning failures.

If we want useful agents, we need benchmarks that measure reliable task completion under imperfect conditions, not only answer quality in idealized settings.
 
 
---

## What ComtradeBench is

> ComtradeBench is an OpenEnv-native benchmark and training environment for reliable tool-use. The domain is trade-data retrieval; the problem is broader: robust multi-step API execution under shifting, imperfect, and partially adversarial conditions.

The environment asks an agent to retrieve, clean, and submit records from a paginated API while handling:

- **Pagination drift**: page ordering randomized between calls
- **Duplicate records**: within-page (8%) and cross-page (3%) overlap
- **Transient errors**: HTTP 429 rate limits and HTTP 500 server faults
- **Totals trap**: synthetic summary rows mixed into real data
- **Mixed faults**: rate-limit retry and deduplication simultaneously
- **Constrained budget**: a halved request limit with no room for waste

The goal is not to test whether the agent can *describe* the workflow.
The goal is to test whether it can *execute* it: correctly, completely, efficiently, and robustly.
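
Cleaning duplicated pages reduces, at its core, to first-occurrence deduplication by primary key. A minimal sketch, where the `record_id` field name and row shape are illustrative assumptions rather than the benchmark's actual schema:

```python
def dedup_records(pages):
    """Merge paginated rows, keeping the first occurrence of each primary key."""
    seen = set()
    clean = []
    for rows in pages:
        for row in rows:
            key = row["record_id"]  # hypothetical primary-key field
            if key in seen:
                continue  # drops within-page and cross-page duplicates alike
            seen.add(key)
            clean.append(row)
    return clean

pages = [
    [{"record_id": 1, "value": 10}, {"record_id": 2, "value": 20}],
    [{"record_id": 2, "value": 20}, {"record_id": 3, "value": 30}],  # cross-page dup
]
print(len(dedup_records(pages)))  # 3 unique records
```

Keeping the *first* occurrence matters under pagination drift: the same key may reappear on a later, reordered page, and a set-based membership check keeps the pass O(n) over all fetched rows.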
 
---

## Environment design

Each episode gives the agent a parameterized retrieval task and a limited request budget. The agent interacts through **three MCP tools only**:

```
get_task_info()        → task parameters + request budget
fetch_page(page, size) → {rows, has_more} or {status: 429|500, retry: true}
submit_results(...)    → {reward, score, breakdown}
```

The benchmark is structured as a **curriculum of ten tasks**:

| # | Task | Core challenge |
|---|------|----------------|
| T1 | Single page | Baseline correctness |
| T2 | Multi-page pagination | Merge 2,345+ rows across pages |
| T3 | Duplicates | Primary-key deduplication |
| T4 | HTTP 429 | Backoff + retry without data loss |
| T5 | HTTP 500 | Transient error recovery |
| T6 | Page drift | Canonicalize under non-deterministic ordering |
| T7 | Totals trap | Filter `is_total=true` rows |
| T8 | Mixed faults | Retry AND dedup simultaneously |
| **T9** | **Adaptive adversary** | **Fault intensity escalates mid-episode** |
| **T10** | **Constrained budget** | **50 requests instead of 100** |

T9 is, to our knowledge, among the earliest OpenEnv-style tasks to model **within-episode fault escalation**, where the environment becomes harder as the agent makes progress.
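
The fetch loop an agent must run against these tools can be sketched end to end. Everything below is a hypothetical stand-in: `make_stub_env` simulates the environment, and the return shapes follow the three-tool summary above rather than the real MCP payloads.

```python
import random

def make_stub_env(total_pages=3, fault_rate=0.3, seed=7):
    """Hypothetical stand-in for fetch_page: sometimes faults, else returns rows."""
    rng = random.Random(seed)
    def fetch_page(page, size=100):
        if rng.random() < fault_rate:
            return {"status": 429, "retry": True}  # transient fault
        rows = [{"record_id": page * size + i} for i in range(size)]
        return {"rows": rows, "has_more": page < total_pages - 1}
    return fetch_page

def fetch_all(fetch_page, budget=100):
    """Fetch every page within budget, retrying faults and logging each step."""
    log, rows, page, requests = [], [], 0, 0
    while requests < budget:
        resp = fetch_page(page)
        requests += 1
        if resp.get("retry"):  # 429/500: retry the SAME page, leave evidence
            log.append(f"fault status={resp['status']} page={page} -> retry")
            continue
        rows.extend(resp["rows"])
        log.append(f"page={page} rows={len(resp['rows'])}")
        if not resp["has_more"]:
            break
        page += 1
    return rows, log

rows, log = fetch_all(make_stub_env())
```

The design point the curriculum probes is visible even in this toy: retrying the *same* page after a fault (rather than advancing) is exactly the behavior whose absence produces the "retry incorrectly" silent data gap.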
 
---

## Why OpenEnv

We built ComtradeBench on OpenEnv because this benchmark is meant to be more than a one-off simulator.

OpenEnv gives us a standard environment interface, reproducible execution, and clean integration with evaluation and post-training workflows. The same environment code runs both in-process during GRPO training and as a deployed Docker service during evaluation, with no divergence.

Our goal is not only to score agents, but to provide a **reusable environment where robustness can be studied and trained systematically**.
 
---

## Scoring what actually matters

ComtradeBench uses structured evaluation across **six dimensions**, not a binary pass/fail:

| Dimension | Weight | What it measures |
|-----------|:------:|------------------|
| Correctness | **30%** | All expected rows present with correct field values |
| Completeness | 15% | Zero missing records |
| Robustness | 15% | Correct fault handling with logged evidence |
| Efficiency | 15% | Request count vs. the task-optimal minimum |
| Data Quality | 15% | No duplicates or leaked totals rows |
| Observability | 10% | Structured execution trace in the run log |

**Why multi-dimensional scoring matters:** an agent that retrieves correct data but skips retry logging loses 15 points on Robustness. An agent that skips pages to save budget loses Completeness and all Efficiency credit. These behaviors are not equivalent, and the benchmark does not treat them as equivalent.

The **Observability** dimension deserves special note: requiring structured log entries incentivizes the agent to maintain explicit execution state. This is not artificial; structured logs are how production ETL pipelines are monitored and debugged.
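
The weighted combination implied by the table can be sketched as follows. Only the weights come from the table above; the dictionary keys and the example per-dimension scores are illustrative assumptions.

```python
# Weights from the scoring table; they must sum to 1.0.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.15,
    "robustness": 0.15,
    "efficiency": 0.15,
    "data_quality": 0.15,
    "observability": 0.10,
}

def overall_score(dim_scores):
    """Weighted sum of per-dimension scores, each on a 0-100 scale."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[d] * dim_scores[d] for d in WEIGHTS)

# Illustrative run: correct data but no retry evidence, so Robustness is 0.
scores = {"correctness": 100, "completeness": 100, "robustness": 0,
          "efficiency": 100, "data_quality": 100, "observability": 100}
print(round(overall_score(scores), 1))  # 85.0: the 15-point penalty described above
```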
 
---

## Baselines and results

### Rule-based baseline (no LLM)

A deterministic rule-based agent achieves an average of **96.8 / 100** across all ten tasks, confirming the environment is well-calibrated and solvable.

| Task | Score | Reward |
|------|------:|-------:|
| T1 Single page | 98.0 | 0.980 |
| T2 Multi-page | 98.0 | 0.980 |
| T3 Duplicates | 98.0 | 0.980 |
| T4 Rate limit (429) | 95.0 | 0.950 |
| T5 Server error (500) | 95.7 | 0.957 |
| T6 Page drift | 94.0 | 0.940 |
| T7 Totals trap | 98.0 | 0.980 |
| T8 Mixed faults | 96.4 | 0.964 |
| T9 Adaptive adversary | 96.9 | 0.969 |
| T10 Constrained budget | 98.0 | 0.980 |
| **Average** | **96.8** | **0.968** |
 
### LLM agent: Moonshot V1-8K (Kimi API)

| Task | Score | Reward |
|------|------:|-------:|
| T1 Single page | 98.7 | 0.987 |
| T2 Multi-page | 98.7 | 0.987 |
| T3 Duplicates | 98.7 | 0.987 |
| T4 Rate limit | 83.7 | 0.837 |
| T5 Server error | 84.3 | 0.843 |
| T6 Page drift | 94.7 | 0.947 |
| T7 Totals trap | 98.7 | 0.987 |
| T8 Mixed faults | 97.3 | 0.973 |
| **Average** | **94.4** | **0.944** |

The LLM outperforms the rule-based baseline on Observability: natural-language models generate more informative execution traces. The gap on T4/T5 reflects that the Robustness dimension requires **explicit logged evidence** of retry behavior, not just correct output.
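
The kind of logged retry evidence this gap points at can be sketched as exponential backoff with a structured JSON-lines trace. The log schema and helper names here are assumptions for illustration, not the benchmark's required format.

```python
import json
import time

def log_event(log, **fields):
    """Append one structured JSON-lines entry to the run log."""
    fields["ts"] = round(time.time(), 3)
    log.append(json.dumps(fields))

def fetch_with_backoff(fetch_page, page, log, max_retries=4, base_delay=0.01):
    """Retry transient 429/500 responses with exponential backoff, logging each attempt."""
    for attempt in range(max_retries + 1):
        resp = fetch_page(page)
        if not resp.get("retry"):
            log_event(log, event="fetch_ok", page=page, attempt=attempt)
            return resp
        delay = base_delay * (2 ** attempt)
        log_event(log, event="retry", page=page, status=resp["status"],
                  attempt=attempt, backoff_s=delay)
        time.sleep(delay)
    raise RuntimeError(f"page {page} still failing after {max_retries} retries")

# A flaky stub: fails twice with 429, then succeeds.
responses = iter([{"status": 429, "retry": True},
                  {"status": 429, "retry": True},
                  {"rows": [1, 2, 3], "has_more": False}])
log = []
resp = fetch_with_backoff(lambda page: next(responses), page=0, log=log)
```

After the run, `log` contains two `retry` entries and one `fetch_ok` entry, which is precisely the auditable evidence that distinguishes a robust recovery from a silent one.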
 
### GRPO training curve

We ran 8 iterations of GRPO-style rollouts with group-relative advantage normalization. The training signal is reward-only: no human labels, no reward model. Mean reward exceeded the rule-based baseline in **6 of 8 iterations**.

<p align="center">
  <img src="training_curve.png" width="80%" alt="GRPO Training Curve"/>
</p>
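
Group-relative advantage normalization, as named above, fits in a few lines: each rollout's advantage is its reward minus the group mean, scaled by the group standard deviation. The reward values below are illustrative, not results from the runs reported here.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same task."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of the same task: the best gets a positive advantage,
# the worst a negative one, with no reward model or human label needed.
advantages = group_relative_advantages([0.96, 0.84, 0.99, 0.91])
```

Because the baseline is the group's own mean, the signal is relative: it rewards being better than sibling rollouts on the same task, which is what makes reward-only training viable here.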
 
---

## What this benchmark reveals

ComtradeBench is designed to expose a gap that clean evaluations often miss: agents can appear capable in idealized settings while remaining brittle under operational noise.

The hardest problems are not "knowing what the API is." They are:

- continuing correctly **after an interruption**
- maintaining data integrity **across many pages**
- adapting when **conditions shift mid-episode**
- balancing **coverage against cost**

This is where reliable agents differ from merely fluent ones.
 
 
---

## Benchmark and training substrate

ComtradeBench is not just an evaluation harness; it is also built to support agent improvement.

The environment ships with a full **GRPO training pipeline**: reproducible rollouts, group-relative advantage normalization, and reward-only optimization. No human labels needed. No separate reward model.

This is an intentional design choice: if robust tool-use is a real bottleneck for agentic AI, we need environments that can **both measure and train** that capability, with identical conditions in evaluation and training.
 
---

## Quick start

```bash
# No LLM, no GPU, no API key required
git clone https://github.com/yonghongzhang-io/comtrade-openenv
pip install openenv-core[core]
python agent/smoke_test.py --task T1_single_page
python agent/smoke_test.py --task T9_adaptive_adversary

# GRPO training via local Ollama (CPU-capable)
python agent/train_grpo.py \
  --api-url http://localhost:11434/v1 \
  --api-model qwen2.5:7b \
  --num-iterations 200 --group-size 4
```

All benchmark data is generated procedurally from a seeded PRNG: no external fixtures, no live API dependency. Every result is fully reproducible from a task ID and a random seed.
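
The reproducibility claim can be illustrated with a toy generator. The row schema and seed-string format below are assumptions; the point is only that a (task ID, seed) pair fully determines every row.

```python
import random

def generate_rows(task_id: str, seed: int, n: int = 5):
    """Procedurally generate deterministic rows from (task_id, seed)."""
    # String seeds are hashed deterministically by random.Random,
    # so the same pair always yields the same stream.
    rng = random.Random(f"{task_id}:{seed}")
    return [{"record_id": i, "value": rng.randint(0, 999)} for i in range(n)]

# Same task + seed: identical data, so any reported result can be replayed.
a = generate_rows("T1_single_page", seed=42)
b = generate_rows("T1_single_page", seed=42)
c = generate_rows("T1_single_page", seed=43)
assert a == b and a != c
```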
 
---

## Conclusion

> **Can an agent still finish the job when the API fights back?**

That question matters far beyond trade data. It applies to any agent expected to operate against real interfaces with pagination, retries, noisy outputs, and resource limits.

If we want more reliable agents, we need environments that reward reliability directly.
That is the role ComtradeBench is designed to play.

---

<p align="center">
  <a href="https://github.com/yonghongzhang-io/comtrade-openenv">GitHub</a>
  &nbsp;·&nbsp;
  <a href="https://huggingface.co/spaces/yonghongzhang/comtrade-env">HF Space</a>
  &nbsp;·&nbsp;
  <a href="https://github.com/meta-pytorch/OpenEnv">OpenEnv Framework</a>
</p>