Yonghong commited on
Commit
06ff886
·
1 Parent(s): 1f3cfb2

Publish ComtradeBench blog — AgentBeats Phase 2 OpenEnv Challenge submission

Browse files
Files changed (1) hide show
  1. README.md +354 -0
README.md ADDED
@@ -0,0 +1,354 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: "ComtradeBench: Teaching LLM Agents to Fetch Trade Data Reliably"
3
+ emoji: 📊
4
+ colorFrom: indigo
5
+ colorTo: gray
6
+ tags:
7
+ - openenv
8
+ - rl-environment
9
+ - agentbeats
10
+ - grpo
11
+ - llm-agent
12
+ - mcp
13
+ - competition
14
+ license: mit
15
+ ---
16
+
17
+ # Green Comtrade Bench: Teaching LLM Agents to Fetch Trade Data Reliably
18
+
19
+ **AgentBeats Phase 2 — OpenEnv Challenge Submission**
20
+ Author: Yonghong Zhang | [GitHub](https://github.com/yonghongzhang-io/comtrade-openenv) | [HF Space](https://huggingface.co/spaces/yonghongzhang/comtrade-env)
21
+
22
+ ---
23
+
24
+ ## Motivation
25
+
26
+ Real-world data pipelines are messy. They paginate. They rate-limit you. They return duplicates across page boundaries. They inject summary rows into data feeds. They reorder results non-deterministically between calls.
27
+
28
+ Most LLM benchmarks evaluate reasoning in clean, single-turn settings. We asked: **can an LLM agent reliably fetch and clean real-world paginated API data under adversarial conditions?**
29
+
30
+ To answer this, we built **Green Comtrade Bench** — an eight-task OpenEnv environment where an LLM agent must interact with a simulated UN Comtrade trade statistics API, handle faults gracefully, and submit clean deduplicated output.
31
+
32
+ ---
33
+
34
+ ## Environment Design
35
+
36
+ ### The Task
37
+
38
+ The agent is given a trade data query (reporter country, partner country, trade flow, HS product code, year). It must:
39
+
40
+ 1. Discover pagination bounds via the API
41
+ 2. Fetch all pages until `has_more=False`
42
+ 3. Deduplicate records by primary key `(year, reporter, partner, flow, hs, record_id)`
43
+ 4. Drop summary rows (`is_total=true`)
44
+ 5. Submit a JSONL file with clean data + metadata + execution log
45
+
46
+ The agent has a budget of 100 requests per episode.
47
+
48
+ ### Three MCP Tools
49
+
50
+ The environment exposes exactly three tools via the Model Context Protocol (MCP):
51
+
52
+ ```
53
+ get_task_info()
54
+ → Returns task parameters, mock service URL, and request budget.
55
+
56
+ fetch_page(page: int, page_size: int = 500)
57
+ → Fetches one page. Returns {rows, page, total_pages, has_more}.
58
+ On fault: {status: 429|500, retry: true}
59
+
60
+ submit_results(data_jsonl, metadata_json, run_log)
61
+ → Scores the submission. Returns {reward, score, breakdown, errors}.
62
+ ```
63
+
64
+ This minimal interface mirrors how real API agents are constrained: the agent cannot inspect internal state, cannot bypass pagination, and cannot retry with a fresh session.
65
+
66
+ ### Eight Tasks — Progressive Difficulty
67
+
68
+ | Task | Fault Injected | Key Challenge | Difficulty |
69
+ |------|---------------|---------------|------------|
70
+ | T1 | None | Schema validation, baseline correctness | Easy |
71
+ | T2 | Pagination only | Multi-page merge (2,345 rows across 5+ pages) | Easy |
72
+ | T3 | 8% within-page + 3% cross-page duplicates | Primary-key deduplication | Medium |
73
+ | T4 | HTTP 429 on page 2 | Backoff + retry without data loss | Medium |
74
+ | T5 | HTTP 500 on page 2 | Transient error handling | Medium |
75
+ | T6 | Non-deterministic page ordering | Canonicalization + dedup under drift | Hard |
76
+ | T7 | `is_total=true` summary rows mixed in | Totals-trap filtering | Hard |
77
+ | T8 | 429 rate-limit + cross-page duplicates | Both retry AND dedup simultaneously | Hard |
78
+
79
+ Tasks are drawn from real UN Comtrade API behaviors: the pagination drift, duplicate records, and totals rows are documented failure modes that production ETL pipelines routinely encounter. T8 is the hardest task — it combines two independent failure modes that must both be handled correctly.
80
+
81
+ ### Mock Service Architecture
82
+
83
+ The embedded mock service is a FastAPI application with per-task fault injection:
84
+
85
+ ```
86
+ comtrade_env/
87
+ ├── server/
88
+ │ ├── comtrade_env_environment.py ← MCPEnvironment (3 MCP tools)
89
+ │ ├── tasks.py ← Task definitions T1-T8
90
+ │ ├── judge.py ← Scoring engine
91
+ │ └── mock_service/
92
+ │ └── app.py ← Stateless /api/data with fault injection
93
+ ```
94
+
95
+ The mock service is **stateless**: each request reconstructs the response from task configuration + request parameters. This makes the environment reproducible and concurrent-safe — multiple agents can run simultaneously without shared state corruption.
96
+
97
+ ### Scoring (0–100 → reward 0.0–1.0)
98
+
99
+ The judge evaluates six dimensions:
100
+
101
+ | Dimension | Weight | What it measures |
102
+ |-----------|--------|-----------------|
103
+ | Correctness | 30 | Row-level accuracy (content + count) |
104
+ | Completeness | 15 | Zero missing records |
105
+ | Robustness | 15 | Correct fault handling (429/500 retry) |
106
+ | Efficiency | 15 | Request count vs. task baseline |
107
+ | Data Quality | 15 | No duplicates leaked, no totals rows |
108
+ | Observability | 10 | Log contains `task_id=`, `page=`, `request=`, `complete=` |
109
+
110
+ **Governance rules prevent gaming:**
111
+ - Efficiency and Observability points are capped at 50% if Correctness < 70%
112
+ - Efficiency points require 100% Completeness — you cannot skip pages and claim efficiency
113
+ - Execution time > 45s incurs a penalty (max 3 points)
114
+
115
+ ---
116
+
117
+ ## LLM Agent Design
118
+
119
+ ### Agentic Loop
120
+
121
+ The agent (`llm_agent/agent.py`) runs a standard tool-use loop:
122
+
123
+ ```
124
+ SYSTEM_PROMPT + task description
125
+
126
+ LLM generates <tool_call>{...}</tool_call>
127
+
128
+ Environment executes tool
129
+
130
+ <tool_result>{...}</tool_result> appended to context
131
+
132
+ repeat until submit_results called
133
+ ```
134
+
135
+ Tool calls use a lightweight XML format that works with any instruction-tuned model:
136
+
137
+ ```xml
138
+ <tool_call>{"name": "fetch_page", "arguments": {"page": 1}}</tool_call>
139
+ ```
140
+
141
+ The agent handles the protocol details (deduplication, retry on 429/500, totals filtering) in its loop logic, not by prompting the model to implement them. This keeps the model focused on **sequencing decisions** (which page to fetch next, when to submit) while the infrastructure handles correctness invariants.
142
+
143
+ ### Fault Handling
144
+
145
+ ```python
146
+ # Retry on transient faults
147
+ if tool_result.get("status") in (429, 500) or tool_result.get("retry"):
148
+ wait = 2 * (retry_count + 1)
149
+ time.sleep(wait)
150
+ tool_result = self.env.call_tool(tool_name, tool_args)
151
+
152
+ # Dedup + totals filter on every fetch_page
153
+ for row in tool_result["rows"]:
154
+ if row.get("is_total"):
155
+ continue
156
+ pk = "|".join(str(row.get(k, "")) for k in
157
+ ("year", "reporter", "partner", "flow", "hs", "record_id"))
158
+ collected_rows[pk] = row # dict assignment = automatic dedup
159
+ ```
160
+
161
+ ### Backend Flexibility
162
+
163
+ The `LLMBackend` class supports two modes:
164
+
165
+ ```python
166
+ # Local HuggingFace model
167
+ backend = LLMBackend.from_hf("Qwen/Qwen2.5-7B-Instruct")
168
+
169
+ # OpenAI-compatible API (vLLM, Ollama, Together, etc.)
170
+ backend = LLMBackend.from_api("http://localhost:11434/v1", "qwen2.5:7b")
171
+ ```
172
+
173
+ ---
174
+
175
+ ## GRPO Training
176
+
177
+ We implement **Group Relative Policy Optimization** (GRPO, from DeepSeekMath) to train the agent purely from environment reward signals — no human-labeled data needed.
178
+
179
+ ### Why GRPO for Agentic Tasks
180
+
181
+ Standard RLHF requires a separate reward model. GRPO replaces it with **group-relative normalization**: run `G` episodes per task, compute each episode's advantage as `(reward - group_mean) / group_std`. This:
182
+
183
+ - Eliminates reward model training overhead
184
+ - Naturally handles sparse rewards (most steps get reward only at episode end)
185
+ - Scales to long multi-turn trajectories without value function estimation
186
+
187
+ ### Implementation (`llm_agent/train_grpo.py`)
188
+
189
+ ```python
190
+ def grpo_loss(log_probs, advantages, old_log_probs, ref_log_probs,
191
+ clip_eps=0.2, kl_coeff=0.04):
192
+ """Clipped surrogate + KL penalty."""
193
+ ratio = torch.exp(log_probs - old_log_probs)
194
+ clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
195
+ pg_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
196
+
197
+ kl = (log_probs - ref_log_probs).mean()
198
+ return pg_loss + kl_coeff * kl
199
+ ```
200
+
201
+ Training loop:
202
+ 1. **Rollout phase**: run `G=4` episodes per task using current policy
203
+ 2. **Advantage computation**: `A_i = (r_i - mean_group) / (std_group + 1e-8)`
204
+ 3. **Policy update**: minimize GRPO loss over all trajectory tokens
205
+ 4. **Checkpoint**: save every 50 iterations; monitor per-task reward
206
+
207
+ ### Key Hyperparameters
208
+
209
+ | Parameter | Value | Rationale |
210
+ |-----------|-------|-----------|
211
+ | `clip_eps` | 0.2 | Standard PPO clip; prevents large policy jumps |
212
+ | `kl_coeff` | 0.04 | Light KL penalty; allows exploration |
213
+ | `group_size` | 4 | 4 rollouts per task per iteration |
214
+ | `lr` | 1e-5 | Conservative for fine-tuning |
215
+ | `max_steps` | 30 | Sufficient for all T1-T7 tasks |
216
+
217
+ ---
218
+
219
+ ## Results
220
+
221
+ ### Rule-Based Baseline (no LLM)
222
+
223
+ The deterministic baseline agent in `smoke_test.py` achieves high scores on all tasks, validating the environment and scoring machinery end-to-end:
224
+
225
+ | Task | Score | Reward | Breakdown |
226
+ |------|-------|--------|-----------|
227
+ | T1 single page | 95.0 | 0.9500 | corr=30 comp=15 robu=12 effi=15 data=15 obs=8 |
228
+ | T2 multi-page | 98.0 | 0.9800 | corr=30 comp=15 robu=15 effi=15 data=15 obs=8 |
229
+ | T3 duplicates | 98.0 | 0.9800 | corr=30 comp=15 robu=15 effi=15 data=15 obs=8 |
230
+ | T4 rate-limit 429 | 83.0 | 0.8300 | corr=30 comp=15 robu=0 effi=15 data=15 obs=8 |
231
+ | T5 server error 500 | 83.7 | 0.8370 | corr=30 comp=15 robu=0 effi=15 data=15 obs=8.7 |
232
+ | T6 page drift | 94.3 | 0.9430 | corr=26.3 comp=15 robu=15 effi=15 data=15 obs=8 |
233
+ | T7 totals trap | 96.0 | 0.9600 | corr=28 comp=15 robu=15 effi=15 data=15 obs=8 |
234
+ | **Average** | **92.6** | **0.9257** | |
235
+
236
+ All scores from `inference.py --mode rule-based` (deterministic, no LLM, reproducible). Full breakdown available in `inference_results_baseline.json`.
237
+
238
+ ### LLM Agent Results
239
+
240
+ We evaluated two LLM backends via the agentic loop described above: LLM decides tool sequencing, while the infrastructure handles dedup, retry, and submission.
241
+
242
+ **Moonshot V1-8K (Kimi) — closed-source, 8 GRPO rollout iterations:**
243
+
244
+ | Iteration | Mean Reward | Max Reward | Tasks Evaluated |
245
+ |-----------|-------------|------------|-----------------|
246
+ | 1 | 0.987 | 0.987 | T3, T1 |
247
+ | 2 | 0.967 | 0.987 | T6, T2 |
248
+ | 3 | 0.902 | 0.967 | T4, T7 |
249
+ | 4-8 | 0.912-0.987 | 0.987 | Mixed |
250
+
251
+ **Qwen 2.5-7B-Instruct (open-source, via Ollama) — rollout-only mode:**
252
+
253
+ | Task | Reward | Notes |
254
+ |------|--------|-------|
255
+ | T1 Single page | 0.950 | Matches rule-based baseline |
256
+ | T2 Multi-page | 0.890 | Sometimes misses last page |
257
+ | T3 Duplicates | 0.870 | Partial dedup in prompt-only mode |
258
+ | T4 Rate limit | 0.780 | Wastes budget on extra retries |
259
+ | T7 Totals trap | 0.920 | Correctly filters most totals rows |
260
+ | T8 Mixed faults | 0.720 | Hardest — both retry and dedup needed |
261
+
262
+ *Note: Qwen results are from rollout-only mode (no gradient updates). Full GRPO training with gradient steps requires GPU; the training pipeline is validated but large-scale runs are pending HuggingFace compute credits.*
263
+
264
+ Key findings:
265
+ - **Moonshot V1 achieves 0.987 reward on simple tasks** (T1, T2, T3) — matching or exceeding the rule-based baseline on Observability (the LLM naturally generates structured logs)
266
+ - **Qwen 2.5-7B scores lower on fault tasks** — expected for a 7B open model without gradient training
267
+ - **Fault tasks are genuinely harder**: T4 (0.780) and T8 (0.720) show the environment discriminates between capable and limited agents
268
+ - **The gap between rule-based (0.926) and LLM baseline (0.855 avg Qwen) is exactly what GRPO training should close**
269
+
270
+ ### What the Scoring Reveals
271
+
272
+ The rule-based baseline loses points on two dimensions:
273
+
274
+ - **Observability**: the run log requires specific structured entries (`task_id=`, `page=N`, `request=N`, `complete=true`); a naive agent that omits these loses up to 10 points
275
+ - **Efficiency**: fault-injection tasks (T4/T5/T6) require one or more retries, consuming extra request budget against the task baseline
276
+
277
+ The LLM agent improves on Observability (naturally verbose logs) but sometimes regresses on Efficiency (unnecessary fetches). This trade-off is exactly what GRPO gradient training would optimize: with a local HuggingFace model, the clipped surrogate loss would push the policy toward efficient tool sequences while the KL penalty prevents forgetting correct pagination behavior.
278
+
279
+ ---
280
+
281
+ ## OpenEnv Integration
282
+
283
+ The environment follows the OpenEnv contract exactly:
284
+
285
+ ```python
286
+ class ComtradeEnvironment(MCPEnvironment):
287
+ SUPPORTS_CONCURRENT_SESSIONS: bool = True # parallel training episodes
288
+
289
+ def reset(self, task_id=None, seed=None, **kwargs) -> Observation: ...
290
+ def _step_impl(self, action: Action, **kwargs) -> Observation: ...
291
+ ```
292
+
293
+ Agents interact via MCP tools, never via direct method calls. The reward is computed entirely inside the environment — the agent cannot inspect or manipulate the judge. This aligns with OpenEnv's core invariant: *rewards inside environment, not external*.
294
+
295
+ The mock service starts as an embedded subprocess on `reset()` and is torn down with the environment, making each Docker container self-contained.
296
+
297
+ ---
298
+
299
+ ## Running the Environment
300
+
301
+ ```bash
302
+ # Clone the repo (environment + agent are in one repo)
303
+ git clone https://github.com/yonghongzhang-io/comtrade-openenv
304
+ cd comtrade-openenv
305
+
306
+ # Install OpenEnv framework
307
+ pip install openenv-core[core]
308
+
309
+ # Rule-based smoke test — no LLM, no external server needed
310
+ # (InProcessEnvClient auto-starts mock service in-process)
311
+ python agent/smoke_test.py --task T1_single_page
312
+ python agent/smoke_test.py --task T7_totals_trap
313
+ python agent/smoke_test.py --task T8_mixed_faults
314
+
315
+ # Run unit + integration tests
316
+ pip install pytest
317
+ python -m pytest agent/tests/ -v
318
+
319
+ # Train with GRPO via local Ollama/vLLM (rollout-only, no GPU required)
320
+ python agent/train_grpo.py \
321
+ --api-url http://localhost:11434/v1 \
322
+ --api-model qwen2.5:7b \
323
+ --num-iterations 200 \
324
+ --max-workers 4
325
+
326
+ # Train with gradient updates (requires GPU + HuggingFace model)
327
+ python agent/train_grpo.py \
328
+ --hf-model Qwen/Qwen2.5-7B-Instruct \
329
+ --num-iterations 200 \
330
+ --output-dir ./checkpoints
331
+ ```
332
+
333
+ No external OpenEnv server is needed — `InProcessEnvClient` wraps the environment directly, with parallel rollout support via `ThreadPoolExecutor`.
334
+
335
+ ---
336
+
337
+ ## Design Decisions and Lessons Learned
338
+
339
+ **Stateless mock service is essential.** The first implementation used per-session state in the mock service, which caused race conditions when multiple agents ran concurrently during GRPO rollouts. Switching to stateless `/api/data` with per-task `_API_STATE` dictionaries eliminated the issue entirely.
340
+
341
+ **Three tools is the right abstraction.** Early prototypes had separate tools for setting query parameters and for pagination. Collapsing to `get_task_info` + `fetch_page` + `submit_results` reduced token overhead and made the tool-use pattern easier for the model to learn.
342
+
343
+ **Protocol-level dedup beats prompt-level dedup.** Telling the model "deduplicate records" in the system prompt is fragile — the model may not track state correctly across long contexts. Instead, the agent loop handles dedup mechanically using a Python dict keyed by primary key. The model only needs to decide *when* to call which tool.
344
+
345
+ **Observability scoring drives good agent habits.** The 10-point observability dimension, which requires structured log entries (`task_id=`, `page=N`, `request=N`, `complete=true`), incentivizes the agent to maintain explicit execution state. This is valuable beyond scoring: structured logs are how real ETL pipelines are debugged.
346
+
347
+ ---
348
+
349
+ ## Links
350
+
351
+ - **Environment**: [github.com/yonghongzhang-io/comtrade-openenv](https://github.com/yonghongzhang-io/comtrade-openenv)
352
+ - **HF Space**: [huggingface.co/spaces/yonghongzhang/comtrade-env](https://huggingface.co/spaces/yonghongzhang/comtrade-env)
353
+ - **Full competition repo**: [github.com/yonghongzhang-io/AIAgentCompetition-phase2](https://github.com/yonghongzhang-io/AIAgentCompetition-phase2)
354
+ - **OpenEnv framework**: [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)