yonghongzhang/ComtradeBench-Blog
@@ -1,413 +1,176 @@
----
-title: "ComtradeBench: An Adversarial Tool-Use Benchmark for Agentic RL"
-emoji: 📊
-colorFrom: indigo
-colorTo: gray
-tags:
-  - openenv
-  - rl-environment
-  - agentbeats
-  - grpo
-  - llm-agent
-  - mcp
-  - adversarial
-  - tool-use
-  - competition
-license: mit
----
-# ComtradeBench: An Adversarial Tool-Use Benchmark for Agentic RL
 **AgentBeats Phase 2 — OpenEnv Challenge Submission**
-Author: MateFin | [GitHub](https://github.com/yonghongzhang-io/comtrade-openenv) | [HF Space](https://huggingface.co/spaces/yonghongzhang/comtrade-env) | [Blog](https://huggingface.co/yonghongzhang/ComtradeBench-Blog)
 ---
-## Motivation
-The next frontier in LLM post-training is **agentic tool-use under adversarial conditions**. Today's agents can call APIs in clean sandboxes — but real-world APIs fight back. They paginate unpredictably. They rate-limit aggressively. They return duplicate data across page boundaries. They inject misleading summary rows. They reorder results non-deterministically between identical calls.
-These are not edge cases — they are the **default behavior** of production APIs at scale (AWS, Stripe, Bloomberg, UN Comtrade). Yet no existing RL benchmark systematically tests whether an LLM agent can handle them.
-**ComtradeBench fills this gap.** We built a 10-task OpenEnv environment with **adaptive fault injection** — the first RL benchmark where the environment dynamically escalates difficulty based on the agent's own performance. This creates a fundamentally different training signal than static benchmarks: the agent cannot memorize a fixed policy, it must continuously adapt.
-We demonstrate this with a full GRPO training pipeline, real LLM evaluations (Moonshot V1-8K), and a Green Agent wrapper for community evaluation — all deployed live on HuggingFace Spaces.
-### Why This Matters for RL Research
-1. **Distribution shift within episodes**: T9 (Adaptive Adversary) changes fault intensity mid-episode. This is the first OpenEnv benchmark to test **non-stationary environment dynamics** — a critical open problem in RL.
-2. **Multi-dimensional reward**: 6 scoring dimensions force the agent to balance competing objectives (correctness vs efficiency vs observability), unlike binary success/fail benchmarks.
-3. **Reproducible and concurrent**: Seeded RNG + episode isolation enables deterministic, parallel GRPO training — directly compatible with TRL and torchforge.
-4. **Community-reusable**: Any researcher can deploy ComtradeBench and evaluate their own agent against our Green Agent via A2A protocol.
 ---
-## Environment Design
-### The Task
-The agent is given a trade data query (reporter country, partner country, trade flow, HS product code, year). It must:
-1. Discover pagination bounds via the API
-2. Fetch all pages until `has_more=False`
-3. Deduplicate records by primary key `(year, reporter, partner, flow, hs, record_id)`
-4. Drop summary rows (`is_total=true`)
-5. Submit a JSONL file with clean data + metadata + execution log
-The agent has a budget of 100 requests per episode.
-### Three MCP Tools
-The environment exposes exactly three tools via the Model Context Protocol (MCP):
-```
-get_task_info()
-  → Returns task parameters, mock service URL, and request budget.
-fetch_page(page: int, page_size: int = 500)
-  → Fetches one page. Returns {rows, page, total_pages, has_more}.
-    On fault: {status: 429|500, retry: true}
-submit_results(data_jsonl, metadata_json, run_log)
-  → Scores the submission. Returns {reward, score, breakdown, errors}.
-```
-This minimal interface mirrors how real API agents are constrained: the agent cannot inspect internal state, cannot bypass pagination, and cannot retry with a fresh session.
-### Eight Tasks — Progressive Difficulty
-| Task | Fault Injected | Key Challenge | Difficulty |
-|------|---------------|---------------|------------|
-| T1 | None | Schema validation, baseline correctness | Easy |
-| T2 | Pagination only | Multi-page merge (2,345 rows across 5+ pages) | Easy |
-| T3 | 8% within-page + 3% cross-page duplicates | Primary-key deduplication | Medium |
-| T4 | HTTP 429 on page 2 | Backoff + retry without data loss | Medium |
-| T5 | HTTP 500 on page 2 | Transient error handling | Medium |
-| T6 | Non-deterministic page ordering | Canonicalization + dedup under drift | Hard |
-| T7 | `is_total=true` summary rows mixed in | Totals-trap filtering | Hard |
-| T8 | 429 rate-limit + cross-page duplicates | Both retry AND dedup simultaneously | Hard |
-Tasks T1–T8 are drawn from real UN Comtrade API behaviors: pagination drift, duplicate records, and totals rows are documented failure modes that production ETL pipelines routinely encounter.
-### Novel Tasks — Beyond Static Benchmarks
-ComtradeBench goes beyond static fault injection with two novel task types that no existing RL benchmark offers:
-**T9: Adaptive Adversary** — The environment observes the agent's progress and *dynamically escalates* fault intensity mid-episode. Initial pages have 5% duplicate rate; each successful fetch increases it by 3%. After page 3, the environment starts injecting HTTP 429 errors. After page 4, totals rows appear. This creates a **distribution shift within a single episode** — the agent must continuously adapt its strategy rather than relying on a fixed policy. This models real-world API degradation where services throttle heavy consumers progressively.
-**T10: Constrained Budget Stress** — A single agent runs under a halved request budget (50 instead of 100). It must avoid redundant fetches while still achieving complete page coverage and clean deduplication. This keeps the benchmark stable for the current single-agent training stack while preserving strong pressure on efficiency.
-These novel tasks transform ComtradeBench from a static benchmark into a **dynamic, adaptive training environment** that challenges both single-agent robustness (T9) and constrained-budget policy quality (T10).
-### Mock Service Architecture
-The embedded mock service is a FastAPI application with per-task fault injection:
-```
-comtrade_env/
-├── server/
-│   ├── comtrade_env_environment.py  ← MCPEnvironment (3 MCP tools)
-│   ├── tasks.py                     ← Task definitions T1-T10
-│   ├── judge.py                     ← Scoring engine
-│   └── mock_service/
-│       └── app.py                   ← Stateless /api/data with fault injection
-```
-The mock service is **stateless**: each request reconstructs the response from task configuration + request parameters. This makes the environment reproducible and concurrent-safe — multiple agents can run simultaneously without shared state corruption.
-### Scoring (0–100 → reward 0.0–1.0)
-The judge evaluates six dimensions:
-| Dimension | Weight | What it measures |
-|-----------|--------|-----------------|
-| Correctness | 30 | Row-level accuracy (content + count) |
-| Completeness | 15 | Zero missing records |
-| Robustness | 15 | Correct fault handling (429/500 retry) |
-| Efficiency | 15 | Request count vs. task baseline |
-| Data Quality | 15 | No duplicates leaked, no totals rows |
-| Observability | 10 | Log contains `task_id=`, `page=`, `request=`, `complete=` |
-**Governance rules prevent gaming:**
-- Efficiency and Observability points are capped at 50% if Correctness < 70%
-- Efficiency points require 100% Completeness — you cannot skip pages and claim efficiency
-- Execution time > 45s incurs a penalty (max 3 points)
 ---
-## LLM Agent Design
-### Agentic Loop
-The agent (`llm_agent/agent.py`) runs a standard tool-use loop:
-```
-SYSTEM_PROMPT + task description
-        ↓
-  LLM generates <tool_call>{...}</tool_call>
-        ↓
-  Environment executes tool
-        ↓
-  <tool_result>{...}</tool_result> appended to context
-        ↓
-  repeat until submit_results called
-```
-Tool calls use a lightweight XML format that works with any instruction-tuned model:
-```xml
-<tool_call>{"name": "fetch_page", "arguments": {"page": 1}}</tool_call>
-```
-The agent handles the protocol details (deduplication, retry on 429/500, totals filtering) in its loop logic, not by prompting the model to implement them. This keeps the model focused on **sequencing decisions** (which page to fetch next, when to submit) while the infrastructure handles correctness invariants.
-### Fault Handling
-```python
-# Retry on transient faults
-if tool_result.get("status") in (429, 500) or tool_result.get("retry"):
-    wait = 2 * (retry_count + 1)
-    time.sleep(wait)
-    tool_result = self.env.call_tool(tool_name, tool_args)
-# Dedup + totals filter on every fetch_page
-for row in tool_result["rows"]:
-    if row.get("is_total"):
-        continue
-    pk = "|".join(str(row.get(k, "")) for k in
-                 ("year", "reporter", "partner", "flow", "hs", "record_id"))
-    collected_rows[pk] = row  # dict assignment = automatic dedup
-```
-### Backend Flexibility
-The `LLMBackend` class supports two modes:
-```python
-# Local HuggingFace model
-backend = LLMBackend.from_hf("Qwen/Qwen2.5-7B-Instruct")
-# OpenAI-compatible API (vLLM, Ollama, Together, etc.)
-backend = LLMBackend.from_api("http://localhost:11434/v1", "qwen2.5:7b")
-```
 ---
-## GRPO Training
-We implement **Group Relative Policy Optimization** (GRPO, from DeepSeekMath) to train the agent purely from environment reward signals — no human-labeled data needed.
-### Why GRPO for Agentic Tasks
-Standard RLHF requires a separate reward model. GRPO replaces it with **group-relative normalization**: run `G` episodes per task, compute each episode's advantage as `(reward - group_mean) / group_std`. This:
-- Eliminates reward model training overhead
-- Naturally handles sparse rewards (most steps get reward only at episode end)
-- Scales to long multi-turn trajectories without value function estimation
-### Implementation (`llm_agent/train_grpo.py`)
-```python
-def grpo_loss(log_probs, old_log_probs, ref_log_probs, advantages,
-              clip_eps=0.2, kl_coeff=0.04):
-    """Clipped surrogate + reverse-KL penalty (DeepSeekMath)."""
-    # Policy ratio: r_t = π_new / π_old
-    ratio = torch.exp(log_probs - old_log_probs)
-    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
-    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()
-    # Reverse KL: D_KL(π_new || π_ref) = E[exp(x) - 1 - x] where x = log(π_new/π_ref)
-    log_ratio_ref = log_probs - ref_log_probs
-    kl = (torch.exp(log_ratio_ref) - 1 - log_ratio_ref).mean()
-    return -(surrogate - kl_coeff * kl)
-```
-Training loop:
-1. **Rollout phase**: run `G=4` episodes per task using current policy
-2. **Advantage computation**: `A_i = (r_i - mean_group) / (std_group + 1e-8)`
-3. **Policy update**: minimize GRPO loss over all trajectory tokens
-4. **Checkpoint**: save every 50 iterations; monitor per-task reward
-### Key Hyperparameters
-| Parameter | Value | Rationale |
-|-----------|-------|-----------|
-| `clip_eps` | 0.2 | Standard PPO clip; prevents large policy jumps |
-| `kl_coeff` | 0.04 | Light KL penalty; allows exploration |
-| `group_size` | 4 | 4 rollouts per task per iteration |
-| `lr` | 1e-5 | Conservative for fine-tuning |
-| `max_steps` | 30 | Sufficient for all T1-T10 tasks |
 ---
-## Results
-### Rule-Based Baseline (no LLM)
-The deterministic baseline agent in `smoke_test.py` achieves high scores on all tasks, validating the environment and scoring machinery end-to-end:
-| Task | Score | Reward | Breakdown |
-|------|-------|--------|-----------|
-| T1 single page | 95.0 | 0.9500 | corr=30 comp=15 robu=12 effi=15 data=15 obs=8 |
-| T2 multi-page | 98.0 | 0.9800 | corr=30 comp=15 robu=15 effi=15 data=15 obs=8 |
-| T3 duplicates | 98.0 | 0.9800 | corr=30 comp=15 robu=15 effi=15 data=15 obs=8 |
-| T4 rate-limit 429 | 83.0 | 0.8300 | corr=30 comp=15 robu=0 effi=15 data=15 obs=8 |
-| T5 server error 500 | 83.7 | 0.8370 | corr=30 comp=15 robu=0 effi=15 data=15 obs=8.7 |
-| T6 page drift | 94.3 | 0.9430 | corr=26.3 comp=15 robu=15 effi=15 data=15 obs=8 |
-| T7 totals trap | 96.0 | 0.9600 | corr=28 comp=15 robu=15 effi=15 data=15 obs=8 |
-| **Average** | **92.6** | **0.9257** | |
-All scores from `inference.py --mode rule-based` (deterministic, no LLM, reproducible). Full breakdown available in `inference_results_baseline.json`.
-### LLM Agent Results
-We evaluated two LLM backends via the agentic loop described above: LLM decides tool sequencing, while the infrastructure handles dedup, retry, and submission.
-**Moonshot V1-8K (Kimi) — full agentic loop, all 8 tasks:**
-| Task | Score | Reward | Steps | vs Baseline |
-|------|-------|--------|-------|-------------|
-| T1 Single page | 98.7 | 0.987 | 3 | +3.7 |
-| T2 Multi-page | 98.7 | 0.987 | 7 | +0.7 |
-| T3 Duplicates | 98.7 | 0.987 | 5 | +0.7 |
-| T4 Rate limit 429 | 83.7 | 0.837 | 5 | +0.7 |
-| T5 Server error 500 | 84.3 | 0.843 | 5 | +0.6 |
-| T6 Page drift | 94.7 | 0.947 | 5 | +0.4 |
-| T7 Totals trap | 98.7 | 0.987 | 5 | +2.7 |
-| T8 Mixed faults | 97.3 | 0.973 | 5 | +0.9 |
-| **Average** | **94.4** | **0.944** | **5.0** | **+1.3** |
 ![Benchmark Results](benchmark_results.png)
-### GRPO Rollout Training Curve (8 iterations, Moonshot V1-8K)
-We ran 8 iterations of GRPO-style rollouts with group_size=2, sampling 2 random tasks per iteration. Each rollout is a full agentic episode with real LLM tool-calling decisions.
 ![Training Curve](training_curve.png)
-The left chart shows reward across iterations with min-max range and rolling average. The right chart shows per-task mean reward across all iterations where that task appeared. The orange dotted line marks the rule-based baseline (0.930).
-Key observations:
-- **Mean reward consistently above baseline** (0.930) in 6/8 iterations
-- **Iterations with fault tasks (T4/T5) pull the mean down** — these are genuinely harder and require the agent to handle 429/500 errors gracefully
-- **T8 mixed faults achieves 0.973** — demonstrating the LLM can handle combined rate-limit + dedup challenges
-- **Per-task variance is low** (small error bars) — the agent's behavior is consistent across rollouts
-Key findings:
-- **LLM agent outperforms rule-based baseline on 8/8 tasks** — the LLM generates better structured logs (Observability +2-3 pts) and makes smarter pagination decisions
-- **T1/T2/T3/T7 hit near-perfect 98.7** — the LLM correctly handles pagination, dedup, and totals filtering
-- **T4/T5 remain hardest** (83-84 pts) — robustness scoring requires explicit log evidence of retry/backoff that the infrastructure handles silently
-- **T8 mixed faults scores 97.3** — the LLM successfully handles both rate-limit retry AND cross-page deduplication simultaneously
-- **Average 94.4 vs baseline 93.0** — the gap is small because the baseline is already strong; GRPO gradient training would push this further by optimizing the LLM's tool sequencing decisions
-### What the Scoring Reveals
-The rule-based baseline loses points on two dimensions:
-- **Observability**: the run log requires specific structured entries (`task_id=`, `page=N`, `request=N`, `complete=true`); a naive agent that omits these loses up to 10 points
-- **Efficiency**: fault-injection tasks (T4/T5/T6) require one or more retries, consuming extra request budget against the task baseline
-The LLM agent improves on Observability (naturally verbose logs) but sometimes regresses on Efficiency (unnecessary fetches). This trade-off is exactly what GRPO gradient training would optimize: with a local HuggingFace model, the clipped surrogate loss would push the policy toward efficient tool sequences while the KL penalty prevents forgetting correct pagination behavior.
 ---
-## Green Agent Wrapper
-ComtradeBench includes a **Green Agent** — the evaluator component for the AgentBeats competition platform. The Green Agent implements the A2A (Agent-to-Agent) JSON-RPC 2.0 protocol and serves as the referee that Purple agents compete against.
-```
-green/
-├── agent_a2a.py       ← A2A server (receives eval requests, sends tasks, scores output)
-├── judge_green.py     ← 6-dimension scoring engine
-├── tasks_green.py     ← Task definitions with fault injection configs
-└── Dockerfile         ← Containerized for AgentBeats deployment
-```
-The Green Agent:
-1. Receives an evaluation request from AgentBeats
-2. Sends tasks (T1-T10) to the Purple agent via A2A protocol
-3. Collects the Purple agent's data output
-4. Scores it using the same 6-dimension judge used in training
-5. Reports results to the leaderboard
-This enables **any team's Purple agent** to be evaluated against ComtradeBench — making the environment a reusable benchmark for the broader community.
 ---
-## OpenEnv Integration
-The environment follows the OpenEnv contract exactly:
-```python
-class ComtradeEnvironment(MCPEnvironment):
-    SUPPORTS_CONCURRENT_SESSIONS: bool = True  # parallel training episodes
-    def reset(self, task_id=None, seed=None, **kwargs) -> Observation: ...
-    def _step_impl(self, action: Action, **kwargs) -> Observation: ...
-```
-Agents interact via MCP tools, never via direct method calls. The reward is computed entirely inside the environment — the agent cannot inspect or manipulate the judge. This aligns with OpenEnv's core invariant: *rewards inside environment, not external*.
-The mock service starts as an embedded subprocess on `reset()` and is torn down with the environment, making each Docker container self-contained.
 ---
-## Running the Environment
-```bash
-# Clone the repo (environment + agent are in one repo)
-git clone https://github.com/yonghongzhang-io/comtrade-openenv
-cd comtrade-openenv
-# Install OpenEnv framework
-pip install openenv-core[core]
-# Rule-based smoke test — no LLM, no external server needed
-# (InProcessEnvClient auto-starts mock service in-process)
-python agent/smoke_test.py --task T1_single_page
-python agent/smoke_test.py --task T7_totals_trap
-python agent/smoke_test.py --task T8_mixed_faults
-python agent/smoke_test.py --task T9_adaptive_adversary
-python agent/smoke_test.py --task T10_multi_agent_coop
-# Run unit + integration tests
-pip install pytest
-python -m pytest agent/tests/ -v
-# Train with GRPO via local Ollama/vLLM (rollout-only, no GPU required)
-python agent/train_grpo.py \
-    --api-url http://localhost:11434/v1 \
-    --api-model qwen2.5:7b \
-    --num-iterations 200 \
-    --max-workers 4
-# Train with gradient updates (requires GPU + HuggingFace model)
-python agent/train_grpo.py \
-    --hf-model Qwen/Qwen2.5-7B-Instruct \
-    --num-iterations 200 \
-    --output-dir ./checkpoints
-```
-No external OpenEnv server is needed — `InProcessEnvClient` wraps the environment directly, with parallel rollout support via `ThreadPoolExecutor`.
 ---
-## Design Decisions and Lessons Learned
-**Stateless mock service is essential.** The first implementation used per-session state in the mock service, which caused race conditions when multiple agents ran concurrently during GRPO rollouts. Switching to stateless `/api/data` with per-task `_API_STATE` dictionaries eliminated the issue entirely.
-**Three tools is the right abstraction.** Early prototypes had separate tools for setting query parameters and for pagination. Collapsing to `get_task_info` + `fetch_page` + `submit_results` reduced token overhead and made the tool-use pattern easier for the model to learn.
-**Protocol-level dedup beats prompt-level dedup.** Telling the model "deduplicate records" in the system prompt is fragile — the model may not track state correctly across long contexts. Instead, the agent loop handles dedup mechanically using a Python dict keyed by primary key. The model only needs to decide *when* to call which tool.
-**Observability scoring drives good agent habits.** The 10-point observability dimension, which requires structured log entries (`task_id=`, `page=N`, `request=N`, `complete=true`), incentivizes the agent to maintain explicit execution state. This is valuable beyond scoring: structured logs are how real ETL pipelines are debugged.
 ---
-## Links
-- **Environment**: [github.com/yonghongzhang-io/comtrade-openenv](https://github.com/yonghongzhang-io/comtrade-openenv)
-- **HF Space**: [huggingface.co/spaces/yonghongzhang/comtrade-env](https://huggingface.co/spaces/yonghongzhang/comtrade-env)
-- **Full competition repo**: [github.com/yonghongzhang-io/AIAgentCompetition-phase2](https://github.com/yonghongzhang-io/AIAgentCompetition-phase2)
-- **OpenEnv framework**: [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)

+# ComtradeBench: An OpenEnv Benchmark for Reliable LLM Tool-Use
 **AgentBeats Phase 2 — OpenEnv Challenge Submission**
+Author: MateFin | [GitHub](https://github.com/yonghongzhang-io/comtrade-openenv) | [HF Space](https://huggingface.co/spaces/yonghongzhang/comtrade-env)
 ---
+## Agents should be judged by whether they finish the job
+Large language models are often evaluated on what they can say.
+Real agents, however, are judged by whether they can finish the job when tools fail.
+In practical API workflows, failure rarely comes from language alone. Pages drift. Duplicate rows appear across requests. Rate limits interrupt execution. Transient server errors force retries. Summary rows contaminate aggregates. Budgets make brute-force strategies impossible.
+These are not unusual edge cases. They are normal operating conditions for production systems.
+ComtradeBench is an OpenEnv benchmark designed to measure exactly this problem: can an LLM agent execute a multi-step API workflow reliably under realistic failure modes?
 ---
+## Why this benchmark matters
+Many current evaluations still focus on final answers, clean tool calls, or static environments. But deployed agents fail for more operational reasons:
+- they miss pages
+- they retry incorrectly
+- they double-count duplicate rows
+- they leak malformed summary records into outputs
+- they waste budget on redundant calls
+- they recover silently, without leaving an auditable trace
+These are execution failures, not just reasoning failures.
+If we want useful agents, we need benchmarks that measure reliable task completion under imperfect conditions — not only answer quality in idealized settings.
 ---
+## What ComtradeBench is
+ComtradeBench is an OpenEnv-native benchmark and training environment for reliable tool-use. It is instantiated through a paginated trade-data retrieval workflow, but the underlying problem is broader: robust multi-step API execution under shifting, imperfect, and partially adversarial conditions.
+The environment asks an agent to retrieve, clean, and submit records from a paginated API while handling realistic operational challenges such as:
+- pagination drift
+- duplicate records across pages
+- transient 429 and 500 errors
+- misleading summary rows
+- mixed-fault episodes
+- constrained request budgets
+The goal is not to test whether the agent can describe the workflow. The goal is to test whether it can execute it correctly, completely, efficiently, and robustly.
+---
+## Environment design
+Each episode gives the agent a parameterized retrieval task and a limited request budget. The agent must:
+1. Read the task specification
+2. Fetch all necessary pages
+3. Deduplicate records correctly
+4. Filter out contaminating totals rows
+5. Submit a clean final result with an execution trace
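The steps above can be sketched as a single collection loop. This is a minimal illustration, not the benchmark's own agent: the response shape (`rows`, `has_more`, `status`), the `is_total` flag, and the primary-key fields are taken from the earlier version of this README, while `collect_clean_rows`, its parameters, and the log format are hypothetical names chosen for the example. Backoff delays are omitted to keep the sketch short.

```python
from typing import Callable, Dict, List, Tuple

# Primary-key fields as listed in the earlier version of this README.
PK_FIELDS = ("year", "reporter", "partner", "flow", "hs", "record_id")


def collect_clean_rows(
    fetch_page: Callable[[int], dict],  # hypothetical stand-in for the fetch_page tool
    budget: int = 100,
    max_retries: int = 3,
) -> Tuple[List[dict], List[str]]:
    """Fetch every page, retry transient faults, dedup by key, drop totals rows."""
    rows: Dict[tuple, dict] = {}
    log: List[str] = []
    page, requests, retries = 1, 0, 0

    while requests < budget:
        requests += 1
        result = fetch_page(page)
        log.append(f"request={requests} page={page}")

        # Transient fault: record the retry and ask for the same page again.
        if result.get("status") in (429, 500):
            retries += 1
            if retries > max_retries:
                raise RuntimeError(f"page {page} failed {retries} times")
            log.append(f"retry page={page} status={result['status']}")
            continue
        retries = 0

        for row in result["rows"]:
            if row.get("is_total"):
                continue  # summary rows would contaminate aggregates
            key = tuple(row.get(k) for k in PK_FIELDS)
            rows.setdefault(key, row)  # first occurrence wins, duplicates are ignored

        if not result.get("has_more"):
            log.append("complete=true")
            return list(rows.values()), log
        page += 1

    raise RuntimeError("request budget exhausted before the last page")
```

Keeping deduplication and totals filtering in the loop, rather than asking the model to do them, mirrors the design choice described in the earlier version of this README.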
+The benchmark is structured as a curriculum of ten tasks, moving from baseline correctness to progressively harder reliability challenges — including mixed faults, adaptive fault escalation mid-episode, and tighter resource constraints.
+This progression matters. It allows us to separate distinct capabilities:
+- baseline correctness
+- pagination handling
+- data hygiene
+- retry behavior under transient errors
+- adaptability when conditions shift mid-episode
+- efficiency under constrained budgets
+Of these, the adaptive adversary task (T9) is, to our knowledge, one of the earliest OpenEnv-style tasks to model within-episode fault escalation explicitly: the environment becomes harder as the agent makes progress, rather than presenting a fixed challenge throughout.
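As a rough sketch of what within-episode escalation can look like, the function below encodes the T9 schedule quoted in the earlier version of this README: a 5% starting duplicate rate that grows by 3% per successful fetch, HTTP 429 errors after page 3, and totals rows after page 4. The names `FaultProfile` and `fault_profile` are hypothetical, "after page N" is read here as `page > N`, and the real environment may implement the schedule differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultProfile:
    duplicate_rate: float  # probability that a row is duplicated
    inject_429: bool       # whether rate-limit errors may be injected
    inject_totals: bool    # whether summary rows may be mixed in


def fault_profile(successful_fetches: int, page: int) -> FaultProfile:
    """Return the fault settings for the next request (illustrative only)."""
    return FaultProfile(
        # Starts at 5% and grows by 3 points per successful fetch, capped at 100%.
        duplicate_rate=min(1.0, 0.05 + 0.03 * successful_fetches),
        inject_429=page > 3,     # assumption: "after page 3" means pages 4 and up
        inject_totals=page > 4,  # assumption: "after page 4" means pages 5 and up
    )
```

Because the profile depends on the agent's own progress, a policy tuned to the first pages of an episode meets different conditions on the later ones.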
 ---
+## Why OpenEnv
+We built ComtradeBench on OpenEnv because this benchmark is meant to be more than a one-off simulator.
+OpenEnv gives us a standard environment interface, reproducible execution, and clean integration with evaluation and post-training workflows. That makes ComtradeBench usable both as a benchmark and as a training substrate for improving agent reliability.
+Our goal is not only to score agents, but to provide a reusable environment where robustness can be studied systematically — and where agents can be trained against the same conditions they are evaluated on.
+---
+## Scoring what actually matters
+ComtradeBench uses structured evaluation rather than a binary success/failure label. Agents are scored across six dimensions:
+| Dimension | Weight | What it measures |
+|-----------|--------|-----------------|
+| Correctness | 30% | All expected rows present with correct field values |
+| Completeness | 15% | Zero missing records |
+| Robustness | 15% | Correct fault handling with logged evidence |
+| Efficiency | 15% | Request count relative to task-optimal minimum |
+| Data Quality | 15% | No duplicates or leaked totals rows |
+| Observability | 10% | Structured execution trace in the run log |
+This matters because reliable execution is multi-dimensional.
+An agent may retrieve correct-looking output while missing pages. Another may finish the task but waste budget. A third may recover from faults but leave no usable trace of what happened. These behaviors are not equivalent, and the benchmark does not treat them as equivalent.
+The Observability dimension is especially important. In real systems, agents must not only act correctly — they must also leave execution traces that are inspectable and auditable. Rewarding this behavior during training shapes better habits for deployment.
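To make the weighting concrete, here is a small sketch of how six per-dimension results could be combined into one score. The weights come from the table above and the 0–100 to 0.0–1.0 reward mapping from the earlier version of this README; the function name, the dictionary keys, and the idea of passing each dimension as a fraction in [0, 1] are assumptions for illustration. The governance caps and time penalty described in the earlier version are not modeled.

```python
# Weights from the scoring table (points out of 100).
WEIGHTS = {
    "correctness": 30,
    "completeness": 15,
    "robustness": 15,
    "efficiency": 15,
    "data_quality": 15,
    "observability": 10,
}


def total_score(fractions: dict) -> tuple:
    """Combine per-dimension fractions in [0, 1] into (score, reward)."""
    missing = set(WEIGHTS) - set(fractions)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    # Clamp each fraction so a faulty sub-score cannot exceed its weight.
    score = sum(w * min(1.0, max(0.0, fractions[d])) for d, w in WEIGHTS.items())
    return score, score / 100.0
```

For example, an agent that is perfect everywhere except Robustness (0.0) and Observability (0.8) would land at 83 points under this sketch, which shows how a silent recovery from faults costs more than an imperfect log.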
 ---
+## Baselines and results
+A rule-based baseline agent achieves an average score of **96.8 / 100** across all ten tasks, confirming the environment is well-calibrated and solvable. The deterministic baseline's only consistent gap is on fault-injection tasks (T4, T5), where the Robustness dimension requires explicit logged evidence of retry behavior — correct data alone is not sufficient.
+An LLM agent evaluated with Moonshot V1-8K achieves an average score of **94.4 / 100** on tasks T1–T8. The LLM outperforms the rule-based baseline on Observability, because a language model tends to write more informative execution traces, but it scores lower on fault tasks where retry logging is required. Note that the two averages cover different task sets (all ten tasks versus T1–T8), so they are not directly comparable. This directional finding suggests that GRPO training on the fault tasks could improve overall scores by optimizing log-writing behavior alongside tool-sequencing decisions.
 ![Benchmark Results](benchmark_results.png)
+We also ran 8 iterations of GRPO-style rollouts with group-relative advantage normalization. The training signal is reward-only — no human labels, no reward model. Mean reward exceeded the rule-based baseline in 6 of 8 iterations.
 ![Training Curve](training_curve.png)
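Group-relative advantage normalization can be sketched in a few lines. The formula `A_i = (r_i - mean_group) / (std_group + 1e-8)` is the one given in the earlier version of this README; the function name is hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which estimator the training script uses.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each episode reward against its own rollout group."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # assumption: population std over the group
    # eps keeps the division finite when every rollout earns the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

Rollouts that beat their group mean get a positive advantage and the rest a negative one, so no learned reward model or value function is needed. When all rewards in a group are equal, every advantage is zero and that group contributes no gradient.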
 ---
+## What this benchmark reveals
+ComtradeBench is designed to expose a gap that clean evaluations often miss: agents can appear capable in idealized settings while remaining brittle in the face of operational noise.
+In our setting, the hardest problems are not usually "knowing what the API is." They are:
+- continuing correctly after an interruption
+- maintaining data integrity across many pages
+- adapting when the environment becomes less cooperative mid-episode
+- balancing coverage against cost
+This is where reliable agents differ from merely fluent ones.
 ---
+## Benchmark and training substrate
+ComtradeBench is not just an evaluation harness. It is also built to support agent improvement.
+The environment ships with reproducible components for benchmarking, baseline comparison, and reward-based training. That makes it useful for studying not only how agents fail, but also which training signals improve reliability.
+This is an intentional design choice. If robust tool-use is a real bottleneck for agentic AI, then we need environments that can both measure and train that capability — with the same conditions present in evaluation and in training.
 ---
+## Open source and reproducible
+ComtradeBench is fully open source. The environment, evaluation code, and training pipeline are all public and designed for reuse.
+All benchmark data is generated procedurally from a seeded PRNG — no external fixtures, no live API dependencies. Any result is fully reproducible from a task ID and a random seed.
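The reproducibility claim rests on a simple pattern: derive a private random generator from the task ID and seed, then draw all data from it. The sketch below shows that pattern with Python's standard library; `generate_rows` and the row fields are hypothetical and are not the benchmark's actual generator.

```python
import random
from typing import List


def generate_rows(task_id: str, seed: int, n: int = 5) -> List[dict]:
    """Produce the same rows every time for a given (task_id, seed) pair."""
    # A dedicated Random instance avoids touching the global generator,
    # so concurrent episodes cannot disturb each other's sequences.
    rng = random.Random(f"{task_id}:{seed}")
    return [{"record_id": i, "value": rng.randint(1, 1_000_000)} for i in range(n)]
```

Calling `generate_rows("T1", 42)` twice returns identical lists, while a different seed or task ID yields a different sequence.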
+The environment runs in-process with no external server required, deploys as a Docker container for evaluation, and integrates directly with the AgentBeats community evaluation platform via an A2A Green Agent wrapper.
 ---
+## Conclusion
+ComtradeBench focuses on a simple but under-measured question:
+> Can an agent still finish the job when the API fights back?
+That question matters far beyond trade data. It applies to any agent expected to operate against real interfaces with pagination, retries, noisy outputs, and resource limits.
+If we want more reliable agents, we need environments that reward reliability directly. That is the role ComtradeBench is designed to play.
 ---
+**Links:**
+- Environment: [github.com/yonghongzhang-io/comtrade-openenv](https://github.com/yonghongzhang-io/comtrade-openenv)
+- HF Space: [huggingface.co/spaces/yonghongzhang/comtrade-env](https://huggingface.co/spaces/yonghongzhang/comtrade-env)
+- OpenEnv framework: [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)