---
sdk: static
---

# ToolGym

**ToolGym** is an **open-world tool-using environment** for *scalable agent testing and data curation*.

> Large tool pools • long-horizon workflows • wild constraints • unreliable tool states

---

## Quick links

- 🏆 **[Leaderboard](https://huggingface.co/spaces/ToolGym/leaderboard)**
- 📦 **Dataset(s)**: `/datasets/ToolGym/ToolGym`
- 📄 **[Paper](https://arxiv.org/abs/2601.06328)**
- 💻 **[Code](https://github.com/Ziqiao-git/ToolGym)**

---

## Key highlights

- **5,571** validated tools (unified in **MCP format**)
- **204** real-world apps covered, from **276** MCP servers
- Long-horizon, constraint-dense tasks
  - Avg. **28.5** tool-use rounds per task (**averaged across evaluated models**)
- A **State Controller** that injects realistic failures and drift (timeouts, rate limits, transient unavailability, etc.)
- **Planner–Actor** agent framework
  - ToolGym supports and releases data signals for **both**:
    - **Planner**: deliberate reasoning, reflection, progress tracking, self-correction
    - **Actor**: step-wise tool retrieval, invocation, and execution
- **Data-efficient training (experiment)**: we show strong gains using only **1,170** curated training samples (this number refers to the *training subset used in our experiments*, not the full scale or upper bound of ToolGym as a data engine)

---

## What is ToolGym?

ToolGym is designed to close the gap between “clean” function-calling benchmarks and **messy real-world tool ecosystems**. It supports both:

- **Benchmarking**: stress-test agents on long, multi-tool workflows under constraints and failures
- **Data curation**: collect high-quality trajectories for training tool-using agents

---

## Core components

### 1) Tool universe (MCP)

We curate and validate a large library of production-like tools, then standardize them under a unified **Model Context Protocol (MCP)** interface so agents can call tools consistently across apps and servers.
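
To make the unified interface concrete, here is a minimal sketch of what one tool entry and a pre-call argument check could look like. The `name` / `description` / `inputSchema` fields follow the general shape of an MCP tool listing; the `get_forecast` tool, the `check_arguments` helper, and the checks it performs are illustrative assumptions, not ToolGym's actual validation code.

```python
from typing import Any

# Hypothetical tool descriptor. The field names follow the MCP tool-listing
# shape; the tool itself is invented for illustration.
WEATHER_TOOL: dict[str, Any] = {
    "name": "get_forecast",
    "description": "Return a short weather forecast for a city.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "days": {"type": "integer"},
        },
        "required": ["city"],
    },
}

# Mapping from JSON Schema primitive names to Python types.
_JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def check_arguments(tool: dict[str, Any], arguments: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call looks well-formed.

    Only required keys and flat primitive types are checked, not full JSON Schema.
    """
    schema = tool["inputSchema"]
    properties = schema.get("properties", {})
    problems = [
        f"missing required argument: {key}"
        for key in schema.get("required", [])
        if key not in arguments
    ]
    for key, value in arguments.items():
        if key not in properties:
            problems.append(f"unknown argument: {key}")
            continue
        type_name = properties[key].get("type")
        expected = _JSON_TYPES.get(type_name)
        if expected is None:
            continue  # type not covered by this sketch
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if not isinstance(value, expected) or is_stray_bool:
            problems.append(f"wrong type for {key}: expected {type_name}")
    return problems
```

For example, `check_arguments(WEATHER_TOOL, {"days": "3"})` reports both the missing `city` and the mistyped `days`, which is the kind of schema-compliance signal an agent harness can log before a call is dispatched.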

### 2) Tool retrieval index

Because tool selection is a core difficulty in open-world settings, ToolGym includes a retrieval layer: agents search the tool pool with natural-language queries and load relevant tools on demand.
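
The idea of searching tools by natural-language query can be sketched with a toy index. This example uses bag-of-words cosine similarity over tool names and descriptions purely for illustration; the `ToolIndex` class and its scoring are assumptions and say nothing about the retriever ToolGym actually ships.

```python
import math
import re
from collections import Counter


def _tokens(text: str) -> Counter:
    """Lowercase word counts; underscores and punctuation act as separators."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToolIndex:
    """Hypothetical in-memory index: maps tool name to a term-count vector."""

    def __init__(self, tools: dict[str, str]) -> None:
        self._vectors = {
            name: _tokens(f"{name} {description}")
            for name, description in tools.items()
        }

    def search(self, query: str, top_k: int = 3) -> list[str]:
        """Return up to top_k tool names with non-zero similarity, best first."""
        q = _tokens(query)
        scored = [(_cosine(q, vec), name) for name, vec in self._vectors.items()]
        # Sort by descending score, then by name so ties are deterministic.
        ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
        return [name for score, name in ranked[:top_k] if score > 0]


# Invented tool pool for the example.
INDEX = ToolIndex({
    "send_email": "Send an email message to a recipient",
    "get_forecast": "Get the weather forecast for a city",
    "create_event": "Create a calendar event with a date and time",
})
```

Calling `INDEX.search("weather in Paris tomorrow")` returns only `get_forecast`, so the agent would load one tool schema instead of the whole pool. A production retriever would more likely use dense embeddings, but the query-in, ranked-names-out contract stays the same.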

### 3) Task creation engine

ToolGym synthesizes **long-horizon, multi-tool workflows** that resemble real user requests:
- multi-step dependencies
- cross-app orchestration
- dense constraints (format, ordering, trade-offs, verification requirements, etc.)
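
One way to picture the three ingredients listed above is a task record with a dependency graph over steps plus named constraint checks on the final answer. The `Step` and `Task` classes, the example trip-planning task, and the constraint predicates below are all hypothetical; ToolGym's real task schema may look quite different.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable


@dataclass
class Step:
    name: str
    app: str                            # which app the step touches (cross-app orchestration)
    depends_on: tuple[str, ...] = ()    # names of steps that must finish first


@dataclass
class Task:
    request: str
    steps: list[Step]
    # Constraint label -> predicate over the final answer text.
    constraints: dict[str, Callable[[str], bool]] = field(default_factory=dict)

    def execution_order(self) -> list[str]:
        """Return one valid ordering of steps that respects all dependencies."""
        graph = {step.name: set(step.depends_on) for step in self.steps}
        return list(TopologicalSorter(graph).static_order())

    def violated(self, answer: str) -> list[str]:
        """Return labels of constraints the answer fails, in declaration order."""
        return [label for label, check in self.constraints.items() if not check(answer)]


# Invented example task spanning three apps.
TRIP = Task(
    request="Book a flight and hotel, then put the trip on my calendar.",
    steps=[
        Step("find_flight", app="travel"),
        Step("book_hotel", app="lodging", depends_on=("find_flight",)),
        Step("add_to_calendar", app="calendar", depends_on=("find_flight", "book_hotel")),
    ],
    constraints={
        "at most 50 words": lambda answer: len(answer.split()) <= 50,
        "mentions total cost": lambda answer: "total" in answer.lower(),
    },
)
```

Here `TRIP.execution_order()` yields the flight, hotel, calendar sequence, and `TRIP.violated("Booked.")` flags the missing cost summary. Keeping constraints as separate named predicates makes it easy to report which ones an agent missed.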

### 4) State Controller (robustness testing)

To go beyond “happy-path” evaluation, ToolGym introduces a controllable middleware that can inject:
- tool-level failures (timeouts, temporary unavailability)
- state-level drift (corrupted/delayed results, expired sessions)
- constraint changes mid-execution (updated preferences, shifting deadlines)
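
The tool-level failures in the list above can be sketched as a thin wrapper around a tool callable. This is an assumption-laden illustration: `FaultInjector`, `ToolTimeout`, `ToolUnavailable`, and `call_with_retry` are invented names, and the sketch covers only injected exceptions, not state drift or mid-execution constraint changes.

```python
import random
from typing import Any, Callable

ToolFn = Callable[..., Any]


class ToolTimeout(Exception):
    """Simulated timeout raised instead of calling the tool."""


class ToolUnavailable(Exception):
    """Simulated transient unavailability."""


class FaultInjector:
    """Hypothetical middleware: fails a configurable fraction of tool calls."""

    def __init__(self, timeout_rate: float = 0.0, unavailable_rate: float = 0.0,
                 seed: int = 0) -> None:
        if timeout_rate < 0 or unavailable_rate < 0 or timeout_rate + unavailable_rate > 1:
            raise ValueError("rates must be non-negative and sum to at most 1")
        self.timeout_rate = timeout_rate
        self.unavailable_rate = unavailable_rate
        self._rng = random.Random(seed)  # seeded so a failure pattern can be replayed

    def wrap(self, tool: ToolFn) -> ToolFn:
        def wrapped(*args: Any, **kwargs: Any) -> Any:
            roll = self._rng.random()  # uniform in [0, 1)
            if roll < self.timeout_rate:
                raise ToolTimeout("injected timeout")
            if roll < self.timeout_rate + self.unavailable_rate:
                raise ToolUnavailable("injected unavailability")
            return tool(*args, **kwargs)
        return wrapped


def call_with_retry(tool: ToolFn, *args: Any, attempts: int = 3, **kwargs: Any) -> Any:
    """Agent-side recovery: retry on injected faults, re-raise the last one."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return tool(*args, **kwargs)
        except (ToolTimeout, ToolUnavailable):
            if attempt == attempts - 1:
                raise
```

Wrapping a tool with `FaultInjector(timeout_rate=0.2, seed=7).wrap(tool)` would fail roughly one call in five, and reusing the seed replays the same failure sequence, which helps when comparing agents under identical conditions.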

### 5) Evaluation protocol

ToolGym evaluates agents on multiple axes, including:
- **Answer quality** (completeness, grounding)
- **Robustness** (schema compliance, recovery, flexibility)
- **Constraint following** (format + other constraints)
- **Planning** (goal decomposition, progress tracking, efficiency)
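
As a rough illustration of how per-criterion judgments might roll up into the four axes above, the sketch below averages sub-scores within each axis. The axis and criterion names mirror the list; the 0 to 1 scale, the unweighted mean, and the `axis_scores` helper are assumptions, since the scoring formula is not specified here.

```python
from statistics import fmean

# Axis -> sub-criteria, mirroring the list above. Identifier spellings are
# hypothetical; only the grouping comes from the text.
AXES: dict[str, tuple[str, ...]] = {
    "answer_quality": ("completeness", "grounding"),
    "robustness": ("schema_compliance", "recovery", "flexibility"),
    "constraint_following": ("format", "other_constraints"),
    "planning": ("goal_decomposition", "progress_tracking", "efficiency"),
}


def axis_scores(sub_scores: dict[str, float]) -> dict[str, float]:
    """Average sub-scores (assumed to lie in [0, 1]) within each axis.

    The unweighted mean is an assumption made for this sketch.
    """
    missing = [name for names in AXES.values() for name in names if name not in sub_scores]
    if missing:
        raise KeyError(f"missing sub-scores: {missing}")
    return {
        axis: fmean([sub_scores[name] for name in names])
        for axis, names in AXES.items()
    }
```

Reporting the axes separately, rather than collapsing them into one number, keeps it visible when an agent answers well but recovers poorly from failures.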

### 6) Planner–Actor decomposition

To better handle long-horizon objectives and error-prone tool ecosystems, we separate agent behavior into:
- **Planner**: global reasoning & self-correction (keeps the agent aligned over long trajectories)
- **Actor**: efficient step-by-step execution (retrieval → tool call → observe → iterate)
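
The split described above can be sketched as a small control loop in which the planner owns the list of subgoals and the actor performs one retrieve, call, observe step at a time. Everything here is a simplified assumption: in a real agent both roles would be driven by a language model, and `Planner`, `act`, and `run` are invented names rather than ToolGym APIs.

```python
from collections import deque
from typing import Callable, Optional

Tool = Callable[[str], str]


class Planner:
    """Tracks progress over subgoals and decides what to attempt next."""

    def __init__(self, subgoals: list[str]) -> None:
        self.pending = deque(subgoals)
        self.done: list[tuple[str, str]] = []  # (subgoal, observation) pairs

    def next_subgoal(self) -> Optional[str]:
        return self.pending[0] if self.pending else None

    def record(self, subgoal: str, observation: str, ok: bool) -> None:
        # On success, advance. On failure, keep the subgoal at the front so the
        # next round retries it (a stand-in for real self-correction).
        if ok:
            self.pending.popleft()
            self.done.append((subgoal, observation))


def act(subgoal: str, retrieve: Callable[[str], str], tools: dict[str, Tool]) -> tuple[str, bool]:
    """One actor step: retrieve a tool name, call the tool, return the observation."""
    name = retrieve(subgoal)
    try:
        return tools[name](subgoal), True
    except Exception as error:  # any tool failure becomes an observation
        return f"{name} failed: {error}", False


def run(planner: Planner, retrieve: Callable[[str], str], tools: dict[str, Tool],
        max_rounds: int = 10) -> int:
    """Alternate planner and actor until subgoals run out or the budget is spent."""
    rounds = 0
    while rounds < max_rounds:
        subgoal = planner.next_subgoal()
        if subgoal is None:
            break
        observation, ok = act(subgoal, retrieve, tools)
        planner.record(subgoal, observation, ok)
        rounds += 1
    return rounds
```

`run` returns the number of tool-use rounds consumed, so a tool that fails once before succeeding costs one extra round. The `max_rounds` budget keeps a persistently failing subgoal from looping forever.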

---

## Leaderboard

We maintain a public leaderboard for ToolGym.
➡️ **[Leaderboard link](https://huggingface.co/spaces/ToolGym/leaderboard)**

---

## License

- This organization and its public repos are released under the **MIT** license unless otherwise specified in each repo.

---

## Contributing

Community contributions are welcome:
- Open a discussion: `/datasets/ToolGym/ToolGym/discussions`
- Submit PRs to the relevant repo (dataset / code / leaderboard Space)

---

## Contact

For questions, collaborations, or leaderboard submissions, please open an issue/discussion or contact the maintainers via the links above.