---
sdk: static
---
# ToolGym
**ToolGym** is an **open-world tool-using environment** for *scalable agent testing and data curation*.
> Large tool pools • long-horizon workflows • wild constraints • unreliable tool states
---
## Quick links
- 🏆 **[Leaderboard](https://huggingface.co/spaces/ToolGym/leaderboard)**
- 📦 **Dataset(s)**: `/datasets/ToolGym/ToolGym`
- 📄 **[Paper](https://arxiv.org/abs/2601.06328)**
- 💻 **[Code](https://github.com/Ziqiao-git/ToolGym)**
---
## Key highlights
- **5,571** validated tools (unified in **MCP format**)
- **204** real-world apps covered across **276** MCP servers
- Long-horizon, constraint-dense tasks
- Avg. **28.5** tool-use rounds per task (**averaged across evaluated models**)
- A **State Controller** that injects realistic failures & drift
(timeouts, rate limits, transient unavailability, etc.)
- **Planner–Actor** agent framework
- ToolGym supports and releases data signals for **both**:
- **Planner**: deliberate reasoning, reflection, progress tracking, self-correction
- **Actor**: step-wise tool retrieval, invocation, and execution
- **Data-efficient training (experiment)**: we show strong gains using only **1,170** curated training samples
  (this is the *training subset used in our experiments*, not the full scale or upper bound of ToolGym as a data engine)
---
## What is ToolGym?
ToolGym is designed to close the gap between “clean” function-calling benchmarks and **messy real-world tool ecosystems**. It supports both:
- **Benchmarking**: stress-test agents on long, multi-tool workflows under constraints and failures
- **Data curation**: collect high-quality trajectories for training tool-using agents
---
## Core components
### 1) Tool universe (MCP)
We curate and validate a large library of production-like tools, then standardize them under a unified **Model Context Protocol (MCP)** interface so agents can call tools consistently across apps and servers.
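As a rough illustration, a unified tool descriptor in this spirit might carry a name, its MCP server, the underlying app, and a JSON-Schema-style input spec. The field names below are hypothetical, not ToolGym's actual schema:

```python
# Illustrative sketch of an MCP-style unified tool descriptor.
# All field names are assumptions for illustration, not ToolGym's schema.
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    name: str                 # unique tool identifier
    server: str               # MCP server the tool is hosted on
    app: str                  # real-world app the tool wraps
    description: str          # natural-language summary (used for retrieval)
    input_schema: dict = field(default_factory=dict)  # JSON-Schema-style args


weather = ToolSpec(
    name="get_forecast",
    server="weather-mcp",
    app="OpenWeather",
    description="Return a multi-day weather forecast for a city.",
    input_schema={
        "type": "object",
        "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
        "required": ["city"],
    },
)
```

Standardizing every tool behind one descriptor like this is what lets a single agent loop call tools from 276 different servers without per-app glue code.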
### 2) Tool retrieval index
Because open-world tool selection is the real challenge, ToolGym includes a retrieval layer so agents can search tools using natural language queries and load relevant tools on demand.
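A minimal sketch of such a retrieval layer, using simple token-overlap cosine similarity over tool descriptions (a real index would use dense embeddings; the tool entries here are made up):

```python
# Toy natural-language tool retrieval over descriptions.
# Real retrieval layers would use dense embeddings; token overlap keeps
# this sketch dependency-free.
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    # Bag-of-words representation of a string.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, tools, k=3):
    # Rank tools by similarity between the query and each description.
    q = tokenize(query)
    ranked = sorted(tools, key=lambda t: cosine(q, tokenize(t["description"])),
                    reverse=True)
    return ranked[:k]


tools = [
    {"name": "get_forecast", "description": "multi-day weather forecast for a city"},
    {"name": "send_email", "description": "send an email message to a recipient"},
    {"name": "book_flight", "description": "search and book airline flights"},
]
top = retrieve("what is the weather in Paris this week", tools, k=1)
```

With 5,571 tools, loading everything into context is infeasible, so on-demand retrieval like this is what makes the open-world setting tractable.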
### 3) Task creation engine
ToolGym synthesizes **long-horizon, multi-tool workflows** that resemble real user requests:
- multi-step dependencies
- cross-app orchestration
- dense constraints (format, ordering, trade-offs, verification requirements, etc.)
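A synthesized task of this shape can be pictured as a small DAG of steps plus explicit constraints. The sketch below is purely illustrative (hypothetical task and step names, not ToolGym's task format) and uses the stdlib to recover a valid execution order:

```python
# Hypothetical long-horizon task spec: steps with cross-app dependencies
# plus explicit constraints. Names are illustrative only.
from graphlib import TopologicalSorter

task = {
    "goal": "Plan a team offsite and email the itinerary",
    "constraints": ["budget under $2000", "itinerary as a markdown table"],
    # step -> set of steps it depends on
    "steps": {
        "search_venues": set(),
        "check_weather": set(),
        "pick_venue": {"search_venues", "check_weather"},
        "book_venue": {"pick_venue"},
        "email_itinerary": {"book_venue"},
    },
}

# Any valid execution order must respect the dependency edges.
order = list(TopologicalSorter(task["steps"]).static_order())
```

The dependency structure is what forces long horizons: an agent cannot email the itinerary until venue search, weather checks, and booking have all succeeded.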
### 4) State Controller (robustness testing)
To go beyond “happy-path” evaluation, ToolGym introduces a controllable middleware that can inject:
- tool-level failures (timeouts, temporary unavailability)
- state-level drift (corrupted/delayed results, expired sessions)
- constraint changes mid-execution (updated preferences, shifting deadlines)
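Conceptually, such a controller sits between the agent and every tool call. A toy version of the tool-level failure case (seeded for reproducibility; class and exception names are assumptions, not ToolGym's API) might look like:

```python
# Toy State-Controller-style middleware: wraps tool calls and injects
# timeouts at a configurable rate. Illustrative only, not ToolGym's API.
import random


class ToolTimeout(Exception):
    pass


class StateController:
    def __init__(self, failure_rate=0.2, seed=0):
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded for reproducible episodes

    def call(self, tool, *args, **kwargs):
        # Randomly inject a timeout instead of running the tool.
        if self.rng.random() < self.failure_rate:
            raise ToolTimeout(f"{tool.__name__} timed out (injected)")
        return tool(*args, **kwargs)


def get_forecast(city):
    return f"Sunny in {city}"


ctrl = StateController(failure_rate=0.5, seed=42)
results = []
for _ in range(4):
    try:
        results.append(ctrl.call(get_forecast, "Paris"))
    except ToolTimeout:
        results.append("FAILED")
```

Because failures are injected by middleware rather than by the tools themselves, the same task can be replayed under different reliability regimes to measure recovery behavior.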
### 5) Evaluation protocol
ToolGym evaluates agents on multiple axes, including:
- **Answer quality** (completeness, grounding)
- **Robustness** (schema compliance, recovery, flexibility)
- **Constraint following** (format + other constraints)
- **Planning** (goal decomposition, progress tracking, efficiency)
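Multi-axis scores are typically combined into a single headline number. A trivial sketch with made-up scores and weights (the axis names mirror the list above; the weighting scheme is an assumption, not ToolGym's protocol):

```python
# Toy aggregation of per-axis scores into one overall score.
# Scores and weights here are invented for illustration.
axis_scores = {
    "answer_quality": 0.82,
    "robustness": 0.64,
    "constraint_following": 0.71,
    "planning": 0.58,
}
weights = {
    "answer_quality": 0.4,
    "robustness": 0.2,
    "constraint_following": 0.2,
    "planning": 0.2,
}

overall = sum(axis_scores[a] * weights[a] for a in axis_scores)
```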
### 6) Planner–Actor decomposition
To better handle long-horizon objectives and error-prone tool ecosystems, we separate agent behavior into:
- **Planner**: global reasoning & self-correction (keeps the agent aligned over long trajectories)
- **Actor**: efficient step-by-step execution (retrieval → tool call → observe → iterate)
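The control flow of this decomposition can be sketched as a simple loop: the planner proposes the next subgoal from the remaining plan, the actor executes it and reports an observation, and the loop ends when the planner has nothing left to assign. Both functions below are stubs with invented subgoals, not ToolGym's interfaces:

```python
# Minimal Planner–Actor loop sketch (hypothetical interfaces and subgoals).

def planner(goal, history):
    # Global reasoning: pick the next uncompleted step of a fixed toy plan.
    plan = ["find venue", "book venue", "send invite"]
    done = {h["subgoal"] for h in history}
    for step in plan:
        if step not in done:
            return step
    return None  # goal complete


def actor(subgoal):
    # Step-wise execution: retrieve a tool, call it, return the observation.
    # Stubbed here; a real actor would do retrieval -> tool call -> observe.
    return {"subgoal": subgoal, "observation": f"completed: {subgoal}"}


history = []
while (subgoal := planner("organize meetup", history)) is not None:
    history.append(actor(subgoal))
```

Keeping the planner out of the per-step tool noise is what lets it track global progress and self-correct over trajectories averaging ~28 tool-use rounds.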
---
## Leaderboard
We maintain a public leaderboard for ToolGym.
➡️ **[Leaderboard link](https://huggingface.co/spaces/ToolGym/leaderboard)**
---
## License
- This organization and its public repos are released under the **MIT** license unless otherwise specified in each repo.
---
## Contributing
Community contributions are welcome:
- Open a discussion: `/datasets/ToolGym/ToolGym/discussions`
- Submit PRs to the relevant repo (dataset / code / leaderboard Space)
---
## Contact
For questions, collaborations, or leaderboard submissions, please open an issue/discussion or contact the maintainers via the links above.