---
sdk: static
---

# ToolGym

**ToolGym** is an **open-world tool-using environment** for *scalable agent testing and data curation*.

> Large tool pools • long-horizon workflows • wild constraints • unreliable tool states

---

## Quick links

- 🏆 **[Leaderboard](https://huggingface.co/spaces/ToolGym/leaderboard)**
- 📦 **Dataset(s)**: `/datasets/ToolGym/ToolGym`
- 📄 **[Paper](https://arxiv.org/abs/2601.06328)**
- 💻 **[Code](https://github.com/Ziqiao-git/ToolGym)**

---

## Key highlights

- **5,571** validated tools (unified in **MCP format**)
- **204** real-world apps covered, from **276** MCP servers
- Long-horizon, constraint-dense tasks
- Avg. **28.5** tool-use rounds per task (averaged across evaluated models)
- A **State Controller** that injects realistic failures & drift (timeouts, rate limits, transient unavailability, etc.)
- **Planner–Actor** agent framework: ToolGym supports and releases data signals for **both**
  - **Planner**: deliberate reasoning, reflection, progress tracking, self-correction
  - **Actor**: step-wise tool retrieval, invocation, and execution
- **Data-efficient training (experiment)**: we show strong gains using only **1,170** curated training samples (this is the *training subset used in our experiments*, not the full scale of ToolGym as a data engine)

---

## What is ToolGym?

ToolGym is designed to close the gap between "clean" function-calling benchmarks and **messy real-world tool ecosystems**. It supports both:

- **Benchmarking**: stress-test agents on long, multi-tool workflows under constraints and failures
- **Data curation**: collect high-quality trajectories for training tool-using agents

---

## Core components

### 1) Tool universe (MCP)

We curate and validate a large library of production-like tools, then standardize them under a unified **Model Context Protocol (MCP)** interface so agents can call tools consistently across apps and servers.
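As a rough illustration of what "unified under MCP" buys an agent, here is a minimal sketch of a tool record with a consistent calling convention. All field and class names (`MCPTool`, `qualified_name`, the `travel-mcp` server) are hypothetical assumptions for this example, not ToolGym's actual schema or API.

```python
# Hypothetical sketch of a unified MCP-style tool record.
# Field names are illustrative, not ToolGym's actual schema.
from dataclasses import dataclass, field

@dataclass
class MCPTool:
    server: str       # MCP server that hosts the tool
    app: str          # real-world app the tool belongs to
    name: str         # tool identifier, unique within its server
    description: str  # natural-language summary (useful for retrieval)
    input_schema: dict = field(default_factory=dict)  # JSON-Schema-style args

    def qualified_name(self) -> str:
        # One consistent addressing convention across apps and servers.
        return f"{self.server}/{self.name}"

# Example: the agent sees the same interface regardless of the underlying app.
search_flights = MCPTool(
    server="travel-mcp",
    app="FlightSearch",
    name="search_flights",
    description="Find flights between two airports on a given date.",
    input_schema={
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "dest": {"type": "string"},
            "date": {"type": "string"},
        },
        "required": ["origin", "dest", "date"],
    },
)
print(search_flights.qualified_name())  # travel-mcp/search_flights
```

The point of the standardization is that retrieval, invocation, and logging can all key off one uniform record shape instead of 276 server-specific formats.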
### 2) Tool retrieval index

Because open-world tool selection is the real challenge, ToolGym includes a retrieval layer so agents can search tools using natural-language queries and load relevant tools on demand.

### 3) Task creation engine

ToolGym synthesizes **long-horizon, multi-tool workflows** that resemble real user requests:

- multi-step dependencies
- cross-app orchestration
- dense constraints (format, ordering, trade-offs, verification requirements, etc.)

### 4) State Controller (robustness testing)

To go beyond "happy-path" evaluation, ToolGym introduces a controllable middleware that can inject:

- tool-level failures (timeouts, temporary unavailability)
- state-level drift (corrupted/delayed results, expired sessions)
- constraint changes mid-execution (updated preferences, shifting deadlines)

### 5) Evaluation protocol

ToolGym evaluates agents on multiple axes, including:

- **Answer quality** (completeness, grounding)
- **Robustness** (schema compliance, recovery, flexibility)
- **Constraint following** (format + other constraints)
- **Planning** (goal decomposition, progress tracking, efficiency)

### 6) Planner–Actor decomposition

To better handle long-horizon objectives and error-prone tool ecosystems, we separate agent behavior into:

- **Planner**: global reasoning & self-correction (keeps the agent aligned over long trajectories)
- **Actor**: efficient step-by-step execution (retrieval → tool call → observe → iterate)

---

## Leaderboard

We maintain a public leaderboard for ToolGym.

➡️ **[Leaderboard link](https://huggingface.co/spaces/ToolGym/leaderboard)**

---

## License

- This organization and its public repos are released under the **MIT** license unless otherwise specified in each repo.
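To make the Actor loop (retrieval → tool call → observe → iterate) and its interaction with an unreliable environment concrete, here is a toy, self-contained sketch. Every name in it (`retrieve_tools`, `call_tool`, `actor_step`) is a hypothetical stand-in, not the ToolGym API; the simulated timeouts only mimic what the State Controller injects.

```python
# Toy sketch of the Actor loop under unreliable tools. All function names
# are hypothetical stand-ins, not the ToolGym API; failures are simulated
# to mimic State-Controller-style injection.
import random

def retrieve_tools(query, index):
    """Toy retrieval: rank tools by word overlap between query and description."""
    words = set(query.lower().split())
    return sorted(index, key=lambda t: -len(words & set(t["description"].lower().split())))

def call_tool(tool, args, fail_rate=0.3, rng=random.Random(0)):
    """Simulate an unreliable tool that sometimes times out."""
    if rng.random() < fail_rate:
        return {"ok": False, "error": "timeout"}
    return {"ok": True, "result": f"{tool['name']}({args})"}

def actor_step(subgoal, index, max_retries=3):
    """One Actor iteration: retrieve -> call -> observe, retrying on failure."""
    tool = retrieve_tools(subgoal, index)[0]
    for _ in range(max_retries):
        obs = call_tool(tool, subgoal)
        if obs["ok"]:
            return obs["result"]  # success: observation goes back to the Planner
    return None  # persistent failure: the Planner can re-plan or self-correct

index = [
    {"name": "search_flights", "description": "find flights between airports"},
    {"name": "send_email", "description": "send an email message"},
]
print(actor_step("find flights from SFO to JFK", index))
```

The design point is the division of labor: the Actor only handles local retrieval and retries, while anything it cannot recover from (the `None` path) is escalated to the Planner, which holds the global goal and can re-plan.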
---

## Contributing

Community contributions are welcome:

- Open a discussion: `/datasets/ToolGym/ToolGym/discussions`
- Submit PRs to the relevant repo (dataset / code / leaderboard Space)

---

## Contact

For questions, collaborations, or leaderboard submissions, please open an issue/discussion or contact the maintainers via the links above.