ToolGym
ToolGym is an open-world tool-using environment for scalable agent testing and data curation.
Large tool pools • long-horizon workflows • wild constraints • unreliable tool states
Key highlights
- 5,571 validated tools (unified in MCP format)
- 204 real-world apps covered, drawn from 276 MCP servers
- Long-horizon, constraint-dense tasks
- 28.5 tool-use rounds per task on average (across evaluated models)
- A State Controller that injects realistic failures & drift
(timeouts, rate limits, transient unavailability, etc.)
- Planner–Actor agent framework
- ToolGym supports both roles and releases data signals for each:
- Planner: deliberate reasoning, reflection, progress tracking, self-correction
- Actor: step-wise tool retrieval, invocation, and execution
- Data-efficient training (experiment): we show strong gains using only 1,170 curated training samples
(this is the training subset used in our experiments, not the full scale or upper bound of ToolGym as a data engine)
What is ToolGym?
ToolGym is designed to close the gap between “clean” function-calling benchmarks and messy real-world tool ecosystems. It supports both:
- Benchmarking: stress-test agents on long, multi-tool workflows under constraints and failures
- Data curation: collect high-quality trajectories for training tool-using agents
Core components
1) Tool universe (MCP)
We curate and validate a large library of production-like tools, then standardize them under a unified Model Context Protocol (MCP) interface so agents can call tools consistently across apps and servers.
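As a rough illustration of what "a unified MCP interface" buys the agent, the sketch below models a tool record whose inputs are described in JSON-Schema style. The field names (`server`, `app`, `input_schema`) and the example tool are assumptions for illustration, not ToolGym's actual schema:

```python
# Illustrative tool record (field names are assumptions, not ToolGym's
# actual schema): each tool carries a JSON-Schema-style description of
# its arguments so agents can call it consistently across apps/servers.
from dataclasses import dataclass, field

@dataclass
class ToolSpec:
    server: str        # MCP server that hosts the tool
    app: str           # real-world app the tool belongs to
    name: str          # unique tool identifier
    description: str   # natural-language summary, also used for retrieval
    input_schema: dict = field(default_factory=dict)  # JSON Schema of arguments

# Hypothetical example tool
search_flights = ToolSpec(
    server="travel-mcp",
    app="FlightSearch",
    name="search_flights",
    description="Search one-way flights between two airports on a given date.",
    input_schema={
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "date": {"type": "string", "format": "date"},
        },
        "required": ["origin", "destination", "date"],
    },
)
```

Standardizing every tool into one record shape like this is what lets a single agent loop call 5,571 tools without per-app glue code.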
2) Tool retrieval index
Because open-world tool selection is the real challenge, ToolGym includes a retrieval layer so agents can search tools using natural language queries and load relevant tools on demand.
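A minimal sketch of that retrieval step, using bag-of-words cosine similarity as a stand-in for whatever embedding model ToolGym actually uses (the function names and the toy catalog are assumptions):

```python
# Natural-language tool retrieval, sketched with bag-of-words cosine
# similarity (a real system would use learned embeddings).
import math
from collections import Counter

def _vec(text: str) -> Counter:
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_tools(query: str, tools: dict, k: int = 3) -> list:
    """Return names of the k tools whose descriptions best match the query."""
    q = _vec(query)
    ranked = sorted(tools, key=lambda n: _cosine(q, _vec(tools[n])), reverse=True)
    return ranked[:k]

# Hypothetical mini-catalog: tool name -> description
catalog = {
    "search_flights": "search one-way flights between two airports on a date",
    "book_hotel": "reserve a hotel room in a city for given dates",
    "send_email": "send an email message to a recipient",
}
```

With thousands of tools, the agent never sees the full catalog; it issues a query like `retrieve_tools("find flights between airports", catalog, k=1)` and loads only the top matches into context.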
3) Task creation engine
ToolGym synthesizes long-horizon, multi-tool workflows that resemble real user requests:
- multi-step dependencies
- cross-app orchestration
- dense constraints (format, ordering, trade-offs, verification requirements, etc.)
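One plausible shape for such a synthesized task is sketched below; the field names, the example goal, and the topological-order helper are illustrative assumptions, not ToolGym's task format:

```python
# Hypothetical task record: steps reference each other to encode
# multi-step dependencies, and constraints travel with the task so
# they can be checked at evaluation time.
task = {
    "goal": "Book the cheapest nonstop flight, then email the itinerary.",
    "steps": [
        {"id": "s1", "tool": "search_flights", "depends_on": []},
        {"id": "s2", "tool": "book_flight",    "depends_on": ["s1"]},
        {"id": "s3", "tool": "send_email",     "depends_on": ["s2"]},
    ],
    "constraints": ["nonstop only", "itinerary as a bulleted list"],
}

def execution_order(steps):
    """Topologically order steps so every dependency runs first."""
    done, order, pending = set(), [], list(steps)
    while pending:
        ready = [s for s in pending if set(s["depends_on"]) <= done]
        if not ready:
            raise ValueError("cyclic dependencies")
        for s in ready:
            done.add(s["id"])
            order.append(s["id"])
            pending.remove(s)
    return order
```

Encoding dependencies explicitly is what makes the workflows genuinely long-horizon: a step cannot be faked or skipped without breaking a downstream step.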
4) State Controller (robustness testing)
To go beyond “happy-path” evaluation, ToolGym introduces a controllable middleware that can inject:
- tool-level failures (timeouts, temporary unavailability)
- state-level drift (corrupted/delayed results, expired sessions)
- constraint changes mid-execution (updated preferences, shifting deadlines)
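The mechanics of such injection can be sketched as a wrapper around any tool function. The real State Controller's API is not shown here; `failure_rate`, the exception type, and the decorator shape are all assumptions:

```python
# Failure-injecting middleware sketch: wrap a tool so a fraction of
# calls raise a transient timeout, forcing the agent to recover.
import random

class ToolTimeout(Exception):
    pass

def with_state_controller(tool_fn, failure_rate=0.2, rng=None):
    """Wrap a tool so roughly `failure_rate` of calls time out."""
    rng = rng or random.Random()
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ToolTimeout(f"{tool_fn.__name__} timed out (injected)")
        return tool_fn(*args, **kwargs)
    return wrapped

def echo(x):
    return x

flaky_echo = with_state_controller(echo, failure_rate=1.0)   # always fails
stable_echo = with_state_controller(echo, failure_rate=0.0)  # never fails
```

Because the wrapper sits between the agent and the tool, failures look indistinguishable from real outages, so agents must learn retry/fallback behavior rather than memorizing happy paths.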
5) Evaluation protocol
ToolGym evaluates agents on multiple axes, including:
- Answer quality (completeness, grounding)
- Robustness (schema compliance, recovery, flexibility)
- Constraint following (format + other constraints)
- Planning (goal decomposition, progress tracking, efficiency)
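For intuition, the axes above can be combined into a single score; the equal-weight average and the axis keys below are assumptions for illustration, not ToolGym's official metric:

```python
# Illustrative aggregation over the four evaluation axes
# (equal weights are an assumption, not the official metric).
AXES = ("answer_quality", "robustness", "constraint_following", "planning")

def overall_score(scores: dict) -> float:
    """Average per-axis scores in [0, 1]; a missing axis counts as 0."""
    return sum(scores.get(a, 0.0) for a in AXES) / len(AXES)

# Hypothetical per-run scores
run = {"answer_quality": 0.9, "robustness": 0.6,
       "constraint_following": 1.0, "planning": 0.7}
```

Keeping the axes separate in reports (rather than only the aggregate) is what lets the benchmark distinguish an agent that answers well but breaks under failures from one that is robust but sloppy with constraints.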
6) Planner–Actor decomposition
To better handle long-horizon objectives and error-prone tool ecosystems, we separate agent behavior into:
- Planner: global reasoning & self-correction (keeps the agent aligned over long trajectories)
- Actor: efficient step-by-step execution (retrieval → tool call → observe → iterate)
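The control flow of that split can be sketched as below; both roles are stubs (in ToolGym each would be backed by a model), and the function names and toy tools are assumptions:

```python
# Planner–Actor loop sketch: the Planner tracks remaining subgoals,
# the Actor executes one tool step per subgoal.
def planner(goal, progress):
    """Return the next subgoal, or None when the goal is complete."""
    remaining = [g for g in goal if g not in progress]
    return remaining[0] if remaining else None

def actor(subgoal, tools):
    """Retrieve and invoke the tool for one subgoal (stubbed as a lookup)."""
    return tools[subgoal](subgoal)

def run_agent(goal, tools):
    progress, trace = set(), []
    while (sub := planner(goal, progress)) is not None:
        trace.append(actor(sub, tools))
        progress.add(sub)  # Planner re-plans from updated progress each round
    return trace

# Hypothetical tools keyed by subgoal
tools = {"search": lambda s: f"did:{s}", "book": lambda s: f"did:{s}"}
```

The key design choice is that only the Planner sees global progress, so a failed or drifting tool call changes the next plan rather than silently corrupting a monolithic trajectory.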
Leaderboard
We maintain a public leaderboard for ToolGym.
➡️ Leaderboard link
License
- This organization and its public repos are released under the MIT license unless otherwise specified in each repo.
Contributing
Community contributions are welcome:
- Open a discussion:
/datasets/ToolGym/ToolGym/discussions
- Submit PRs to the relevant repo (dataset / code / leaderboard Space)
Contact
For questions, collaborations, or leaderboard submissions, please open an issue/discussion or contact the maintainers via the links above.