ToolGym
ToolGym is an open-world tool-using environment for scalable agent testing and data curation.
Large tool pools • long-horizon workflows • wild constraints • unreliable tool states
Key highlights
- 5,571 validated tools (unified in MCP format)
- 204 real-world apps covered, drawn from 276 MCP servers
- Long-horizon, constraint-dense tasks
- 28.5 tool-use rounds per task on average (across evaluated models)
- A State Controller that injects realistic failures & drift
(timeouts, rate limits, transient unavailability, etc.)
- Planner–Actor agent framework
- ToolGym supports both roles and releases data signals for each:
- Planner: deliberate reasoning, reflection, progress tracking, self-correction
- Actor: step-wise tool retrieval, invocation, and execution
- Data-efficient training (experiment): we show strong gains using only 1,170 curated training samples
(this is the training subset used in our experiments, not the full scale or upper bound of ToolGym as a data engine)
What is ToolGym?
ToolGym is designed to close the gap between “clean” function-calling benchmarks and messy real-world tool ecosystems. It supports both:
- Benchmarking: stress-test agents on long, multi-tool workflows under constraints and failures
- Data curation: collect high-quality trajectories for training tool-using agents
Core components
1) Tool universe (MCP)
We curate and validate a large library of production-like tools, then standardize them under a unified Model Context Protocol (MCP) interface so agents can call tools consistently across apps and servers.
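As a rough illustration of what "a unified MCP interface" buys the agent, the sketch below models a tool record whose inputs are described in JSON-Schema style. The field names (`server`, `app`, `input_schema`) and the example tool are assumptions for illustration, not ToolGym's actual schema:

```python
# Illustrative tool record (field names are assumptions, not ToolGym's
# actual schema): each tool carries a JSON-Schema-style description of
# its arguments so agents can call it consistently across apps/servers.
from dataclasses import dataclass, field

@dataclass
class ToolSpec:
    server: str        # MCP server that hosts the tool
    app: str           # real-world app the tool belongs to
    name: str          # unique tool identifier
    description: str   # natural-language summary, also used for retrieval
    input_schema: dict = field(default_factory=dict)  # JSON Schema of arguments

# Hypothetical example tool
search_flights = ToolSpec(
    server="travel-mcp",
    app="FlightSearch",
    name="search_flights",
    description="Search one-way flights between two airports on a given date.",
    input_schema={
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "date": {"type": "string", "format": "date"},
        },
        "required": ["origin", "destination", "date"],
    },
)
```

Standardizing every tool into one record shape like this is what lets a single agent loop call 5,571 tools without per-app glue code.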
2) Tool retrieval index
Because open-world tool selection is the real challenge, ToolGym includes a retrieval layer so agents can search tools using natural language queries and load relevant tools on demand.
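A minimal sketch of that retrieval step, using bag-of-words cosine similarity as a stand-in for whatever embedding model ToolGym actually uses (the function names and the toy catalog are assumptions):

```python
# Natural-language tool retrieval, sketched with bag-of-words cosine
# similarity (a real system would use learned embeddings).
import math
from collections import Counter

def _vec(text: str) -> Counter:
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_tools(query: str, tools: dict, k: int = 3) -> list:
    """Return names of the k tools whose descriptions best match the query."""
    q = _vec(query)
    ranked = sorted(tools, key=lambda n: _cosine(q, _vec(tools[n])), reverse=True)
    return ranked[:k]

# Hypothetical mini-catalog: tool name -> description
catalog = {
    "search_flights": "search one-way flights between two airports on a date",
    "book_hotel": "reserve a hotel room in a city for given dates",
    "send_email": "send an email message to a recipient",
}
```

With thousands of tools, the agent never sees the full catalog; it issues a query like `retrieve_tools("find flights between airports", catalog, k=1)` and loads only the top matches into context.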
3) Task creation engine
ToolGym synthesizes long-horizon, multi-tool workflows that resemble real user requests:
- multi-step dependencies
- cross-app orchestration
- dense constraints (format, ordering, trade-offs, verification requirements, etc.)
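One plausible shape for such a synthesized task is sketched below; the field names, the example goal, and the topological-order helper are illustrative assumptions, not ToolGym's task format:

```python
# Hypothetical task record: steps reference each other to encode
# multi-step dependencies, and constraints travel with the task so
# they can be checked at evaluation time.
task = {
    "goal": "Book the cheapest nonstop flight, then email the itinerary.",
    "steps": [
        {"id": "s1", "tool": "search_flights", "depends_on": []},
        {"id": "s2", "tool": "book_flight",    "depends_on": ["s1"]},
        {"id": "s3", "tool": "send_email",     "depends_on": ["s2"]},
    ],
    "constraints": ["nonstop only", "itinerary as a bulleted list"],
}

def execution_order(steps):
    """Topologically order steps so every dependency runs first."""
    done, order, pending = set(), [], list(steps)
    while pending:
        ready = [s for s in pending if set(s["depends_on"]) <= done]
        if not ready:
            raise ValueError("cyclic dependencies")
        for s in ready:
            done.add(s["id"])
            order.append(s["id"])
            pending.remove(s)
    return order
```

Encoding dependencies explicitly is what makes the workflows genuinely long-horizon: a step cannot be faked or skipped without breaking a downstream step.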
4) State Controller (robustness testing)
To go beyond “happy-path” evaluation, ToolGym introduces a controllable middleware that can inject:
- tool-level failures (timeouts, temporary unavailability)
- state-level drift (corrupted/delayed results, expired sessions)
- constraint changes mid-execution (updated preferences, shifting deadlines)
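The mechanics of such injection can be sketched as a wrapper around any tool function. The real State Controller's API is not shown here; `failure_rate`, the exception type, and the decorator shape are all assumptions:

```python
# Failure-injecting middleware sketch: wrap a tool so a fraction of
# calls raise a transient timeout, forcing the agent to recover.
import random

class ToolTimeout(Exception):
    pass

def with_state_controller(tool_fn, failure_rate=0.2, rng=None):
    """Wrap a tool so roughly `failure_rate` of calls time out."""
    rng = rng or random.Random()
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ToolTimeout(f"{tool_fn.__name__} timed out (injected)")
        return tool_fn(*args, **kwargs)
    return wrapped

def echo(x):
    return x

flaky_echo = with_state_controller(echo, failure_rate=1.0)   # always fails
stable_echo = with_state_controller(echo, failure_rate=0.0)  # never fails
```

Because the wrapper sits between the agent and the tool, failures look indistinguishable from real outages, so agents must learn retry/fallback behavior rather than memorizing happy paths.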
5) Evaluation protocol
ToolGym evaluates agents on multiple axes, including:
- Answer quality (completeness, grounding)
- Robustness (schema compliance, recovery, flexibility)
- Constraint following (format + other constraints)
- Planning (goal decomposition, progress tracking, efficiency)
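For intuition, the axes above can be combined into a single score; the equal-weight average and the axis keys below are assumptions for illustration, not ToolGym's official metric:

```python
# Illustrative aggregation over the four evaluation axes
# (equal weights are an assumption, not the official metric).
AXES = ("answer_quality", "robustness", "constraint_following", "planning")

def overall_score(scores: dict) -> float:
    """Average per-axis scores in [0, 1]; a missing axis counts as 0."""
    return sum(scores.get(a, 0.0) for a in AXES) / len(AXES)

# Hypothetical per-run scores
run = {"answer_quality": 0.9, "robustness": 0.6,
       "constraint_following": 1.0, "planning": 0.7}
```

Keeping the axes separate in reports (rather than only the aggregate) is what lets the benchmark distinguish an agent that answers well but breaks under failures from one that is robust but sloppy with constraints.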
6) Planner–Actor decomposition
To better handle long-horizon objectives and error-prone tool ecosystems, we separate agent behavior into:
- Planner: global reasoning & self-correction (keeps the agent aligned over long trajectories)
- Actor: efficient step-by-step execution (retrieval → tool call → observe → iterate)
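The control flow of that split can be sketched as below; both roles are stubs (in ToolGym each would be backed by a model), and the function names and toy tools are assumptions:

```python
# Planner–Actor loop sketch: the Planner tracks remaining subgoals,
# the Actor executes one tool step per subgoal.
def planner(goal, progress):
    """Return the next subgoal, or None when the goal is complete."""
    remaining = [g for g in goal if g not in progress]
    return remaining[0] if remaining else None

def actor(subgoal, tools):
    """Retrieve and invoke the tool for one subgoal (stubbed as a lookup)."""
    return tools[subgoal](subgoal)

def run_agent(goal, tools):
    progress, trace = set(), []
    while (sub := planner(goal, progress)) is not None:
        trace.append(actor(sub, tools))
        progress.add(sub)  # Planner re-plans from updated progress each round
    return trace

# Hypothetical tools keyed by subgoal
tools = {"search": lambda s: f"did:{s}", "book": lambda s: f"did:{s}"}
```

The key design choice is that only the Planner sees global progress, so a failed or drifting tool call changes the next plan rather than silently corrupting a monolithic trajectory.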
Leaderboard
We maintain a public leaderboard for ToolGym.
➡️ Leaderboard link
License
- This organization and its public repos are released under the MIT license unless otherwise specified in each repo.
Contributing
Community contributions are welcome:
- Open a discussion:
/datasets/ToolGym/ToolGym/discussions
- Submit PRs to the relevant repo (dataset / code / leaderboard Space)
Contact
For questions, collaborations, or leaderboard submissions, please open an issue/discussion or contact the maintainers via the links above.