ToolGym

ToolGym is an open-world tool-using environment for scalable agent testing and data curation.

Large tool pools • long-horizon workflows • wild constraints • unreliable tool states


Key highlights

  • 5,571 validated tools (unified in MCP format)
  • 204 real-world apps covered, from 276 MCP servers
  • Long-horizon, constraint-dense tasks
    • 28.5 tool-use rounds per task on average (across evaluated models)
  • A State Controller that injects realistic failures & drift
    (timeouts, rate limits, transient unavailability, etc.)
  • Planner–Actor agent framework
    • ToolGym supports and releases data signals for both:
      • Planner: deliberate reasoning, reflection, progress tracking, self-correction
      • Actor: step-wise tool retrieval, invocation, and execution
  • Data-efficient training (experiment): we show strong gains using only 1,170 curated training samples
    (this is the training subset used in our experiments, not the full scale of ToolGym as a data engine)

What is ToolGym?

ToolGym is designed to close the gap between “clean” function-calling benchmarks and messy real-world tool ecosystems. It supports both:

  • Benchmarking: stress-test agents on long, multi-tool workflows under constraints and failures
  • Data curation: collect high-quality trajectories for training tool-using agents

Core components

1) Tool universe (MCP)

We curate and validate a large library of production-like tools, then standardize them under a unified Model Context Protocol (MCP) interface so agents can call tools consistently across apps and servers.
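As a rough illustration of what a unified tool record might look like, here is a minimal sketch in Python. The field names (`server`, `app`, `input_schema`, etc.) and the example tool are assumptions for illustration, not ToolGym's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class MCPTool:
    """Hypothetical unified tool record; field names are illustrative."""
    server: str          # originating MCP server
    app: str             # real-world app the tool belongs to
    name: str            # tool identifier exposed to the agent
    description: str     # natural-language summary (also used for retrieval)
    input_schema: dict = field(default_factory=dict)  # JSON-Schema parameters

# Example: one tool from a hypothetical calendar server
create_event = MCPTool(
    server="calendar-mcp",
    app="Calendar",
    name="create_event",
    description="Create a calendar event with title, time, and attendees.",
    input_schema={
        "type": "object",
        "properties": {"title": {"type": "string"}, "start": {"type": "string"}},
        "required": ["title", "start"],
    },
)
```

Standardizing every tool into one record shape like this is what lets agents call tools consistently regardless of which server or app they came from.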

2) Tool retrieval index

Because open-world tool selection is the real challenge, ToolGym includes a retrieval layer so agents can search tools using natural language queries and load relevant tools on demand.
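The retrieval layer can be pictured with a toy ranker. This sketch uses simple word overlap as a stand-in for whatever scoring ToolGym actually uses (likely embedding-based); the function name and tool dicts are hypothetical.

```python
def retrieve_tools(query, tools, k=3):
    """Rank tools by word overlap between the query and each description.
    A toy stand-in for the real (likely embedding-based) retrieval layer."""
    q = set(query.lower().split())
    scored = sorted(
        tools,
        key=lambda t: len(q & set(t["description"].lower().split())),
        reverse=True,
    )
    return scored[:k]

tools = [
    {"name": "create_event", "description": "create a calendar event"},
    {"name": "send_email", "description": "send an email message"},
    {"name": "search_flights", "description": "search for flights by date"},
]
top = retrieve_tools("add an event to my calendar", tools, k=1)
# top[0]["name"] == "create_event"
```

The key design point is that the agent never sees all 5,571 tools at once; it queries in natural language and loads only the top-k matches into context.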

3) Task creation engine

ToolGym synthesizes long-horizon, multi-tool workflows that resemble real user requests:

  • multi-step dependencies
  • cross-app orchestration
  • dense constraints (format, ordering, trade-offs, verification requirements, etc.)
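A synthesized task combining all three properties above might look like the following sketch. The keys and wording are illustrative assumptions, not ToolGym's actual task format.

```python
# Hypothetical task spec showing the structure ToolGym tasks carry:
# multi-step dependencies, cross-app orchestration, and dense constraints.
task = {
    "goal": "Book the cheapest direct flight next Friday, then email the itinerary",
    "steps": [
        {"id": 1, "app": "Flights", "depends_on": []},
        {"id": 2, "app": "Email", "depends_on": [1]},   # cross-app dependency
    ],
    "constraints": [
        "direct flights only",            # trade-off constraint
        "itinerary as a bulleted list",   # format constraint
        "confirm price before emailing",  # verification requirement
    ],
}
```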

4) State Controller (robustness testing)

To go beyond “happy-path” evaluation, ToolGym introduces a controllable middleware that can inject:

  • tool-level failures (timeouts, temporary unavailability)
  • state-level drift (corrupted/delayed results, expired sessions)
  • constraint changes mid-execution (updated preferences, shifting deadlines)
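The middleware idea can be sketched as a wrapper around tool calls that randomly injects failures. The class name, failure modes, and response shape below are assumptions for illustration.

```python
import random

class StateController:
    """Toy middleware that wraps tool calls and injects failures with a
    configurable probability (names and behavior are assumptions)."""
    def __init__(self, failure_rate=0.3, seed=None):
        self.rng = random.Random(seed)
        self.failure_rate = failure_rate

    def call(self, tool_fn, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            mode = self.rng.choice(["timeout", "unavailable", "rate_limit"])
            return {"ok": False, "error": mode}   # agent must detect & recover
        return {"ok": True, "result": tool_fn(*args, **kwargs)}

controller = StateController(failure_rate=1.0)   # always fail, for demo
resp = controller.call(lambda city: f"72F in {city}", "Paris")
# resp["ok"] is False; resp["error"] is one of the injected failure modes
```

Because the agent only sees the (possibly corrupted) response, this forces recovery behavior such as retries, fallbacks, or re-planning, which is exactly what "happy-path" benchmarks never exercise.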

5) Evaluation protocol

ToolGym evaluates agents on multiple axes, including:

  • Answer quality (completeness, grounding)
  • Robustness (schema compliance, recovery, flexibility)
  • Constraint following (format + other constraints)
  • Planning (goal decomposition, progress tracking, efficiency)
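One simple way to combine per-axis scores into a single number is a weighted mean; the sketch below assumes equal weights by default. The axis keys mirror the list above, but the aggregation scheme itself is an assumption, not ToolGym's published protocol.

```python
def aggregate_score(axes, weights=None):
    """Combine per-axis scores (each in [0, 1]) via a weighted mean.
    Equal weighting is an assumption, not ToolGym's actual scheme."""
    weights = weights or {k: 1.0 for k in axes}
    total = sum(weights[k] for k in axes)
    return sum(axes[k] * weights[k] for k in axes) / total

run = {
    "answer_quality": 0.8,
    "robustness": 0.6,
    "constraint_following": 1.0,
    "planning": 0.6,
}
overall = aggregate_score(run)   # unweighted mean of the four axes
```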

6) Planner–Actor decomposition

To better handle long-horizon objectives and error-prone tool ecosystems, we separate agent behavior into:

  • Planner: global reasoning & self-correction (keeps the agent aligned over long trajectories)
  • Actor: efficient step-by-step execution (retrieval → tool call → observe → iterate)
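The division of labor above can be sketched as a control loop. In practice both roles would be LLM-backed; here they are plain callables, and all names are illustrative.

```python
def run_agent(goal, planner, actor, max_rounds=30):
    """Minimal Planner-Actor loop (control flow only). The planner reflects
    on progress and picks the next step (or None to stop); the actor
    retrieves and executes a tool and returns an observation."""
    history = []
    for _ in range(max_rounds):
        step = planner(goal, history)   # global reasoning & self-correction
        if step is None:                # planner judges the goal complete
            break
        observation = actor(step)       # retrieve -> tool call -> observe
        history.append((step, observation))
    return history

# Toy roles: the planner yields two fixed steps, then stops.
plan = iter(["search", "book", None])
history = run_agent("book a flight", lambda g, h: next(plan),
                    lambda step: f"done:{step}")
# history == [("search", "done:search"), ("book", "done:book")]
```

Keeping the planner's global view separate from the actor's step-wise execution is what keeps the agent aligned over trajectories that average dozens of tool-use rounds.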

Leaderboard

We maintain a public leaderboard for ToolGym.
➡️ Leaderboard link


License

  • This organization and its public repos are released under the MIT license unless otherwise specified in each repo.

Contributing

Community contributions are welcome:

  • Open a discussion: /datasets/ToolGym/ToolGym/discussions
  • Submit PRs to the relevant repo (dataset / code / leaderboard Space)

Contact

For questions, collaborations, or leaderboard submissions, please open an issue/discussion or contact the maintainers via the links above.