---
sdk: static
---
# ToolGym
**ToolGym** is an **open-world tool-using environment** for *scalable agent testing and data curation*.
> Large tool pools • long-horizon workflows • wild constraints • unreliable tool states
---
## Quick links
- 🏆 **[Leaderboard](https://huggingface.co/spaces/ToolGym/leaderboard)**
- 📦 **Dataset(s)**: `/datasets/ToolGym/ToolGym`
- 📄 **[Paper](https://arxiv.org/abs/2601.06328)**
- 💻 **[Code](https://github.com/Ziqiao-git/ToolGym)**
---
## Key highlights
- **5,571** validated tools (unified in **MCP format**)
- **204** real-world apps covered across **276** MCP servers
- Long-horizon, constraint-dense tasks
- Avg. **28.5** tool-use rounds per task (**averaged across evaluated models**)
- A **State Controller** that injects realistic failures & drift
(timeouts, rate limits, transient unavailability, etc.)
- **Planner–Actor** agent framework
- ToolGym supports and releases data signals for **both**:
- **Planner**: deliberate reasoning, reflection, progress tracking, self-correction
- **Actor**: step-wise tool retrieval, invocation, and execution
- **Data-efficient training (experiment)**: we show strong gains using only **1,170** curated training samples
  (this is the *training subset used in our experiments*, not the full scale or upper bound of ToolGym as a data engine)
---
## What is ToolGym?
ToolGym is designed to close the gap between “clean” function-calling benchmarks and **messy real-world tool ecosystems**. It supports both:
- **Benchmarking**: stress-test agents on long, multi-tool workflows under constraints and failures
- **Data curation**: collect high-quality trajectories for training tool-using agents
---
## Core components
### 1) Tool universe (MCP)
We curate and validate a large library of production-like tools, then standardize them under a unified **Model Context Protocol (MCP)** interface so agents can call tools consistently across apps and servers.
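As a rough illustration, a unified tool descriptor in this spirit might carry a name, its MCP server, the underlying app, and a JSON-Schema-style input spec. The field names below are hypothetical, not ToolGym's actual schema:

```python
# Illustrative sketch of an MCP-style unified tool descriptor.
# All field names are assumptions for illustration, not ToolGym's schema.
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    name: str                 # unique tool identifier
    server: str               # MCP server the tool is hosted on
    app: str                  # real-world app the tool wraps
    description: str          # natural-language summary (used for retrieval)
    input_schema: dict = field(default_factory=dict)  # JSON-Schema-style args


weather = ToolSpec(
    name="get_forecast",
    server="weather-mcp",
    app="OpenWeather",
    description="Return a multi-day weather forecast for a city.",
    input_schema={
        "type": "object",
        "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
        "required": ["city"],
    },
)
```

Standardizing every tool behind one descriptor like this is what lets a single agent loop call tools from 276 different servers without per-app glue code.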
### 2) Tool retrieval index
Because open-world tool selection is the real challenge, ToolGym includes a retrieval layer so agents can search tools using natural language queries and load relevant tools on demand.
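A minimal sketch of such a retrieval layer, using simple token-overlap cosine similarity over tool descriptions (a real index would use dense embeddings; the tool entries here are made up):

```python
# Toy natural-language tool retrieval over descriptions.
# Real retrieval layers would use dense embeddings; token overlap keeps
# this sketch dependency-free.
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    # Bag-of-words representation of a string.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, tools, k=3):
    # Rank tools by similarity between the query and each description.
    q = tokenize(query)
    ranked = sorted(tools, key=lambda t: cosine(q, tokenize(t["description"])),
                    reverse=True)
    return ranked[:k]


tools = [
    {"name": "get_forecast", "description": "multi-day weather forecast for a city"},
    {"name": "send_email", "description": "send an email message to a recipient"},
    {"name": "book_flight", "description": "search and book airline flights"},
]
top = retrieve("what is the weather in Paris this week", tools, k=1)
```

With 5,571 tools, loading everything into context is infeasible, so on-demand retrieval like this is what makes the open-world setting tractable.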
### 3) Task creation engine
ToolGym synthesizes **long-horizon, multi-tool workflows** that resemble real user requests:
- multi-step dependencies
- cross-app orchestration
- dense constraints (format, ordering, trade-offs, verification requirements, etc.)
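A synthesized task of this shape can be pictured as a small DAG of steps plus explicit constraints. The sketch below is purely illustrative (hypothetical task and step names, not ToolGym's task format) and uses the stdlib to recover a valid execution order:

```python
# Hypothetical long-horizon task spec: steps with cross-app dependencies
# plus explicit constraints. Names are illustrative only.
from graphlib import TopologicalSorter

task = {
    "goal": "Plan a team offsite and email the itinerary",
    "constraints": ["budget under $2000", "itinerary as a markdown table"],
    # step -> set of steps it depends on
    "steps": {
        "search_venues": set(),
        "check_weather": set(),
        "pick_venue": {"search_venues", "check_weather"},
        "book_venue": {"pick_venue"},
        "email_itinerary": {"book_venue"},
    },
}

# Any valid execution order must respect the dependency edges.
order = list(TopologicalSorter(task["steps"]).static_order())
```

The dependency structure is what forces long horizons: an agent cannot email the itinerary until venue search, weather checks, and booking have all succeeded.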
### 4) State Controller (robustness testing)
To go beyond “happy-path” evaluation, ToolGym introduces a controllable middleware that can inject:
- tool-level failures (timeouts, temporary unavailability)
- state-level drift (corrupted/delayed results, expired sessions)
- constraint changes mid-execution (updated preferences, shifting deadlines)
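Conceptually, such a controller sits between the agent and every tool call. A toy version of the tool-level failure case (seeded for reproducibility; class and exception names are assumptions, not ToolGym's API) might look like:

```python
# Toy State-Controller-style middleware: wraps tool calls and injects
# timeouts at a configurable rate. Illustrative only, not ToolGym's API.
import random


class ToolTimeout(Exception):
    pass


class StateController:
    def __init__(self, failure_rate=0.2, seed=0):
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded for reproducible episodes

    def call(self, tool, *args, **kwargs):
        # Randomly inject a timeout instead of running the tool.
        if self.rng.random() < self.failure_rate:
            raise ToolTimeout(f"{tool.__name__} timed out (injected)")
        return tool(*args, **kwargs)


def get_forecast(city):
    return f"Sunny in {city}"


ctrl = StateController(failure_rate=0.5, seed=42)
results = []
for _ in range(4):
    try:
        results.append(ctrl.call(get_forecast, "Paris"))
    except ToolTimeout:
        results.append("FAILED")
```

Because failures are injected by middleware rather than by the tools themselves, the same task can be replayed under different reliability regimes to measure recovery behavior.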
### 5) Evaluation protocol
ToolGym evaluates agents on multiple axes, including:
- **Answer quality** (completeness, grounding)
- **Robustness** (schema compliance, recovery, flexibility)
- **Constraint following** (format + other constraints)
- **Planning** (goal decomposition, progress tracking, efficiency)
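Multi-axis scores are typically combined into a single headline number. A trivial sketch with made-up scores and weights (the axis names mirror the list above; the weighting scheme is an assumption, not ToolGym's protocol):

```python
# Toy aggregation of per-axis scores into one overall score.
# Scores and weights here are invented for illustration.
axis_scores = {
    "answer_quality": 0.82,
    "robustness": 0.64,
    "constraint_following": 0.71,
    "planning": 0.58,
}
weights = {
    "answer_quality": 0.4,
    "robustness": 0.2,
    "constraint_following": 0.2,
    "planning": 0.2,
}

overall = sum(axis_scores[a] * weights[a] for a in axis_scores)
```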
### 6) Planner–Actor decomposition
To better handle long-horizon objectives and error-prone tool ecosystems, we separate agent behavior into:
- **Planner**: global reasoning & self-correction (keeps the agent aligned over long trajectories)
- **Actor**: efficient step-by-step execution (retrieval → tool call → observe → iterate)
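The control flow of this decomposition can be sketched as a simple loop: the planner proposes the next subgoal from the remaining plan, the actor executes it and reports an observation, and the loop ends when the planner has nothing left to assign. Both functions below are stubs with invented subgoals, not ToolGym's interfaces:

```python
# Minimal Planner–Actor loop sketch (hypothetical interfaces and subgoals).

def planner(goal, history):
    # Global reasoning: pick the next uncompleted step of a fixed toy plan.
    plan = ["find venue", "book venue", "send invite"]
    done = {h["subgoal"] for h in history}
    for step in plan:
        if step not in done:
            return step
    return None  # goal complete


def actor(subgoal):
    # Step-wise execution: retrieve a tool, call it, return the observation.
    # Stubbed here; a real actor would do retrieval -> tool call -> observe.
    return {"subgoal": subgoal, "observation": f"completed: {subgoal}"}


history = []
while (subgoal := planner("organize meetup", history)) is not None:
    history.append(actor(subgoal))
```

Keeping the planner out of the per-step tool noise is what lets it track global progress and self-correct over trajectories averaging ~28 tool-use rounds.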
---
## Leaderboard
We maintain a public leaderboard for ToolGym.
➡️ **[Leaderboard link](https://huggingface.co/spaces/ToolGym/leaderboard)**
---
## License
- This organization and its public repos are released under the **MIT** license unless otherwise specified in each repo.
---
## Contributing
Community contributions are welcome:
- Open a discussion: `/datasets/ToolGym/ToolGym/discussions`
- Submit PRs to the relevant repo (dataset / code / leaderboard Space)
---
## Contact
For questions, collaborations, or leaderboard submissions, please open an issue/discussion or contact the maintainers via the links above.