---
sdk: static
---

# ToolGym

**ToolGym** is an **open-world tool-using environment** for *scalable agent testing and data curation*.

> Large tool pools • long-horizon workflows • wild constraints • unreliable tool states  

---

## Quick links

- 🏆 **[Leaderboard](https://huggingface.co/spaces/ToolGym/leaderboard)**
- 📦 **Dataset(s)**: `/datasets/ToolGym/ToolGym`
- 📄 **[Paper](https://arxiv.org/abs/2601.06328)**
- 💻 **[Code](https://github.com/Ziqiao-git/ToolGym)**

---

## Key highlights

- **5,571** validated tools (unified in **MCP format**)  
- **204** real-world apps covered, from **276** MCP servers  
- Long-horizon, constraint-dense tasks  
  - Avg. **28.5** tool-use rounds per task (**averaged across evaluated models**)  
- A **State Controller** that injects realistic failures & drift  
  (timeouts, rate limits, transient unavailability, etc.)  
- **Planner–Actor** agent framework  
  - ToolGym supports and releases data signals for **both**:  
    - **Planner**: deliberate reasoning, reflection, progress tracking, self-correction  
    - **Actor**: step-wise tool retrieval, invocation, and execution  
- **Data-efficient training (experiment)**: we show strong gains using only **1,170** curated training samples  
  (this number refers to the *training subset used in our experiments*, not the full scale/upper bound of ToolGym as a data engine)

---

## What is ToolGym?

ToolGym is designed to close the gap between “clean” function-calling benchmarks and **messy real-world tool ecosystems**. It supports both:

- **Benchmarking**: stress-test agents on long, multi-tool workflows under constraints and failures  
- **Data curation**: collect high-quality trajectories for training tool-using agents  

---

## Core components

### 1) Tool universe (MCP)

We curate and validate a large library of production-like tools, then standardize them under a unified **Model Context Protocol (MCP)** interface so agents can call tools consistently across apps and servers.
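
As a rough illustration of what a unified tool record might look like, the sketch below models one catalog entry. MCP tool definitions carry a `name`, a `description`, and a JSON Schema `inputSchema`. The `ToolSpec` class, the `server` field, the `missing_required` helper, and the example weather tool are all hypothetical and are not taken from the ToolGym codebase.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolSpec:
    """One entry in a unified, MCP-style tool catalog (hypothetical container)."""
    server: str        # assumed: identifier of the MCP server exposing the tool
    name: str          # tool name, mirrors MCP's `name`
    description: str   # natural-language summary, mirrors MCP's `description`
    input_schema: dict[str, Any] = field(default_factory=dict)  # mirrors `inputSchema`

    @property
    def qualified_name(self) -> str:
        # Prefix with the server so equal tool names on different servers stay distinct.
        return f"{self.server}/{self.name}"


def missing_required(tool: ToolSpec, arguments: dict[str, Any]) -> list[str]:
    """Return the required argument names that are absent from `arguments`."""
    required = tool.input_schema.get("required", [])
    return [key for key in required if key not in arguments]


# Hypothetical example tool, for illustration only.
weather = ToolSpec(
    server="weather-server",
    name="get_forecast",
    description="Return the forecast for a city.",
    input_schema={
        "type": "object",
        "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
        "required": ["city"],
    },
)

print(weather.qualified_name)
print(missing_required(weather, {"days": 3}))
```

Checking required arguments before dispatch is one cheap way to catch schema violations early. A full validator would check types against the JSON Schema as well.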

### 2) Tool retrieval index

Selecting the right tool from a large, open-world pool is a core challenge, so ToolGym includes a retrieval layer: agents search for tools with natural-language queries and load the relevant ones on demand.
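
To make the retrieval idea concrete, here is a toy sketch that ranks tools by token overlap (Jaccard similarity) between a query and each tool's name and description. This is a deliberately simple stand-in and not ToolGym's retrieval method. The `search_tools` function and the three-entry `CATALOG` are assumptions made for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_tools(query: str, catalog: dict[str, str], top_k: int = 3) -> list[str]:
    """Rank tool names by Jaccard overlap between query and tool text."""
    query_tokens = tokenize(query)
    scored = []
    for name, description in catalog.items():
        doc_tokens = tokenize(name + " " + description)
        union = query_tokens | doc_tokens
        score = len(query_tokens & doc_tokens) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


# Hypothetical mini-catalog: tool name -> description.
CATALOG = {
    "calendar.create_event": "Create a calendar event with a title and start time",
    "mail.send_message": "Send an email message to a recipient",
    "weather.get_forecast": "Get the weather forecast for a city",
}

print(search_tools("send an email to my manager", CATALOG, top_k=1))
```

A practical index over thousands of tools would more likely use dense embeddings or BM25. The interface (query in, ranked tool names out) can stay the same.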

### 3) Task creation engine

ToolGym synthesizes **long-horizon, multi-tool workflows** that resemble real user requests:
- multi-step dependencies
- cross-app orchestration
- dense constraints (format, ordering, trade-offs, verification requirements, etc.)
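
Multi-step dependencies of this kind can be pictured as a directed acyclic graph over workflow steps. The sketch below uses Python's standard `graphlib` to derive a valid execution order and to check that a proposed order respects every dependency. The step names and the `execution_order` and `is_valid_order` helpers are hypothetical, and this is not how ToolGym represents tasks internally.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return one order in which every step runs after its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


def is_valid_order(order: list[str], dependencies: dict[str, set[str]]) -> bool:
    """Check that each step appears after all of the steps it depends on."""
    position = {step: index for index, step in enumerate(order)}
    return all(
        position[prereq] < position[step]
        for step, prereqs in dependencies.items()
        for prereq in prereqs
    )


# Hypothetical cross-app workflow: step -> set of prerequisite steps.
steps = {
    "search_flights": set(),
    "check_calendar": set(),
    "book_flight": {"search_flights", "check_calendar"},
    "email_itinerary": {"book_flight"},
}

order = execution_order(steps)
print(order, is_valid_order(order, steps))

try:
    execution_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("cyclic dependencies cannot be scheduled")
```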

### 4) State Controller (robustness testing)

To go beyond “happy-path” evaluation, ToolGym introduces a controllable middleware that can inject:
- tool-level failures (timeouts, temporary unavailability)
- state-level drift (corrupted/delayed results, expired sessions)
- constraint changes mid-execution (updated preferences, shifting deadlines)
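
The tool-level case can be sketched as a thin wrapper around a tool callable that raises a simulated timeout with a configurable probability. Everything here (`FaultInjector`, `ToolTimeout`, `call_with_retry`, the `failure_rate` parameter) is a hypothetical illustration of the middleware idea, not the State Controller's actual interface.

```python
import random
from typing import Any, Callable


class ToolTimeout(Exception):
    """Simulated tool-level failure."""


class FaultInjector:
    """Wraps a tool and injects timeouts with a fixed probability."""

    def __init__(self, tool: Callable[..., Any], failure_rate: float, seed: int = 0):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be in [0, 1]")
        self._tool = tool
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so a run can be reproduced
        self.injected = 0                # number of failures injected so far

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        if self._rng.random() < self._failure_rate:
            self.injected += 1
            raise ToolTimeout("injected timeout")
        return self._tool(*args, **kwargs)


def call_with_retry(tool: Callable[..., Any], *args: Any,
                    max_attempts: int = 3, **kwargs: Any) -> Any:
    """Retry on simulated timeouts; re-raise the last one if all attempts fail."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return tool(*args, **kwargs)
        except ToolTimeout:
            if attempt == max_attempts - 1:
                raise


def add(a: int, b: int) -> int:
    return a + b


reliable = FaultInjector(add, failure_rate=0.0)  # never fails
broken = FaultInjector(add, failure_rate=1.0)    # always fails
print(reliable(2, 3))
```

Seeding the injector keeps failure patterns identical across runs, which matters when comparing agents under the same perturbations.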

### 5) Evaluation protocol

ToolGym evaluates agents on multiple axes, including:
- **Answer quality** (completeness, grounding)
- **Robustness** (schema compliance, recovery, flexibility)
- **Constraint following** (format + other constraints)
- **Planning** (goal decomposition, progress tracking, efficiency)

### 6) Planner–Actor decomposition

To better handle long-horizon objectives and error-prone tool ecosystems, we separate agent behavior into:
- **Planner**: global reasoning & self-correction (keeps the agent aligned over long trajectories)  
- **Actor**: efficient step-by-step execution (retrieval → tool call → observe → iterate)  
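
The split can be pictured as a small control loop: the Planner owns the list of sub-goals and decides what to attempt next, while the Actor executes one step and reports what it observed. The sketch below is a minimal illustration under strong assumptions. The `Planner`, `Actor`, and `run` names are hypothetical, each sub-goal maps directly to one tool, and "self-correction" is reduced to retrying a failed goal.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Planner:
    """Tracks progress over sub-goals (hypothetical, heavily simplified)."""
    pending: list[str]
    completed: list[str] = field(default_factory=list)

    def next_goal(self) -> Optional[str]:
        return self.pending[0] if self.pending else None

    def record(self, goal: str, succeeded: bool) -> None:
        # On failure the goal stays pending, so the next round retries it.
        if succeeded:
            self.pending.remove(goal)
            self.completed.append(goal)


@dataclass
class Actor:
    """Executes a single step by calling the tool registered for a goal."""
    tools: dict[str, Callable[[], str]]

    def execute(self, goal: str) -> tuple[bool, str]:
        tool = self.tools.get(goal)
        if tool is None:
            return False, f"no tool for {goal!r}"
        try:
            return True, tool()
        except Exception as error:  # report the failure as an observation
            return False, str(error)


def run(planner: Planner, actor: Actor, max_rounds: int = 10) -> list[tuple[str, bool, str]]:
    """Alternate planning and acting until done or the round budget runs out."""
    trace = []
    for _ in range(max_rounds):
        goal = planner.next_goal()
        if goal is None:
            break
        ok, observation = actor.execute(goal)
        planner.record(goal, ok)
        trace.append((goal, ok, observation))
    return trace


planner = Planner(pending=["find_flight"])
actor = Actor(tools={"find_flight": lambda: "LH123"})
print(run(planner, actor))
```

The `max_rounds` budget bounds the loop when a goal can never succeed, for example when no tool is registered for it.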

---

## Leaderboard

We maintain a public leaderboard for ToolGym.  
➡️ **[Leaderboard link](https://huggingface.co/spaces/ToolGym/leaderboard)**

---

## License

- This organization and its public repos are released under the **MIT** license unless otherwise specified in each repo.

---

## Contributing

Community contributions are welcome:
- Open a discussion: `/datasets/ToolGym/ToolGym/discussions`
- Submit PRs to the relevant repo (dataset / code / leaderboard Space)

---

## Contact

For questions, collaborations, or leaderboard submissions, please open an issue/discussion or contact the maintainers via the links above.