Commit 84ab7ca (verified), committed by xiziqiao · 1 Parent(s): a7c6e7e

Update README.md

Files changed (1): README.md +98 -7
@@ -1,10 +1,101 @@
  ---
- title: README
- emoji: 🔥
- colorFrom: blue
- colorTo: indigo
- sdk: static
- pinned: false
  ---
 
- Edit this `README.md` markdown file to author your organization card.
+ # ToolGym
+
+ **ToolGym** is an **open-world tool-using environment** for *scalable agent testing and data curation*.
+
+ > Large tool pools • long-horizon workflows • wild constraints • unreliable tool states
+ > A scalable evaluation and data engine for real-world tool ecosystems
+
+ ---
+
+ ## Quick links
+
+ - 🏆 **Leaderboard**: **(add link here)** (e.g., `/spaces/ToolGym/leaderboard`)
+ - 📦 **Dataset(s)**: `/datasets/ToolGym/ToolGym`
+ - 💬 **Discussions**: `/datasets/ToolGym/ToolGym/discussions`
+ - 📄 **Paper / Technical report**: **(add link here)**
+ - 💻 **Code**: **(add link here)**
+
+ ---
+
+ ## Key stats
+
+ - **5,571** validated tools (unified in **MCP format**)
+ - **204** real-world apps covered, from **276** MCP servers
+ - Long-horizon tasks with **wild, realistic constraints** (avg. **28.5** tool-use rounds per task)
+ - A **State Controller** that injects realistic failures & drift (timeouts, rate limits, transient unavailability, etc.)
+ - An evaluation protocol that scores **quality, robustness, constraint following, and planning**
+ - **1,170** tool-use trajectories curated for instruction tuning / training
+
+ (Stats and design details are summarized from our paper draft.)
+
  ---
+
+ ## What is ToolGym?
+
+ ToolGym is designed to close the gap between “clean” function-calling benchmarks and **messy real-world tool ecosystems**. It supports both:
+
+ - **Benchmarking**: stress-test agents on long, multi-tool workflows under constraints and failures
+ - **Data curation**: automatically collect high-quality trajectories for training tool-using agents
+
  ---
 
+ ## Core components
+
+ ### 1) Tool universe (MCP)
+
+ We curate and validate a large library of production-like tools, then standardize them under a unified **Model Context Protocol (MCP)** interface so agents can call tools consistently across apps and servers.
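
As a rough illustration of what a unified MCP interface implies, the sketch below shows a tool descriptor in the shape MCP servers return from `tools/list` (`name`, `description`, `inputSchema`) and a JSON-RPC 2.0 `tools/call` request built from it. The tool `search_issues` and both helper functions are hypothetical and are not taken from ToolGym.

```python
# Hypothetical tool descriptor in the shape returned by MCP `tools/list`:
# a name, a description, and a JSON Schema describing the arguments.
TOOL = {
    "name": "search_issues",  # illustrative name, not a real ToolGym tool
    "description": "Search issues in a tracker by keyword.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "limit": {"type": "integer"},
        },
        "required": ["query"],
    },
}


def missing_required(tool: dict, arguments: dict) -> list[str]:
    """Return the required argument names that are absent from `arguments`."""
    required = tool["inputSchema"].get("required", [])
    return [name for name in required if name not in arguments]


def build_call_request(tool: dict, arguments: dict, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` request for the given tool."""
    missing = missing_required(tool, arguments)
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool["name"], "arguments": arguments},
    }
```

Because every tool exposes the same descriptor shape, an agent could in principle use one code path like `build_call_request` for any app or server in the pool.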
+
+ ### 2) Tool retrieval index
+
+ Because open-world tool selection is the real challenge, ToolGym includes a retrieval layer so agents can search tools using natural language queries and load relevant tools on demand.
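
To make the idea of searching tools by natural language concrete, here is a minimal sketch that ranks tools by TF-IDF cosine similarity between a query and each tool's name and description. This is an assumption for illustration only: the README does not say which retrieval method ToolGym uses, and the catalog entries and function names below are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical mini catalog; real ToolGym tool names and descriptions will differ.
CATALOG = {
    "send_email": "Send an email message to a recipient",
    "create_calendar_event": "Create a calendar event with a start time and attendees",
    "search_flights": "Search flights between two airports on a given date",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(catalog: dict[str, str]) -> tuple[dict[str, Counter], dict[str, float]]:
    """Return per-tool term counts and a smoothed IDF weight for each term."""
    counts = {
        name: Counter(tokenize(name.replace("_", " ") + " " + desc))
        for name, desc in catalog.items()
    }
    n_docs = len(counts)
    doc_freq = Counter(term for c in counts.values() for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in doc_freq.items()}
    return counts, idf


def search(query: str, counts: dict[str, Counter], idf: dict[str, float],
           top_k: int = 2) -> list[str]:
    """Return up to `top_k` tool names with non-zero cosine similarity to the query."""
    def weigh(c: Counter) -> dict[str, float]:
        return {t: f * idf.get(t, 0.0) for t, f in c.items()}

    def norm(v: dict[str, float]) -> float:
        return math.sqrt(sum(w * w for w in v.values()))

    q = weigh(Counter(tokenize(query)))
    q_norm = norm(q)
    scored = []
    for name, c in counts.items():
        d = weigh(c)
        d_norm = norm(d)
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        score = dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
        scored.append((score, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:top_k] if score > 0]
```

An agent would call `build_index` once and then `search` per step, loading only the returned tools into its context instead of the full pool.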
+
+ ### 3) Task creation engine
+
+ ToolGym can synthesize **long-horizon, multi-tool workflows** that look like real user requests:
+ - multi-step dependencies
+ - cross-app orchestration
+ - dense constraints (format, ordering, trade-offs, verification requirements, etc.)
+
+ ### 4) State Controller (robustness testing)
+
+ To go beyond “happy-path” evaluation, ToolGym introduces a controllable middleware that can inject:
+ - tool-level failures (timeouts, temporary unavailability)
+ - state-level drift (corrupted/delayed results, expired sessions)
+ - constraint changes mid-execution (updated preferences, shifting deadlines)
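
The tool-level failures listed above can be sketched as a thin wrapper around tool calls. This is a minimal illustration under stated assumptions, not ToolGym's State Controller: the class `FaultInjector`, its rate parameters, the exception types, and `call_with_retry` are all hypothetical names, and only timeouts and temporary unavailability are modeled.

```python
import random
from dataclasses import dataclass, field
from typing import Any, Callable


class ToolTimeout(Exception):
    """Simulated timeout raised by the injector."""


class ToolUnavailable(Exception):
    """Simulated temporary unavailability raised by the injector."""


@dataclass
class FaultInjector:
    """Hypothetical middleware that wraps tool calls and injects failures at fixed rates."""

    timeout_rate: float = 0.0
    unavailable_rate: float = 0.0
    seed: int = 0
    log: list[tuple[str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # A seeded generator keeps injected failures reproducible across runs.
        self._rng = random.Random(self.seed)

    def call(self, name: str, tool: Callable[..., Any], /, **arguments: Any) -> Any:
        """Call `tool`, or raise a simulated failure and record what happened."""
        roll = self._rng.random()
        if roll < self.timeout_rate:
            self.log.append((name, "timeout"))
            raise ToolTimeout(name)
        if roll < self.timeout_rate + self.unavailable_rate:
            self.log.append((name, "unavailable"))
            raise ToolUnavailable(name)
        self.log.append((name, "ok"))
        return tool(**arguments)


def call_with_retry(injector: FaultInjector, name: str, tool: Callable[..., Any], /,
                    max_attempts: int = 3, **arguments: Any) -> Any:
    """Agent-side recovery: retry on simulated failures, re-raising after the last attempt."""
    for attempt in range(max_attempts):
        try:
            return injector.call(name, tool, **arguments)
        except (ToolTimeout, ToolUnavailable):
            if attempt == max_attempts - 1:
                raise
```

Keeping the injector between the agent and the tool means the same task can be replayed with and without failures, and the `log` gives a record of what the agent had to recover from.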
+
+ ### 5) Evaluation protocol
+
+ ToolGym evaluates agents on multiple axes, including:
+ - **Answer quality** (completeness, grounding)
+ - **Robustness** (schema compliance, recovery, flexibility)
+ - **Constraint following** (format + other constraints)
+ - **Planning** (goal decomposition, progress tracking, efficiency)
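
One way to picture multi-axis scoring is to group per-criterion scores under the four axes listed above and average within each axis. The axis and criterion names follow the list; the 0 to 1 scale, the unweighted mean, and the function name `axis_scores` are assumptions made for this sketch, since the README does not specify how scores are computed or combined.

```python
from statistics import mean

# Axes and sub-criteria as listed above. The identifiers are illustrative spellings.
RUBRIC = {
    "answer_quality": ["completeness", "grounding"],
    "robustness": ["schema_compliance", "recovery", "flexibility"],
    "constraint_following": ["format", "other_constraints"],
    "planning": ["goal_decomposition", "progress_tracking", "efficiency"],
}


def axis_scores(criterion_scores: dict[str, float]) -> dict[str, float]:
    """Average per-criterion scores (assumed to lie in [0, 1]) within each axis."""
    result = {}
    for axis, criteria in RUBRIC.items():
        values = [criterion_scores[c] for c in criteria]
        if any(not 0.0 <= v <= 1.0 for v in values):
            raise ValueError(f"scores for {axis} must lie in [0, 1]")
        result[axis] = mean(values)
    return result
```

Reporting the axes separately, rather than collapsing them into one number, keeps it visible when an agent produces a good answer but recovers poorly from failures.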
+
+ ---
+
+ ## Leaderboard
+
+ We maintain a public leaderboard for ToolGym.
+ ➡️ **Leaderboard link**: **(add link here)**
+
+ If you use our leaderboard results, please cite the corresponding paper/technical report (link above).
+
+ ---
+
+ ## License
+
+ - This organization and its public repos are released under the **MIT** license unless otherwise specified in each repo.
+
+ ---
+
+ ## Contributing
+
+ Community contributions are welcome:
+ - Open a discussion: `/datasets/ToolGym/ToolGym/discussions`
+ - Submit PRs to the relevant repo (dataset / code / leaderboard Space)
+
+ ---
+
+ ## Contact
+
+ For questions, collaborations, or leaderboard submissions, please open an issue/discussion or contact the maintainers via the links above.