xiziqiao committed on
Commit ccf61fc · verified · 1 Parent(s): 84ab7ca

Update README.md

for more accurate description

Files changed (1)
  1. README.md +26 -14
README.md CHANGED
@@ -1,3 +1,7 @@
+---
+sdk: static
+---
+
 # ToolGym
 
 **ToolGym** is an **open-world tool-using environment** for *scalable agent testing and data curation*.
@@ -9,7 +13,7 @@
 
 ## Quick links
 
-- 🏆 **Leaderboard**: **(add link here)** — e.g., `/spaces/ToolGym/leaderboard`
+- 🏆 **Leaderboard**: **(add link here)**
 - 📦 **Dataset(s)**: `/datasets/ToolGym/ToolGym`
 - 💬 **Discussions**: `/datasets/ToolGym/ToolGym/discussions`
 - 📄 **Paper / Technical report**: **(add link here)**
@@ -17,16 +21,20 @@
 
 ---
 
-## Key stats
+## Key highlights
 
 - **5,571** validated tools (unified in **MCP format**)
 - **204** real-world apps covered, from **276** MCP servers
-- Long-horizon tasks with **wild, realistic constraints** (avg. **28.5** tool-use rounds per task)
-- A **State Controller** that injects realistic failures & drift (timeouts, rate limits, transient unavailability, etc.)
-- An evaluation protocol that scores **quality, robustness, constraint following, and planning**
-- **1,170** tool-use trajectories curated for instruction tuning / training
-
-(Stats and design details are summarized from our paper draft.) :contentReference[oaicite:0]{index=0}
+- Long-horizon, constraint-dense tasks
+- Avg. **28.5** tool-use rounds per task (**averaged across evaluated models**)
+- A **State Controller** that injects realistic failures & drift
+(timeouts, rate limits, transient unavailability, etc.)
+- **Planner–Actor** agent framework
+- ToolGym supports and releases data signals for **both**:
+- **Planner**: deliberate reasoning, reflection, progress tracking, self-correction
+- **Actor**: step-wise tool retrieval, invocation, and execution
+- **Data-efficient training (experiment)**: we show strong gains using only **1,170** curated training samples
+(this number refers to the *training subset used in our experiments*, not the full scale/upper bound of ToolGym as a data engine)
 
 ---
 
@@ -34,8 +42,8 @@
 
 ToolGym is designed to close the gap between “clean” function-calling benchmarks and **messy real-world tool ecosystems**. It supports both:
 
-- **Benchmarking**: stress-test agents on long, multi-tool workflows under constraints and failures
-- **Data curation**: automatically collect high-quality trajectories for training tool-using agents
+- **Benchmarking**: stress-test agents on long, multi-tool workflows under constraints and failures
+- **Data curation**: collect high-quality trajectories for training tool-using agents
 
 ---
 
@@ -51,7 +59,7 @@ Because open-world tool selection is the real challenge, ToolGym includes a retr
 
 ### 3) Task creation engine
 
-ToolGym can synthesize **long-horizon, multi-tool workflows** that look like real user requests:
+ToolGym synthesizes **long-horizon, multi-tool workflows** that resemble real user requests:
 - multi-step dependencies
 - cross-app orchestration
 - dense constraints (format, ordering, trade-offs, verification requirements, etc.)
@@ -71,6 +79,12 @@ ToolGym evaluates agents on multiple axes, including:
 - **Constraint following** (format + other constraints)
 - **Planning** (goal decomposition, progress tracking, efficiency)
 
+### 6) Planner–Actor decomposition
+
+To better handle long-horizon objectives and error-prone tool ecosystems, we separate agent behavior into:
+- **Planner**: global reasoning & self-correction (keeps the agent aligned over long trajectories)
+- **Actor**: efficient step-by-step execution (retrieval → tool call → observe → iterate)
+
 ---
 
 ## Leaderboard
@@ -78,8 +92,6 @@ ToolGym evaluates agents on multiple axes, including:
 We maintain a public leaderboard for ToolGym.
 ➡️ **Leaderboard link**: **(add link here)**
 
-If you use our leaderboard results, please cite the corresponding paper/technical report (link above).
-
 ---
 
 ## License
@@ -98,4 +110,4 @@ Community contributions are welcome:
 
 ## Contact
 
-For questions, collaborations, or leaderboard submissions, please open an issue/discussion or contact the maintainers via the links above.
+For questions, collaborations, or leaderboard submissions, please open an issue/discussion or contact the maintainers via the links above.
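
Reviewer note: the README change above describes a State Controller that injects failures and a Planner–Actor split, but shows no code. The sketch below is one way to picture how those pieces might interact. It is not ToolGym's implementation: `StateController`, `TransientToolError`, `actor_step`, and `run_plan` are hypothetical names introduced only for illustration, and the "planner" here is reduced to walking a fixed list of steps and tracking progress.

```python
import random
from dataclasses import dataclass, field


class TransientToolError(Exception):
    """Hypothetical error standing in for timeouts, rate limits, and similar failures."""


@dataclass
class StateController:
    """Hypothetical wrapper that makes a tool call fail with probability `failure_rate`."""
    failure_rate: float
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def call(self, tool, *args):
        # random() is in [0.0, 1.0), so 0.0 never fails and 1.0 always fails.
        if self.rng.random() < self.failure_rate:
            raise TransientToolError("injected failure")
        return tool(*args)


def actor_step(controller, tool, args, max_retries=3):
    """Actor: invoke one tool, retrying when the controller injects a failure.

    Returns (succeeded, output).
    """
    for _ in range(max_retries):
        try:
            return True, controller.call(tool, *args)
        except TransientToolError:
            continue
    return False, None


def run_plan(plan, controller, max_retries=3):
    """Planner (simplified): execute ordered (tool, args) steps and track progress.

    Returns (outputs_so_far, completed). A fuller planner could re-plan or
    pick a fallback tool instead of stopping at the first exhausted step.
    """
    outputs = []
    for tool, args in plan:
        succeeded, output = actor_step(controller, tool, args, max_retries)
        if not succeeded:
            return outputs, False
        outputs.append(output)
    return outputs, True


if __name__ == "__main__":
    plan = [(lambda a, b: a + b, (1, 2)), (str.upper, ("ok",))]
    print(run_plan(plan, StateController(failure_rate=0.3)))
```

Keeping retries inside the Actor and the stop-or-continue decision inside the Planner mirrors the division of labor the README describes, under the assumptions stated above.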