Update README.md for more accurate description
README.md CHANGED

---
sdk: static
---

# ToolGym

**ToolGym** is an **open-world tool-using environment** for *scalable agent testing and data curation*.

## Quick links

- 🏆 **Leaderboard**: **(add link here)**
- 📦 **Dataset(s)**: `/datasets/ToolGym/ToolGym` (a loading sketch follows this list)
- 💬 **Discussions**: `/datasets/ToolGym/ToolGym/discussions`
- 📄 **Paper / Technical report**: **(add link here)**
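
The dataset path above corresponds to a Hugging Face repo id, so it can be pulled programmatically. A minimal sketch using the `datasets` library; the split name `train` is an assumption, not documented here:

```python
from datasets import load_dataset

# Illustrative only: repo id taken from the dataset path above;
# the "train" split name is an assumption.
ds = load_dataset("ToolGym/ToolGym", split="train")
print(ds[0])  # inspect one record (e.g., a task or trajectory)
```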

---

## Key highlights

- **5,571** validated tools (unified in **MCP format**; an illustrative tool entry appears after this list)
- **204** real-world apps covered, from **276** MCP servers
- Long-horizon, constraint-dense tasks
- Avg. **28.5** tool-use rounds per task (**averaged across evaluated models**)
- A **State Controller** that injects realistic failures & drift (timeouts, rate limits, transient unavailability, etc.); a failure-injection sketch also follows this list
- **Planner–Actor** agent framework
  - ToolGym supports and releases data signals for **both**:
    - **Planner**: deliberate reasoning, reflection, progress tracking, self-correction
    - **Actor**: step-wise tool retrieval, invocation, and execution
- **Data-efficient training (experiment)**: we show strong gains using only **1,170** curated training samples (this number refers to the *training subset used in our experiments*, not the full scale/upper bound of ToolGym as a data engine)
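
Neither the tool schema nor the State Controller interface is spelled out in this README, so the two sketches below are illustrative only. First, MCP tools are declared with a name, a description, and a JSON-Schema `inputSchema`; a representative (hypothetical) entry:

```python
# Hypothetical MCP-format tool entry: name + description + JSON-Schema input.
forecast_tool = {
    "name": "get_forecast",
    "description": "Fetch a weather forecast for a city.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "days": {"type": "integer", "minimum": 1, "maximum": 7},
        },
        "required": ["city"],
    },
}
```

Second, a minimal sketch of the failure-injection idea behind the State Controller (all names hypothetical): wrap every tool invocation and probabilistically surface timeouts, rate limits, or transient unavailability:

```python
import random

class TransientToolError(Exception):
    """Simulated transient failure (rate limit, temporary unavailability)."""

class StateController:
    """Illustrative failure injector, not ToolGym's actual implementation."""

    def __init__(self, failure_rate: float = 0.1, seed: int = 0):
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)

    def call(self, tool_fn, *args, **kwargs):
        # With some probability, replace the real call with a realistic failure.
        if self.rng.random() < self.failure_rate:
            kind = self.rng.choice(["timeout", "rate_limit", "unavailable"])
            if kind == "timeout":
                raise TimeoutError("simulated tool timeout")
            raise TransientToolError(f"simulated {kind}")
        return tool_fn(*args, **kwargs)
```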

---

ToolGym is designed to close the gap between “clean” function-calling benchmarks and **messy real-world tool ecosystems**. It supports both:

- **Benchmarking**: stress-test agents on long, multi-tool workflows under constraints and failures
- **Data curation**: collect high-quality trajectories for training tool-using agents

---

### 3) Task creation engine

ToolGym synthesizes **long-horizon, multi-tool workflows** that resemble real user requests (a toy task encoding is sketched after this list):
- multi-step dependencies
- cross-app orchestration
- dense constraints (format, ordering, trade-offs, verification requirements, etc.)
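
The concrete task format is not shown in this excerpt; purely as illustration (all field and tool names hypothetical), a long-horizon, cross-app task with dependencies and constraints could be encoded like this:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str                                             # MCP tool to invoke
    depends_on: list[int] = field(default_factory=list)   # prerequisite step indices

@dataclass
class Task:
    goal: str
    steps: list[Step]
    constraints: list[str]                                # format/ordering/verification rules

# Toy cross-app workflow: gather in one app, transform, then act in another.
task = Task(
    goal="Summarize this week's GitHub issues and post the digest to Slack",
    steps=[
        Step(tool="github.list_issues"),
        Step(tool="llm.summarize", depends_on=[0]),
        Step(tool="slack.post_message", depends_on=[1]),
    ],
    constraints=[
        "digest must be a markdown bullet list",
        "post only after the summary is verified",
    ],
)
```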

ToolGym evaluates agents on multiple axes, including:

- **Constraint following** (format + other constraints); a toy checker is sketched after this list
- **Planning** (goal decomposition, progress tracking, efficiency)
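
How constraint checks are implemented is not specified here; as a hedged illustration, a format constraint could be a simple predicate over the agent's final answer:

```python
import re

# Hypothetical format-constraint checker: the final answer must be a
# markdown bullet list with at least `min_items` items.
def follows_bullet_format(answer: str, min_items: int = 3) -> bool:
    bullets = [ln for ln in answer.splitlines() if re.match(r"^\s*[-*] ", ln)]
    return len(bullets) >= min_items

assert follows_bullet_format("- a\n- b\n- c")
assert not follows_bullet_format("plain paragraph answer")
```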

### 6) Planner–Actor decomposition

To better handle long-horizon objectives and error-prone tool ecosystems, we separate agent behavior into (a minimal loop is sketched after this list):

- **Planner**: global reasoning & self-correction (keeps the agent aligned over long trajectories)
- **Actor**: efficient step-by-step execution (retrieval → tool call → observe → iterate)
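
The framework's real interfaces are not given in this excerpt; a minimal sketch of the division of labor (all names hypothetical):

```python
# Hypothetical Planner–Actor loop: the Planner owns the global plan and
# self-correction; the Actor performs one retrieve -> call -> observe step.
def run_episode(planner, actor, task, max_rounds: int = 50):
    plan = planner.make_plan(task)                # goal decomposition
    history = []
    for _ in range(max_rounds):
        step = planner.next_step(plan, history)   # progress tracking
        if step is None:                          # planner judges the task done
            break
        tool = actor.retrieve_tool(step)          # step-wise tool retrieval
        observation = actor.invoke(tool, step)    # invocation + execution
        history.append((step, observation))
        plan = planner.reflect(plan, history)     # reflection / self-correction
    return history
```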

---

## Leaderboard

We maintain a public leaderboard for ToolGym.
➡️ **Leaderboard link**: **(add link here)**

---

## License

## Contact

For questions, collaborations, or leaderboard submissions, please open an issue/discussion or contact the maintainers via the links above.