xiziqiao committed on Jan 9

Commit 84ab7ca (verified) · 1 Parent(s): a7c6e7e

Update README.md

Files changed (1)

README.md +98 -7

@@ -1,10 +1,101 @@
 ---
-title: README
-emoji: 🔥
-colorFrom: blue
-colorTo: indigo
-sdk: static
-pinned: false
 ---
-Edit this `README.md` markdown file to author your organization card.

+# ToolGym
+**ToolGym** is an **open-world tool-using environment** for *scalable agent testing and data curation*.
+> Large tool pools • long-horizon workflows • wild constraints • unreliable tool states
+> A scalable evaluation and data engine for real-world tool ecosystems
+---
+## Quick links
+- 🏆 **Leaderboard**: **(add link here)** — e.g., `/spaces/ToolGym/leaderboard`
+- 📦 **Dataset(s)**: `/datasets/ToolGym/ToolGym`
+- 💬 **Discussions**: `/datasets/ToolGym/ToolGym/discussions`
+- 📄 **Paper / Technical report**: **(add link here)**
+- 💻 **Code**: **(add link here)**
+---
+## Key stats
+- **5,571** validated tools (unified in **MCP format**)
+- **204** real-world apps covered, from **276** MCP servers
+- Long-horizon tasks with **wild, realistic constraints** (avg. **28.5** tool-use rounds per task)
+- A **State Controller** that injects realistic failures & drift (timeouts, rate limits, transient unavailability, etc.)
+- An evaluation protocol that scores **quality, robustness, constraint following, and planning**
+- **1,170** tool-use trajectories curated for instruction tuning / training
+(Stats and design details are summarized from our paper draft.)
 ---
+## What is ToolGym?
+ToolGym is designed to close the gap between “clean” function-calling benchmarks and **messy real-world tool ecosystems**. It supports both:
+- **Benchmarking**: stress-test agents on long, multi-tool workflows under constraints and failures
+- **Data curation**: automatically collect high-quality trajectories for training tool-using agents
 ---
+## Core components
+### 1) Tool universe (MCP)
+We curate and validate a large library of production-like tools, then standardize them under a unified **Model Context Protocol (MCP)** interface so agents can call tools consistently across apps and servers.
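As a rough illustration of what a unified interface buys the agent, the sketch below describes one tool the way MCP tool definitions are commonly shaped (a `name`, a `description`, and a JSON Schema under `inputSchema`) and checks a call against it. The tool itself (`get_forecast`) and the helper `missing_arguments` are hypothetical and are not taken from the ToolGym library.

```python
# Hypothetical tool descriptor; the name, description and fields are
# illustrative assumptions, not an entry from the ToolGym tool pool.
WEATHER_TOOL = {
    "name": "get_forecast",
    "description": "Return a weather forecast for a city",
    "inputSchema": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "days": {"type": "integer"},
        },
        "required": ["city"],
    },
}


def missing_arguments(tool, arguments):
    """Return the required argument names that the call does not supply."""
    required = tool["inputSchema"].get("required", [])
    return [name for name in required if name not in arguments]


print(missing_arguments(WEATHER_TOOL, {"days": 3}))
print(missing_arguments(WEATHER_TOOL, {"city": "Oslo"}))
```

Because every tool exposes the same descriptor shape, one check like this can serve all apps and servers instead of one validator per API.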
+### 2) Tool retrieval index
+Selecting the right tool from a large, open-world pool is itself a core challenge, so ToolGym includes a retrieval layer: agents search for tools with natural-language queries and load only the relevant ones on demand.
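The retrieval idea can be sketched with a simple keyword-overlap ranker. This is only an assumption-laden stand-in: the README does not say how ToolGym's index is built, and the tool names, descriptions, and the `search_tools` function below are hypothetical.

```python
import re

# Hypothetical tool catalogue; names and descriptions are illustrative only.
TOOLS = {
    "calendar.create_event": "Create a calendar event with title, start time and attendees",
    "mail.send_message": "Send an email message to one or more recipients",
    "files.search": "Search files and documents by keyword",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search_tools(query, tools=TOOLS, top_k=2):
    """Rank tools by how many query tokens appear in their name or description."""
    query_tokens = set(tokenize(query))
    scored = []
    for name, description in tools.items():
        doc_tokens = set(tokenize(name + " " + description))
        score = len(query_tokens & doc_tokens)
        if score > 0:
            scored.append((score, name))
    # Highest overlap first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


print(search_tools("send an email to the team"))
```

A production index would more likely use embeddings or BM25; the point here is only the interface: a natural-language query goes in, and a short list of tool names to load comes out.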
+### 3) Task creation engine
+ToolGym can synthesize **long-horizon, multi-tool workflows** that look like real user requests:
+- multi-step dependencies
+- cross-app orchestration
+- dense constraints (format, ordering, trade-offs, verification requirements, etc.)
+### 4) State Controller (robustness testing)
+To go beyond “happy-path” evaluation, ToolGym introduces a controllable middleware that can inject:
+- tool-level failures (timeouts, temporary unavailability)
+- state-level drift (corrupted/delayed results, expired sessions)
+- constraint changes mid-execution (updated preferences, shifting deadlines)
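A minimal sketch of the middleware idea, assuming a wrapper that sits between the agent and the tool and occasionally raises a simulated timeout. The class `StateController`, its `timeout_rate` parameter, the `InjectedToolTimeout` exception, and the `get_weather` tool are all hypothetical names chosen for illustration; the actual controller's interface is not described in this README.

```python
import random


class InjectedToolTimeout(Exception):
    """Raised in place of a real tool result to simulate a timeout."""


class StateController:
    """Hypothetical middleware that injects timeouts with a fixed probability."""

    def __init__(self, timeout_rate=0.0, seed=0):
        self.timeout_rate = timeout_rate
        self.rng = random.Random(seed)  # seeded so runs are reproducible

    def call(self, tool, **kwargs):
        # random() is in [0, 1), so a rate of 0.0 never fails and 1.0 always fails.
        if self.rng.random() < self.timeout_rate:
            raise InjectedToolTimeout(f"{tool.__name__} timed out (injected)")
        return tool(**kwargs)


def call_with_retry(controller, tool, max_attempts=3, **kwargs):
    """Agent-side recovery: retry on injected timeouts, re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return controller.call(tool, **kwargs)
        except InjectedToolTimeout:
            if attempt == max_attempts:
                raise


def get_weather(city):
    """Stand-in tool used only for this example."""
    return {"city": city, "forecast": "sunny"}


print(call_with_retry(StateController(timeout_rate=0.0), get_weather, city="Oslo"))
```

Keeping the injection logic outside both the agent and the tool means the same task can be replayed under different failure settings without changing either side.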
+### 5) Evaluation protocol
+ToolGym evaluates agents on multiple axes, including:
+- **Answer quality** (completeness, grounding)
+- **Robustness** (schema compliance, recovery, flexibility)
+- **Constraint following** (format + other constraints)
+- **Planning** (goal decomposition, progress tracking, efficiency)
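To make the multi-axis idea concrete, the sketch below stores per-criterion scores in [0, 1] for each axis and reports one number per axis. The unweighted mean, the `AxisScore` class, and the sample scores are assumptions for illustration; this README does not specify how ToolGym aggregates its criteria.

```python
from dataclasses import dataclass


@dataclass
class AxisScore:
    """One evaluation axis with its per-criterion scores in [0, 1]."""
    name: str
    criteria: dict

    def mean(self):
        # Assumed aggregation: a plain unweighted average of the criteria.
        return sum(self.criteria.values()) / len(self.criteria)


def summarize(axes):
    """Map each axis name to its aggregated score."""
    return {axis.name: axis.mean() for axis in axes}


# Made-up scores for a single hypothetical agent run.
report = summarize([
    AxisScore("answer_quality", {"completeness": 1.0, "grounding": 0.5}),
    AxisScore("planning", {
        "goal_decomposition": 0.5,
        "progress_tracking": 0.5,
        "efficiency": 0.5,
    }),
])
print(report)
```

Reporting each axis separately, rather than collapsing everything into one number, keeps it visible when an agent answers well but plans poorly.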
+---
+## Leaderboard
+We maintain a public leaderboard for ToolGym.
+➡️ **Leaderboard link**: **(add link here)**
+If you use our leaderboard results, please cite the corresponding paper/technical report (link above).
+---
+## License
+- Public repos in this organization are released under the **MIT** license unless a repo specifies otherwise.
+---
+## Contributing
+Community contributions are welcome:
+- Open a discussion: `/datasets/ToolGym/ToolGym/discussions`
+- Submit PRs to the relevant repo (dataset / code / leaderboard Space)
+---
+## Contact
+For questions, collaborations, or leaderboard submissions, please open an issue/discussion or contact the maintainers via the links above.