Update comparison table — BFCL, ToolBench, EnterpriseOps-Gym, tau-bench with links and richer descriptions
README.md CHANGED

@@ -34,6 +34,9 @@ tags:
 - openai
 - sandbox
 - docker
+- toolbench
+- swe-bench
+- bfcl
 pinned: true
 license: apache-2.0
 ---
@@ -44,7 +47,7 @@ A fully-functional HR simulation gym for training, evaluating, and benchmarking
 
 Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and powered by [SimLab](https://github.com/collinear-ai/simlab).
 
-
+Unlike single-API function-calling benchmarks like [BFCL](https://github.com/ShishirPatil/gorilla) or [ToolBench](https://github.com/OpenBMB/ToolBench), SimLab HR gives your agent a full workplace — HRMS, email, calendar, and team chat — and asks it to complete real multi-step HR workflows end-to-end.
 
 ## 4 MCP Tool Servers, 1 Environment
 
@@ -163,15 +166,18 @@ with env:
 
 ## How SimLab HR Compares
 
-
-
-
-
-
-
-
-
-
+Most tool-use benchmarks evaluate function calling in isolation — single API calls with predefined schemas. SimLab HR tests whether your agent can actually get work done across a real enterprise environment.
+
+| | SimLab HR | [BFCL](https://github.com/ShishirPatil/gorilla) | [ToolBench](https://github.com/OpenBMB/ToolBench) | [EnterpriseOps-Gym](https://github.com/ServiceNow/EnterpriseOps-Gym) | [tau-bench](https://github.com/sierra-research/tau2-bench) |
+|---|---|---|---|---|---|
+| What it tests | Multi-tool HR workflows end-to-end | Function call accuracy (single/parallel) | API discovery & chaining across 16k APIs | Enterprise planning across 8 domains | Customer service policy compliance |
+| Real backing services | ✅ Frappe HRMS, MailHog, CalDAV, RocketChat | ❌ Schema validation only | ❌ API simulation | ❌ Mock APIs | ❌ Simulated |
+| MCP tool servers | ✅ 4 servers | ❌ | ❌ | ❌ REST APIs | ❌ |
+| Multi-step workflows | ✅ 10+ steps, cross-system | ❌ Single/parallel calls | ✅ Multi-hop chains | ✅ Avg 9 steps | ✅ Multi-turn |
+| HR-specific | ✅ Dedicated | ❌ | ❌ | ✅ 1 of 8 domains | ❌ |
+| Automated evaluation | ✅ Rubric-based LLM judge | ✅ AST matching | ✅ Pass rate + win rate | ✅ Expert-curated | ✅ Policy checks |
+| RL / Gymnasium support | ✅ OpenEnv-compatible | ❌ | ❌ | ❌ | ❌ |
+| Task generation | ✅ API pipeline | ❌ | ❌ | ❌ | ❌ |
 
 ## More Environments
 
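For context on the `with env:` snippet that the last hunk header anchors to: the README drives the environment through a Python context manager and a reset/step loop. A minimal self-contained sketch of that pattern is below. Every name here (`ToyHREnv`, its methods, the observation shapes) is a hypothetical stand-in, not the actual SimLab HR or OpenEnv API.

```python
# Sketch of a context-managed environment loop, in the spirit of the
# README's `with env:` usage. All names are hypothetical stand-ins,
# NOT the real SimLab HR / OpenEnv API.

class ToyHREnv:
    """Stub environment: reset() yields a task, step() advances it."""

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Real environments would tear down backing services here.
        return False  # propagate any exception

    def reset(self):
        self.remaining = 3
        return {"task": "approve pending leave requests"}

    def step(self, action):
        self.remaining -= 1
        done = self.remaining == 0
        return {"ack": action}, 1.0, done


total = 0.0
with ToyHREnv() as env:
    obs = env.reset()
    done = False
    while not done:
        obs, reward, done = env.step("call_tool")
        total += reward

print(total)  # reward 1.0 per step over three steps
```

The context manager matters for an environment like this one because the tool servers (HRMS, mail, calendar, chat) are real processes that must be cleaned up even when an episode raises.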