renanserrano committed on
Commit 3d3bc40 · verified · 1 Parent(s): f03afc0

Update comparison table — BFCL, ToolBench, EnterpriseOps-Gym, tau-bench with links and richer descriptions

Files changed (1): README.md (+16 −10)
@@ -34,6 +34,9 @@ tags:
 - openai
 - sandbox
 - docker
+- toolbench
+- swe-bench
+- bfcl
 pinned: true
 license: apache-2.0
 ---
@@ -44,7 +47,7 @@ A fully-functional HR simulation gym for training, evaluating, and benchmarking
 
 Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and powered by [SimLab](https://github.com/collinear-ai/simlab).
 
-Your agent gets a task ("schedule a phone screen", "approve a leave request", "onboard a new hire") and a real workplace with an HRMS, email, calendar, and team chat. Can it get the job done?
+Unlike single-API function-calling benchmarks like [BFCL](https://github.com/ShishirPatil/gorilla) or [ToolBench](https://github.com/OpenBMB/ToolBench), SimLab HR gives your agent a full workplace HRMS, email, calendar, and team chat and asks it to complete real multi-step HR workflows end-to-end.
 
 ## 4 MCP Tool Servers, 1 Environment
 
@@ -163,15 +166,18 @@ with env:
 
 ## How SimLab HR Compares
 
-| | SimLab HR | EnterpriseOps-Gym | Toolathlon-GYM | tau-bench |
-|---|---|---|---|---|
-| HR-specific tasks | Dedicated | 1 of 8 domains | Partial (analytics data) | |
-| Real backing services | ✅ Frappe, MailHog, CalDAV, RocketChat | ❌ Mock APIs | ✅ PostgreSQL | ❌ Simulated |
-| MCP tool servers | 4 servers | REST APIs | 25 servers | |
-| Automated evaluation | ✅ Rubric judge | Expert-curated | Groundtruth scripts | Policy checks |
-| RL / Gymnasium support | ✅ OpenEnv | ❌ | ❌ | ❌ |
-| Custom MCP server plugging | ✅ via SimLab CLI | ❌ | | |
-| Task generation | ✅ API pipeline | ❌ | ❌ | ❌ |
+Most tool-use benchmarks evaluate function calling in isolation — single API calls with predefined schemas. SimLab HR tests whether your agent can actually get work done across a real enterprise environment.
+
+| | SimLab HR | [BFCL](https://github.com/ShishirPatil/gorilla) | [ToolBench](https://github.com/OpenBMB/ToolBench) | [EnterpriseOps-Gym](https://github.com/ServiceNow/EnterpriseOps-Gym) | [tau-bench](https://github.com/sierra-research/tau2-bench) |
+|---|---|---|---|---|---|
+| What it tests | Multi-tool HR workflows end-to-end | Function call accuracy (single/parallel) | API discovery & chaining across 16k APIs | Enterprise planning across 8 domains | Customer service policy compliance |
+| Real backing services | ✅ Frappe HRMS, MailHog, CalDAV, RocketChat | Schema validation only | API simulation | Mock APIs | ❌ Simulated |
+| MCP tool servers | ✅ 4 servers | ❌ | ❌ | ❌ REST APIs | ❌ |
+| Multi-step workflows | ✅ 10+ steps, cross-system | ❌ Single/parallel calls | Multi-hop chains | Avg 9 steps | ✅ Multi-turn |
+| HR-specific | ✅ Dedicated | ❌ | ❌ | ✅ 1 of 8 domains | ❌ |
+| Automated evaluation | ✅ Rubric-based LLM judge | ✅ AST matching | ✅ Pass rate + win rate | ✅ Expert-curated | ✅ Policy checks |
+| RL / Gymnasium support | ✅ OpenEnv-compatible | ❌ | ❌ | ❌ | ❌ |
+| Task generation | ✅ API pipeline | ❌ | ❌ | ❌ | ❌ |
 
 ## More Environments
 